For anyone starting out with machine learning, one of the first algorithms they learn is linear regression.

So what is linear regression?

Linear regression is a simple statistical model that allows users to find the value of a dependent variable, y, using an independent variable, x.  When you took pre-algebra, one of the concepts you learned was lines, especially how to graph them.  If you need a refresher, here’s the formula:

y = mx + b

Where m is the slope, b is the y-intercept, x is the input, and y is the value we’re trying to calculate.  If you took statistics, you may remember that m and b can be calculated using least-squares regression.  Here are the formulas to calculate m and b:

m = r\frac{s_y}{s_x}, b = \bar{y} - m\bar{x}

Where r is the correlation between x and y, s_x is the standard deviation of x, s_y is the standard deviation of y, \bar{x} is the mean of x, and \bar{y} is the mean of y.  It is also possible to extend the equation to support multiple independent variables:

y = \displaystyle\sum_{i=1}^{n} m_ix_i + b

In this case, each m_i is the parameter associated with the independent variable x_i.
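
To make the formulas concrete, here’s a minimal sketch of the single-variable case in Python (the helper name fit_line and the toy data are mine, purely for illustration):

    import numpy as np

    def fit_line(x, y):
        # Least-squares slope and intercept from the formulas above.
        r = np.corrcoef(x, y)[0, 1]        # correlation between x and y
        m = r * np.std(y) / np.std(x)      # m = r * s_y / s_x
        b = np.mean(y) - m * np.mean(x)    # b = y_bar - m * x_bar
        return m, b

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])   # roughly y = 2x + 1
    m, b = fit_line(x, y)
    print(f"y = {m:.2f}x + {b:.2f}")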

Relating this to Machine Learning

So, we’ve covered linear regression, but how does it relate to machine learning?  Linear regression, unsurprisingly, is a regression-based algorithm.  A regression-based algorithm is one that, given some feature(s), attempts to predict a value.

A few examples that use linear regression include:

  • Predicting the price of a house given features such as size, floors, rooms, and area.
  • A person using the Capital Asset Pricing Model (CAPM) to evaluate the risk of an asset.

An Example

Before the algorithm can predict values, you must first train it by providing data.  There are several ways to train the algorithm, each with benefits and drawbacks.  For this example, we’ll use least-squares regression.

For example, suppose we had the following dataset for a group of students, listing each student’s gender, age (in years and in months), height in inches, and weight in pounds:

gender  ageYear   ageMonth  HeightIn  WeightLB
f       11.91667  143       56.3      85
f       12.91667  155       62.3      105
f       12.75     153       63.3      108
f       13.41667  161       59        92
f       15.91667  191       62.5      112.5
f       14.25     171       62.5      112
f       15.41667  185       59        104
f       11.83333  142       56.5      69
f       13.33333  160       62        94.5
f       11.66667  140       53.8      68.5
f       11.58333  139       61.5      104
f       14.83333  178       61.5      103.5
f       13.08333  157       64.5      123.5
f       12.41667  149       58.3      93
f       11.91667  143       51.3      50.5
f       12.08333  145       58.8      89
f       15.91667  191       65.3      107
f       12.5      150       59.5      78.5
f       12.25     147       61.3      115
f       15        180       63.3      114
f       11.75     141       61.8      85
f       11.66667  140       53.5      81
f       13.66667  164       58        83.5
f       14.66667  176       61.3      112
f       15.41667  185       63.3      101
f       13.83333  166       61.5      103.5
f       14.58333  175       60.8      93.5
f       15        180       59        112
f       17.5      210       65.5      140
f       12.16667  146       56.3      83.5
f       14.16667  170       64.3      90
f       13.5      162       58        84
f       12.41667  149       64.3      110.5
f       11.58333  139       57.5      96
f       15.5      186       57.8      95
f       16.41667  197       61.5      121
f       14.08333  169       62.3      99.5
f       14.75     177       61.8      142.5
f       15.41667  185       65.3      118
f       15.16667  182       58.3      104.5
f       14.41667  173       62.8      102.5
f       13.83333  166       59.3      89.5
f       14        168       61.5      95
f       14.08333  169       62        98.5
f       12.5      150       61.3      94
f       15.33333  184       62.3      108
f       11.58333  139       52.8      63.5
f       12.25     147       59.8      84.5
f       12        144       59.5      93.5
f       14.75     177       61.3      112
f       14.83333  178       63.5      148.5
f       16.41667  197       64.8      112
f       12.16667  146       60        109
f       12.08333  145       59        91.5
f       12.25     147       55.8      75
f       12.08333  145       57.8      84
f       12.91667  155       61.3      107
f       13.91667  167       62.3      92.5
f       15.25     183       64.3      109.5
f       11.91667  143       55.5      84
f       15.25     183       64.5      102.5
f       15.41667  185       60        106
f       12.33333  148       56.3      77
f       12.25     147       58.3      111.5
f       12.83333  154       60        114
f       13        156       54.5      75
f       12        144       55.8      73.5
f       12.83333  154       62.8      93.5
f       12.66667  152       60.5      105
f       15.91667  191       63.3      113.5
f       15.83333  190       66.8      140
f       11.66667  140       60        77
f       12.33333  148       60.5      84.5
f       15.75     189       64.3      113.5
f       11.91667  143       58.3      77.5
f       14.83333  178       66.5      117.5
f       13.66667  164       65.3      98
f       13.08333  157       60.5      112
f       12.25     147       59.5      101
f       12.33333  148       59        95
f       14.75     177       61.3      81
f       14.25     171       61.5      91
f       14.33333  172       64.8      142
f       15.83333  190       56.8      98.5
f       15.25     183       66.5      112
f       11.91667  143       61.5      116.5
f       14.91667  179       63        98.5
f       15.5      186       57        83.5
f       15.16667  182       65.5      133
f       15.16667  182       62        91.5
f       11.83333  142       56        72.5
f       13.75     165       61.3      106.5
f       13.75     165       55.5      67
f       12.83333  154       61        122.5
f       12.5      150       54.5      74
f       12.91667  155       66        144.5
f       13.58333  163       56.5      84
f       11.75     141       56        72.5
f       12.25     147       51.5      64
f       17.5      210       62        116
f       14.25     171       63        84
f       13.91667  167       61        93.5
f       15.16667  182       64        111.5
f       12        144       61        92
f       16.08333  193       59.8      115
f       11.75     141       61.3      85
f       13.66667  164       63.3      108
f       15.5      186       63.5      108
f       14.08333  169       61.5      85
f       14.58333  175       60.3      86
f       15        180       61.3      110.5
m       13.75     165       64.8      98
m       13.08333  157       60.5      105
m       12        144       57.3      76.5
m       12.5      150       59.5      84
m       12.5      150       60.8      128
m       11.58333  139       60.5      87
m       15.75     189       67        128
m       15.25     183       64.8      111
m       12.25     147       50.5      79
m       12.16667  146       57.5      90
m       13.33333  160       60.5      84
m       13        156       61.8      112
m       14.41667  173       61.3      93
m       12.58333  151       66.3      117
m       11.75     141       53.3      84
m       12.5      150       59        99.5
m       13.66667  164       57.8      95
m       12.75     153       60        84
m       17.16667  206       68.3      134
m       20.83333  250       67.5      171.5
m       14.66667  176       63.8      98.5
m       14.66667  176       65        118.5
m       11.66667  140       59.5      94.5
m       15.41667  185       66        105
m       15        180       61.8      104
m       12.16667  146       57.3      83
m       15.25     183       66        105.5
m       11.66667  140       56.5      84
m       12.58333  151       58.3      86
m       12.58333  151       61        81
m       12        144       62.8      94
m       13.33333  160       59.3      78.5
m       14.83333  178       67.3      119.5
m       16.08333  193       66.3      133
m       13.5      162       64.5      119
m       13.66667  164       60.5      95
m       15.5      186       66        112
m       11.91667  143       57.5      75
m       14.58333  175       64        92
m       14.58333  175       68        112
m       14.58333  175       63.5      98.5
m       14.41667  173       69        112.5
m       14.16667  170       63.8      112.5
m       14.5      174       66        108
m       13.66667  164       63.5      108
m       12        144       59.5      88
m       13        156       66.3      106
m       12.41667  149       57        92
m       12        144       60        117.5
m       12.25     147       57        84
m       15.66667  188       67.3      112
m       14.08333  169       62        100
m       14.33333  172       65        112
m       12.5      150       59.5      84
m       16.08333  193       67.8      127.5
m       13.08333  157       58        80.5
m       14        168       60        93.5
m       11.66667  140       58.5      86.5
m       13        156       58.3      92.5
m       13        156       61.5      108.5
m       13.16667  158       65        121
m       15.33333  184       66.5      112
m       13        156       68.5      114
m       12        144       57        84
m       14.66667  176       61.5      81
m       14        168       66.5      111.5
m       12.41667  149       52.5      81
m       11.83333  142       55        70
m       15.66667  188       71        140
m       16.91667  203       66.5      117
m       11.83333  142       58.8      84
m       15.75     189       66.3      112
m       15.66667  188       65.8      150.5
m       16.66667  200       71        147
m       12.66667  152       59.5      105
m       14.5      174       69.8      119.5
m       13.83333  166       62.5      84
m       12.08333  145       56.5      91
m       11.91667  143       57.5      101
m       13.58333  163       65.3      117.5
m       13.83333  166       67.3      121
m       15.16667  182       67        133
m       14.41667  173       66        112
m       12.91667  155       61.8      91.5
m       13.5      162       60        105
m       14.75     177       63        111
m       14.75     177       60.5      112
m       14.58333  175       65.5      114
m       13.83333  166       62        91
m       12.5      150       59        98
m       12.5      150       61.8      118
m       15.66667  188       63.3      115.5
m       13.58333  163       66        112
m       14.25     171       61.8      112
m       13.5      162       63        91
m       11.75     141       57.5      85
m       14.5      174       63        112
m       11.83333  142       56        87.5
m       12.33333  148       60.5      118
m       11.66667  140       56.8      83.5
m       13.33333  160       64        116
m       12        144       60        89
m       17.16667  206       69.5      171.5
m       13.25     159       63.3      112
m       12.41667  149       56.3      72
m       16.08333  193       72        150
m       16.16667  194       65.3      134.5
m       12.66667  152       60.8      97
m       12.16667  146       55        71.5
m       11.58333  139       55        73.5
m       15.5      186       66.5      112
m       13.41667  161       56.8      75
m       12.75     153       64.8      128
m       16.33333  196       64.5      98
m       13.66667  164       58        84
m       13.25     159       62.8      99
m       14.83333  178       63.8      112
m       12.75     153       57.8      79.5
m       12.91667  155       57.3      80.5
m       14.83333  178       63.5      102.5
m       11.83333  142       55        76
m       13.66667  164       66.5      112
m       15.75     189       65        114
m       13.66667  164       61.5      140
m       13.91667  167       62        107.5
m       12.58333  151       59.3      87

We can train our algorithm to predict a person’s height (in inches) given their weight (in pounds).  After training, our algorithm returned the following model for our dataset:

height = 0.157 * weight + 45.433

Now, suppose we want to know the height of a person who weighs 150 pounds.  Plugging into the model, 0.157 * 150 + 45.433 ≈ 68.98, so the model predicts someone who is about 69 inches tall (or 5′ 9″).
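
If you’d rather not work out the math by hand, here’s a rough sketch of the same fit using scikit-learn (my choice of library, purely for illustration; only a handful of rows are inlined, so the printed coefficients will only approximate the model above):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # A few (weight, height) pairs pulled from the table above.  Training on
    # the full dataset is what yields height = 0.157 * weight + 45.433.
    weights = np.array([[85], [105], [108], [92], [112.5], [128], [134], [171.5]])
    heights = np.array([56.3, 62.3, 63.3, 59, 62.5, 60.8, 68.3, 67.5])

    model = LinearRegression().fit(weights, heights)
    print(model.coef_[0], model.intercept_)   # fitted slope and intercept
    print(model.predict([[150]]))             # predicted height for 150 lb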

Pros and Cons

Linear regression has the following advantages:

  • Easy to understand.
  • Easy to code up.
  • Works very well when there is a high correlation between features and the output.

However, there are several issues that linear regression suffers from:

  • The model is prone to high bias (meaning it tends to underfit the dataset).
  • As its name suggests, it cannot handle non-linear relationships very well.  However, it’s possible to apply mathematical functions (logs, square roots, trigonometric functions, etc.) to the features to deal with some non-linear relationships, as sketched after this list.
  • The model doesn’t work with non-numerical data unless that data is encoded numerically first.
  • The model is also prone to the effects of outliers, which can dramatically skew the fitted line.
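
To illustrate the transformation trick mentioned above, here’s a small sketch (the data is synthetic, made up for this example): fitting against log(x) instead of x recovers a linear relationship.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic data with a logarithmic relationship: y is roughly 3*log(x) + 2.
    rng = np.random.default_rng(0)
    x = np.linspace(1, 100, 50)
    y = 3 * np.log(x) + 2 + rng.normal(scale=0.1, size=x.size)

    # Fitting on log(x) instead of x turns the problem back into a linear one.
    model = LinearRegression().fit(np.log(x).reshape(-1, 1), y)
    print(model.coef_[0], model.intercept_)   # close to 3 and 2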

Conclusion

Linear regression allows users to model simple relationships between features and an output.  The approach works best when there is a high correlation between the features and the output.  However, it doesn’t work very well for non-linear relationships or for datasets with significant outliers.  Thus, linear regression is best used when someone wants to predict a continuous value from features that relate to it roughly linearly.

Please note, a common misconception is that multiple linear regression returns multiple output values from one or more features.  The method that actually does that is called multivariate linear regression; multiple linear regression uses multiple features to predict a single value.  It honestly took me a while to distinguish the two methods.  I found this link to help explain the difference.
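
One way to see the distinction in code (again using scikit-learn purely as an illustration) is the shape of the target array:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.random((100, 3))   # three features per sample

    # Multiple linear regression: several features -> one output value.
    y_single = rng.random(100)
    print(LinearRegression().fit(X, y_single).predict(X[:1]).shape)   # (1,)

    # Multivariate linear regression: several features -> several output values.
    y_multi = rng.random((100, 2))
    print(LinearRegression().fit(X, y_multi).predict(X[:1]).shape)    # (1, 2)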

Have any questions or comments, or spotted an error?  Leave a comment down below.