For anyone starting out in machine learning, one of the first algorithms they learn is linear regression.
So what is linear regression?
Linear regression is a simple statistical model that allows users to find the value of a dependent variable, y, using an independent variable, x. When you took pre-algebra, one of the concepts you learned was lines, especially how to graph them. If you need a refresher, here's the formula:

$$y = mx + b$$
Where m is the slope, b is the y-intercept, x is the input, and y is the value we're trying to calculate. If you took statistics, you may recall that it's possible to calculate m and b using least-squares regression. Here are the formulas:

$$m = r \frac{s_y}{s_x}, \qquad b = \bar{y} - m\bar{x}$$

Where $r$ is the correlation between x and y, $s_x$ is the standard deviation of x, $s_y$ is the standard deviation of y, $\bar{x}$ is the mean of x, and $\bar{y}$ is the mean of y. It is also possible to extend the equation to support multiple independent variables:

$$y = b + m_1 x_1 + m_2 x_2 + \dots + m_n x_n$$

In this case, each $m_i$ is a parameter associated with the corresponding independent variable $x_i$.
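To make the formulas concrete, here's a minimal sketch in Python (using NumPy; the helper name `fit_line` is my own, not from a library) that computes m and b exactly as written above:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit via m = r * s_y / s_x and b = y_bar - m * x_bar."""
    r = np.corrcoef(x, y)[0, 1]      # correlation between x and y
    m = r * (y.std() / x.std())      # slope: correlation scaled by the std-dev ratio
    b = y.mean() - m * x.mean()      # intercept: line passes through the point of means
    return m, b

# Toy usage: points scattered around y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])
m, b = fit_line(x, y)
print(m, b)  # roughly 2 and 1
```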
Relating this to Machine Learning
So, how does linear regression relate to machine learning? Linear regression, unsurprisingly, is a regression-based algorithm: an algorithm that, given some feature(s), attempts to predict a value.
A few examples that use linear regression include:
- Predicting the price of a house given features such as size, floors, rooms, and area.
- Evaluating the risk of an asset using the Capital Asset Pricing Model (CAPM).
An Example
Before the algorithm can predict values, you must first train it by providing data. There are several ways to train the algorithm, each with benefits and drawbacks. For this example, we'll use least-squares regression.
Suppose we had the following dataset for a group of students (age in years and in months, height in inches, weight in pounds):
gender | ageYear | ageMonth | HeightIn | WeightLB |
---|---|---|---|---|
f | 11.91667 | 143 | 56.3 | 85 |
f | 12.91667 | 155 | 62.3 | 105 |
f | 12.75 | 153 | 63.3 | 108 |
f | 13.41667 | 161 | 59 | 92 |
f | 15.91667 | 191 | 62.5 | 112.5 |
f | 14.25 | 171 | 62.5 | 112 |
f | 15.41667 | 185 | 59 | 104 |
f | 11.83333 | 142 | 56.5 | 69 |
f | 13.33333 | 160 | 62 | 94.5 |
f | 11.66667 | 140 | 53.8 | 68.5 |
f | 11.58333 | 139 | 61.5 | 104 |
f | 14.83333 | 178 | 61.5 | 103.5 |
f | 13.08333 | 157 | 64.5 | 123.5 |
f | 12.41667 | 149 | 58.3 | 93 |
f | 11.91667 | 143 | 51.3 | 50.5 |
f | 12.08333 | 145 | 58.8 | 89 |
f | 15.91667 | 191 | 65.3 | 107 |
f | 12.5 | 150 | 59.5 | 78.5 |
f | 12.25 | 147 | 61.3 | 115 |
f | 15 | 180 | 63.3 | 114 |
f | 11.75 | 141 | 61.8 | 85 |
f | 11.66667 | 140 | 53.5 | 81 |
f | 13.66667 | 164 | 58 | 83.5 |
f | 14.66667 | 176 | 61.3 | 112 |
f | 15.41667 | 185 | 63.3 | 101 |
f | 13.83333 | 166 | 61.5 | 103.5 |
f | 14.58333 | 175 | 60.8 | 93.5 |
f | 15 | 180 | 59 | 112 |
f | 17.5 | 210 | 65.5 | 140 |
f | 12.16667 | 146 | 56.3 | 83.5 |
f | 14.16667 | 170 | 64.3 | 90 |
f | 13.5 | 162 | 58 | 84 |
f | 12.41667 | 149 | 64.3 | 110.5 |
f | 11.58333 | 139 | 57.5 | 96 |
f | 15.5 | 186 | 57.8 | 95 |
f | 16.41667 | 197 | 61.5 | 121 |
f | 14.08333 | 169 | 62.3 | 99.5 |
f | 14.75 | 177 | 61.8 | 142.5 |
f | 15.41667 | 185 | 65.3 | 118 |
f | 15.16667 | 182 | 58.3 | 104.5 |
f | 14.41667 | 173 | 62.8 | 102.5 |
f | 13.83333 | 166 | 59.3 | 89.5 |
f | 14 | 168 | 61.5 | 95 |
f | 14.08333 | 169 | 62 | 98.5 |
f | 12.5 | 150 | 61.3 | 94 |
f | 15.33333 | 184 | 62.3 | 108 |
f | 11.58333 | 139 | 52.8 | 63.5 |
f | 12.25 | 147 | 59.8 | 84.5 |
f | 12 | 144 | 59.5 | 93.5 |
f | 14.75 | 177 | 61.3 | 112 |
f | 14.83333 | 178 | 63.5 | 148.5 |
f | 16.41667 | 197 | 64.8 | 112 |
f | 12.16667 | 146 | 60 | 109 |
f | 12.08333 | 145 | 59 | 91.5 |
f | 12.25 | 147 | 55.8 | 75 |
f | 12.08333 | 145 | 57.8 | 84 |
f | 12.91667 | 155 | 61.3 | 107 |
f | 13.91667 | 167 | 62.3 | 92.5 |
f | 15.25 | 183 | 64.3 | 109.5 |
f | 11.91667 | 143 | 55.5 | 84 |
f | 15.25 | 183 | 64.5 | 102.5 |
f | 15.41667 | 185 | 60 | 106 |
f | 12.33333 | 148 | 56.3 | 77 |
f | 12.25 | 147 | 58.3 | 111.5 |
f | 12.83333 | 154 | 60 | 114 |
f | 13 | 156 | 54.5 | 75 |
f | 12 | 144 | 55.8 | 73.5 |
f | 12.83333 | 154 | 62.8 | 93.5 |
f | 12.66667 | 152 | 60.5 | 105 |
f | 15.91667 | 191 | 63.3 | 113.5 |
f | 15.83333 | 190 | 66.8 | 140 |
f | 11.66667 | 140 | 60 | 77 |
f | 12.33333 | 148 | 60.5 | 84.5 |
f | 15.75 | 189 | 64.3 | 113.5 |
f | 11.91667 | 143 | 58.3 | 77.5 |
f | 14.83333 | 178 | 66.5 | 117.5 |
f | 13.66667 | 164 | 65.3 | 98 |
f | 13.08333 | 157 | 60.5 | 112 |
f | 12.25 | 147 | 59.5 | 101 |
f | 12.33333 | 148 | 59 | 95 |
f | 14.75 | 177 | 61.3 | 81 |
f | 14.25 | 171 | 61.5 | 91 |
f | 14.33333 | 172 | 64.8 | 142 |
f | 15.83333 | 190 | 56.8 | 98.5 |
f | 15.25 | 183 | 66.5 | 112 |
f | 11.91667 | 143 | 61.5 | 116.5 |
f | 14.91667 | 179 | 63 | 98.5 |
f | 15.5 | 186 | 57 | 83.5 |
f | 15.16667 | 182 | 65.5 | 133 |
f | 15.16667 | 182 | 62 | 91.5 |
f | 11.83333 | 142 | 56 | 72.5 |
f | 13.75 | 165 | 61.3 | 106.5 |
f | 13.75 | 165 | 55.5 | 67 |
f | 12.83333 | 154 | 61 | 122.5 |
f | 12.5 | 150 | 54.5 | 74 |
f | 12.91667 | 155 | 66 | 144.5 |
f | 13.58333 | 163 | 56.5 | 84 |
f | 11.75 | 141 | 56 | 72.5 |
f | 12.25 | 147 | 51.5 | 64 |
f | 17.5 | 210 | 62 | 116 |
f | 14.25 | 171 | 63 | 84 |
f | 13.91667 | 167 | 61 | 93.5 |
f | 15.16667 | 182 | 64 | 111.5 |
f | 12 | 144 | 61 | 92 |
f | 16.08333 | 193 | 59.8 | 115 |
f | 11.75 | 141 | 61.3 | 85 |
f | 13.66667 | 164 | 63.3 | 108 |
f | 15.5 | 186 | 63.5 | 108 |
f | 14.08333 | 169 | 61.5 | 85 |
f | 14.58333 | 175 | 60.3 | 86 |
f | 15 | 180 | 61.3 | 110.5 |
m | 13.75 | 165 | 64.8 | 98 |
m | 13.08333 | 157 | 60.5 | 105 |
m | 12 | 144 | 57.3 | 76.5 |
m | 12.5 | 150 | 59.5 | 84 |
m | 12.5 | 150 | 60.8 | 128 |
m | 11.58333 | 139 | 60.5 | 87 |
m | 15.75 | 189 | 67 | 128 |
m | 15.25 | 183 | 64.8 | 111 |
m | 12.25 | 147 | 50.5 | 79 |
m | 12.16667 | 146 | 57.5 | 90 |
m | 13.33333 | 160 | 60.5 | 84 |
m | 13 | 156 | 61.8 | 112 |
m | 14.41667 | 173 | 61.3 | 93 |
m | 12.58333 | 151 | 66.3 | 117 |
m | 11.75 | 141 | 53.3 | 84 |
m | 12.5 | 150 | 59 | 99.5 |
m | 13.66667 | 164 | 57.8 | 95 |
m | 12.75 | 153 | 60 | 84 |
m | 17.16667 | 206 | 68.3 | 134 |
m | 20.83333 | 250 | 67.5 | 171.5 |
m | 14.66667 | 176 | 63.8 | 98.5 |
m | 14.66667 | 176 | 65 | 118.5 |
m | 11.66667 | 140 | 59.5 | 94.5 |
m | 15.41667 | 185 | 66 | 105 |
m | 15 | 180 | 61.8 | 104 |
m | 12.16667 | 146 | 57.3 | 83 |
m | 15.25 | 183 | 66 | 105.5 |
m | 11.66667 | 140 | 56.5 | 84 |
m | 12.58333 | 151 | 58.3 | 86 |
m | 12.58333 | 151 | 61 | 81 |
m | 12 | 144 | 62.8 | 94 |
m | 13.33333 | 160 | 59.3 | 78.5 |
m | 14.83333 | 178 | 67.3 | 119.5 |
m | 16.08333 | 193 | 66.3 | 133 |
m | 13.5 | 162 | 64.5 | 119 |
m | 13.66667 | 164 | 60.5 | 95 |
m | 15.5 | 186 | 66 | 112 |
m | 11.91667 | 143 | 57.5 | 75 |
m | 14.58333 | 175 | 64 | 92 |
m | 14.58333 | 175 | 68 | 112 |
m | 14.58333 | 175 | 63.5 | 98.5 |
m | 14.41667 | 173 | 69 | 112.5 |
m | 14.16667 | 170 | 63.8 | 112.5 |
m | 14.5 | 174 | 66 | 108 |
m | 13.66667 | 164 | 63.5 | 108 |
m | 12 | 144 | 59.5 | 88 |
m | 13 | 156 | 66.3 | 106 |
m | 12.41667 | 149 | 57 | 92 |
m | 12 | 144 | 60 | 117.5 |
m | 12.25 | 147 | 57 | 84 |
m | 15.66667 | 188 | 67.3 | 112 |
m | 14.08333 | 169 | 62 | 100 |
m | 14.33333 | 172 | 65 | 112 |
m | 12.5 | 150 | 59.5 | 84 |
m | 16.08333 | 193 | 67.8 | 127.5 |
m | 13.08333 | 157 | 58 | 80.5 |
m | 14 | 168 | 60 | 93.5 |
m | 11.66667 | 140 | 58.5 | 86.5 |
m | 13 | 156 | 58.3 | 92.5 |
m | 13 | 156 | 61.5 | 108.5 |
m | 13.16667 | 158 | 65 | 121 |
m | 15.33333 | 184 | 66.5 | 112 |
m | 13 | 156 | 68.5 | 114 |
m | 12 | 144 | 57 | 84 |
m | 14.66667 | 176 | 61.5 | 81 |
m | 14 | 168 | 66.5 | 111.5 |
m | 12.41667 | 149 | 52.5 | 81 |
m | 11.83333 | 142 | 55 | 70 |
m | 15.66667 | 188 | 71 | 140 |
m | 16.91667 | 203 | 66.5 | 117 |
m | 11.83333 | 142 | 58.8 | 84 |
m | 15.75 | 189 | 66.3 | 112 |
m | 15.66667 | 188 | 65.8 | 150.5 |
m | 16.66667 | 200 | 71 | 147 |
m | 12.66667 | 152 | 59.5 | 105 |
m | 14.5 | 174 | 69.8 | 119.5 |
m | 13.83333 | 166 | 62.5 | 84 |
m | 12.08333 | 145 | 56.5 | 91 |
m | 11.91667 | 143 | 57.5 | 101 |
m | 13.58333 | 163 | 65.3 | 117.5 |
m | 13.83333 | 166 | 67.3 | 121 |
m | 15.16667 | 182 | 67 | 133 |
m | 14.41667 | 173 | 66 | 112 |
m | 12.91667 | 155 | 61.8 | 91.5 |
m | 13.5 | 162 | 60 | 105 |
m | 14.75 | 177 | 63 | 111 |
m | 14.75 | 177 | 60.5 | 112 |
m | 14.58333 | 175 | 65.5 | 114 |
m | 13.83333 | 166 | 62 | 91 |
m | 12.5 | 150 | 59 | 98 |
m | 12.5 | 150 | 61.8 | 118 |
m | 15.66667 | 188 | 63.3 | 115.5 |
m | 13.58333 | 163 | 66 | 112 |
m | 14.25 | 171 | 61.8 | 112 |
m | 13.5 | 162 | 63 | 91 |
m | 11.75 | 141 | 57.5 | 85 |
m | 14.5 | 174 | 63 | 112 |
m | 11.83333 | 142 | 56 | 87.5 |
m | 12.33333 | 148 | 60.5 | 118 |
m | 11.66667 | 140 | 56.8 | 83.5 |
m | 13.33333 | 160 | 64 | 116 |
m | 12 | 144 | 60 | 89 |
m | 17.16667 | 206 | 69.5 | 171.5 |
m | 13.25 | 159 | 63.3 | 112 |
m | 12.41667 | 149 | 56.3 | 72 |
m | 16.08333 | 193 | 72 | 150 |
m | 16.16667 | 194 | 65.3 | 134.5 |
m | 12.66667 | 152 | 60.8 | 97 |
m | 12.16667 | 146 | 55 | 71.5 |
m | 11.58333 | 139 | 55 | 73.5 |
m | 15.5 | 186 | 66.5 | 112 |
m | 13.41667 | 161 | 56.8 | 75 |
m | 12.75 | 153 | 64.8 | 128 |
m | 16.33333 | 196 | 64.5 | 98 |
m | 13.66667 | 164 | 58 | 84 |
m | 13.25 | 159 | 62.8 | 99 |
m | 14.83333 | 178 | 63.8 | 112 |
m | 12.75 | 153 | 57.8 | 79.5 |
m | 12.91667 | 155 | 57.3 | 80.5 |
m | 14.83333 | 178 | 63.5 | 102.5 |
m | 11.83333 | 142 | 55 | 76 |
m | 13.66667 | 164 | 66.5 | 112 |
m | 15.75 | 189 | 65 | 114 |
m | 13.66667 | 164 | 61.5 | 140 |
m | 13.91667 | 167 | 62 | 107.5 |
m | 12.58333 | 151 | 59.3 | 87 |
We can train our algorithm to predict a student's height given their weight. Training on this dataset with least-squares regression produces a fitted line of the form $\text{HeightIn} = m \cdot \text{WeightLB} + b$.
Now, suppose we want to know the height of a person who weighs 150 pounds. Based on the model, we would predict a height of about 69 inches (or 5′ 9″).
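Here's a hedged sketch of that training-and-prediction step with pandas and NumPy; the column names match the table above, but the CSV filename is a hypothetical stand-in for wherever you've saved the data:

```python
import numpy as np
import pandas as pd

# Load the student dataset; "students.csv" is a hypothetical filename.
df = pd.read_csv("students.csv")
x = df["WeightLB"].to_numpy()   # feature: weight in pounds
y = df["HeightIn"].to_numpy()   # target: height in inches

# A degree-1 polynomial fit is ordinary least squares:
# np.polyfit returns the slope first, then the intercept.
m, b = np.polyfit(x, y, 1)

# Predict the height of someone who weighs 150 pounds.
print(m * 150 + b)
```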
Pros and Cons
Linear regression has the following advantages:
- Easy to understand.
- Easy to code up.
- Works very well when there is a strong linear correlation between the features and the output.
However, linear regression suffers from several issues:
- The model is prone to high bias (meaning it tends to underfit the dataset).
- As the name suggests, it cannot handle non-linear relationships very well. However, it's possible to apply mathematical functions (logs, square roots, trigonometric functions, etc.) to the features to deal with some non-linear relationships; see the sketch after this list.
- The model doesn’t work with non-numerical data.
- The model is also prone to the effects of outliers, which can dramatically affect the output.
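To illustrate the transformation trick from the second point, here's a minimal sketch on synthetic data (the data and coefficients are made up for illustration): the relationship is logarithmic, so we fit a straight line against log(x) instead of x:

```python
import numpy as np

# Synthetic data: y grows logarithmically with x (y = 3*log(x) + 1, plus noise).
rng = np.random.default_rng(0)
x = np.linspace(1, 100, 200)
y = 3.0 * np.log(x) + 1.0 + rng.normal(scale=0.2, size=x.size)

# Fit a line in the transformed space: y = m * log(x) + b.
m, b = np.polyfit(np.log(x), y, 1)
print(m, b)  # should recover roughly 3.0 and 1.0
```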
Conclusion
Linear regression allows users to model simple relationships between features and an output. The approach works best when there is a strong linear correlation between the features and the output. However, it doesn't work very well for non-linear relationships or when there are outliers. Thus, linear regression is best used when someone wants to predict a continuous value from features that relate to it roughly linearly.
Please note: a common misconception is that multiple linear regression predicts multiple values from one or more features. The method that actually does that is called multivariate linear regression; multiple linear regression uses multiple features to predict a single value. It honestly took me a while to distinguish the two methods, and I found this link helpful in explaining the difference.
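To make the distinction concrete, here's a minimal sketch of multiple linear regression on made-up data: two features (height and age) predicting one value (weight), solved with ordinary least squares:

```python
import numpy as np

# Synthetic data: weight = 2.5*height + 1.5*age - 70, plus noise.
rng = np.random.default_rng(0)
height = rng.uniform(50, 72, size=100)   # heights in inches
age = rng.uniform(11, 17, size=100)      # ages in years
weight = 2.5 * height + 1.5 * age - 70 + rng.normal(scale=5, size=100)

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([height, age, np.ones_like(height)])
coeffs, *_ = np.linalg.lstsq(X, weight, rcond=None)
m1, m2, b = coeffs
print(m1, m2, b)  # should be close to 2.5, 1.5, and -70
```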
Have any questions or comments, or spotted an error? Leave a comment down below.