When working with machine learning algorithms, you usually have to train a model before it can make predictions on new data.  One of the most popular methods used for training is called gradient descent.

What is Gradient Descent?

Gradient descent is a mathematical optimization algorithm that attempts to find a minimum of a function.  I liken the operation to trying to find the lowest point in a mountainous area by always walking downhill.

Gradient descent takes on the following form:

\theta_j := \theta_j - \alpha\frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\ldots,\theta_n), \quad j = 0, 1, \ldots, n

Here, \frac{\partial}{\partial \theta_j} is the partial derivative with respect to \theta_j, \alpha is the learning rate, and J(\theta_0,\theta_1,\ldots,\theta_n) is the cost function.  Every \theta_j is updated simultaneously on each iteration.  We define the cost function as:

J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{2m}\displaystyle\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

Here, h_\theta(x^{(i)}) is the hypothesis (the model’s prediction) for the i-th training example and y^{(i)} is its actual value; for linear regression, h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n.  In fact, the cost function looks very similar to the error measure of a least-squares regression line.  When coding up gradient descent from scratch, we use the partial derivatives of the cost function, which work out to:

\frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{m}\displaystyle\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \quad x_0^{(i)} = 1

When j = 0 (the bias term), x_0^{(i)} = 1, so the gradient is simply the average of the errors with no feature multiplied in.
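
To make the update rule concrete, here’s a minimal sketch of a single batch update for one-variable linear regression.  The data, starting values, and learning rate are made up for illustration; this is not the full implementation shown at the end of this post.

# One batch gradient descent step for h(x) = theta0 + theta1 * x.
# The data and hyperparameters below are illustrative only.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (x, y) pairs
theta0, theta1 = 0.0, 0.0
alpha = 0.1
m = len(data)

# Cost J(theta) = (1 / 2m) * sum((h(x) - y)^2)
cost = sum((theta0 + theta1 * x - y) ** 2 for x, y in data) / (2 * m)

# Partial derivatives: average error (bias term) and average error * x (slope)
grad0 = sum((theta0 + theta1 * x - y) for x, y in data) / m
grad1 = sum((theta0 + theta1 * x - y) * x for x, y in data) / m

# Simultaneous update of both parameters
theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
print(cost, theta0, theta1)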

Using Gradient Descent

When training your model with gradient descent, you must consider a few things:

  • What’s \alpha going to be?  If the learning rate is too small, training will take a long time to converge.  If it’s too large, the updates can overshoot the minimum and the algorithm may never converge.
  • How many iterations do you want to train for?  More iterations take longer to complete, but the parameters usually end up closer to the minimum.
  • What kind of gradient descent are you going to use?  Yes, there are a few ways to perform gradient descent, but they differ in how many training examples you use per iteration (see the sketch after this list).  Here are the most common:
    • Batch – You use all of the training examples in each iteration.  Useful for small datasets, but it becomes very slow as the dataset gets larger.
    • Stochastic – You use one training example per iteration.  Before running gradient descent, randomly shuffle all of the training examples, then use one at a time.  While there is no guarantee that you’ll get an optimal solution, it’s useful when dealing with really big datasets.
    • Mini-batch – A combination of the above: split the training examples into small chunks and use one chunk per iteration.
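
To show what these variants mean in code, here’s a minimal sketch of how each scheme would carve up a made-up dataset into the examples used by a single iteration.  The data and chunk size are purely illustrative.

import random

# Made-up training data; each tuple is (x, y).
data = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (6, 12)]
random.shuffle(data)  # shuffle once up front, as described for stochastic above

batch = [data]                              # batch: every iteration uses everything
stochastic = [[point] for point in data]    # stochastic: one example per iteration
chunk = 2                                   # illustrative mini-batch size
mini_batches = [data[i:i + chunk] for i in range(0, len(data), chunk)]

# Each inner list is what a single iteration of that variant would see.
print(len(batch), len(stochastic), len(mini_batches))   # 1, 6, 3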

Pros and Cons

Unlike many training approaches, the idea of gradient descent is very easy to understand.  In addition, gradient descent scales well when there are many features.  On a theoretical level, gradient descent is also well understood.  This matters because many of the techniques used in machine learning are justified empirically rather than theoretically, and when that happens it becomes hard to explain why a technique works.  This page can give a better idea of the cultural difference between the two approaches.

However, depending on the learning rate, gradient descent can take a long time to converge.  And even when gradient descent does converge, the minimum it settles into might not be the optimal (global) one.
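
As a quick illustration of that last point (not from the original post), here’s plain gradient descent on a simple non-convex function of one variable.  With the same learning rate, different starting points end up in different minima.

def grad_descent_1d(x, learn=0.01, iters=2000):
    # Minimize f(x) = x**4 - 2*x**2 + 0.3*x, a non-convex function with
    # two minima; its derivative is f'(x) = 4*x**3 - 4*x + 0.3.
    for _ in range(iters):
        x -= learn * (4 * x**3 - 4 * x + 0.3)
    return x

# Same algorithm, same learning rate, different starting points:
print(grad_descent_1d(-1.5))  # reaches the global minimum (around x = -1.04)
print(grad_descent_1d(1.5))   # gets stuck in the local minimum (around x = 0.96)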

For example, suppose I used gradient descent with linear regression to train on the following dataset (taken from the Linear Regression post):

gender, ageYear, ageMonth, HeightIn, WeightLB
f, 11.91667, 143, 56.3, 85
f, 12.91667, 155, 62.3, 105
f, 12.75, 153, 63.3, 108
f, 13.41667, 161, 59, 92
f, 15.91667, 191, 62.5, 112.5
f, 14.25, 171, 62.5, 112
f, 15.41667, 185, 59, 104
f, 11.83333, 142, 56.5, 69
f, 13.33333, 160, 62, 94.5
f, 11.66667, 140, 53.8, 68.5
f, 11.58333, 139, 61.5, 104
f, 14.83333, 178, 61.5, 103.5
f, 13.08333, 157, 64.5, 123.5
f, 12.41667, 149, 58.3, 93
f, 11.91667, 143, 51.3, 50.5
f, 12.08333, 145, 58.8, 89
f, 15.91667, 191, 65.3, 107
f, 12.5, 150, 59.5, 78.5
f, 12.25, 147, 61.3, 115
f, 15, 180, 63.3, 114
f, 11.75, 141, 61.8, 85
f, 11.66667, 140, 53.5, 81
f, 13.66667, 164, 58, 83.5
f, 14.66667, 176, 61.3, 112
f, 15.41667, 185, 63.3, 101
f, 13.83333, 166, 61.5, 103.5
f, 14.58333, 175, 60.8, 93.5
f, 15, 180, 59, 112
f, 17.5, 210, 65.5, 140
f, 12.16667, 146, 56.3, 83.5
f, 14.16667, 170, 64.3, 90
f, 13.5, 162, 58, 84
f, 12.41667, 149, 64.3, 110.5
f, 11.58333, 139, 57.5, 96
f, 15.5, 186, 57.8, 95
f, 16.41667, 197, 61.5, 121
f, 14.08333, 169, 62.3, 99.5
f, 14.75, 177, 61.8, 142.5
f, 15.41667, 185, 65.3, 118
f, 15.16667, 182, 58.3, 104.5
f, 14.41667, 173, 62.8, 102.5
f, 13.83333, 166, 59.3, 89.5
f, 14, 168, 61.5, 95
f, 14.08333, 169, 62, 98.5
f, 12.5, 150, 61.3, 94
f, 15.33333, 184, 62.3, 108
f, 11.58333, 139, 52.8, 63.5
f, 12.25, 147, 59.8, 84.5
f, 12, 144, 59.5, 93.5
f, 14.75, 177, 61.3, 112
f, 14.83333, 178, 63.5, 148.5
f, 16.41667, 197, 64.8, 112
f, 12.16667, 146, 60, 109
f, 12.08333, 145, 59, 91.5
f, 12.25, 147, 55.8, 75
f, 12.08333, 145, 57.8, 84
f, 12.91667, 155, 61.3, 107
f, 13.91667, 167, 62.3, 92.5
f, 15.25, 183, 64.3, 109.5
f, 11.91667, 143, 55.5, 84
f, 15.25, 183, 64.5, 102.5
f, 15.41667, 185, 60, 106
f, 12.33333, 148, 56.3, 77
f, 12.25, 147, 58.3, 111.5
f, 12.83333, 154, 60, 114
f, 13, 156, 54.5, 75
f, 12, 144, 55.8, 73.5
f, 12.83333, 154, 62.8, 93.5
f, 12.66667, 152, 60.5, 105
f, 15.91667, 191, 63.3, 113.5
f, 15.83333, 190, 66.8, 140
f, 11.66667, 140, 60, 77
f, 12.33333, 148, 60.5, 84.5
f, 15.75, 189, 64.3, 113.5
f, 11.91667, 143, 58.3, 77.5
f, 14.83333, 178, 66.5, 117.5
f, 13.66667, 164, 65.3, 98
f, 13.08333, 157, 60.5, 112
f, 12.25, 147, 59.5, 101
f, 12.33333, 148, 59, 95
f, 14.75, 177, 61.3, 81
f, 14.25, 171, 61.5, 91
f, 14.33333, 172, 64.8, 142
f, 15.83333, 190, 56.8, 98.5
f, 15.25, 183, 66.5, 112
f, 11.91667, 143, 61.5, 116.5
f, 14.91667, 179, 63, 98.5
f, 15.5, 186, 57, 83.5
f, 15.16667, 182, 65.5, 133
f, 15.16667, 182, 62, 91.5
f, 11.83333, 142, 56, 72.5
f, 13.75, 165, 61.3, 106.5
f, 13.75, 165, 55.5, 67
f, 12.83333, 154, 61, 122.5
f, 12.5, 150, 54.5, 74
f, 12.91667, 155, 66, 144.5
f, 13.58333, 163, 56.5, 84
f, 11.75, 141, 56, 72.5
f, 12.25, 147, 51.5, 64
f, 17.5, 210, 62, 116
f, 14.25, 171, 63, 84
f, 13.91667, 167, 61, 93.5
f, 15.16667, 182, 64, 111.5
f, 12, 144, 61, 92
f, 16.08333, 193, 59.8, 115
f, 11.75, 141, 61.3, 85
f, 13.66667, 164, 63.3, 108
f, 15.5, 186, 63.5, 108
f, 14.08333, 169, 61.5, 85
f, 14.58333, 175, 60.3, 86
f, 15, 180, 61.3, 110.5
m, 13.75, 165, 64.8, 98
m, 13.08333, 157, 60.5, 105
m, 12, 144, 57.3, 76.5
m, 12.5, 150, 59.5, 84
m, 12.5, 150, 60.8, 128
m, 11.58333, 139, 60.5, 87
m, 15.75, 189, 67, 128
m, 15.25, 183, 64.8, 111
m, 12.25, 147, 50.5, 79
m, 12.16667, 146, 57.5, 90
m, 13.33333, 160, 60.5, 84
m, 13, 156, 61.8, 112
m, 14.41667, 173, 61.3, 93
m, 12.58333, 151, 66.3, 117
m, 11.75, 141, 53.3, 84
m, 12.5, 150, 59, 99.5
m, 13.66667, 164, 57.8, 95
m, 12.75, 153, 60, 84
m, 17.16667, 206, 68.3, 134
m, 20.83333, 250, 67.5, 171.5
m, 14.66667, 176, 63.8, 98.5
m, 14.66667, 176, 65, 118.5
m, 11.66667, 140, 59.5, 94.5
m, 15.41667, 185, 66, 105
m, 15, 180, 61.8, 104
m, 12.16667, 146, 57.3, 83
m, 15.25, 183, 66, 105.5
m, 11.66667, 140, 56.5, 84
m, 12.58333, 151, 58.3, 86
m, 12.58333, 151, 61, 81
m, 12, 144, 62.8, 94
m, 13.33333, 160, 59.3, 78.5
m, 14.83333, 178, 67.3, 119.5
m, 16.08333, 193, 66.3, 133
m, 13.5, 162, 64.5, 119
m, 13.66667, 164, 60.5, 95
m, 15.5, 186, 66, 112
m, 11.91667, 143, 57.5, 75
m, 14.58333, 175, 64, 92
m, 14.58333, 175, 68, 112
m, 14.58333, 175, 63.5, 98.5
m, 14.41667, 173, 69, 112.5
m, 14.16667, 170, 63.8, 112.5
m, 14.5, 174, 66, 108
m, 13.66667, 164, 63.5, 108
m, 12, 144, 59.5, 88
m, 13, 156, 66.3, 106
m, 12.41667, 149, 57, 92
m, 12, 144, 60, 117.5
m, 12.25, 147, 57, 84
m, 15.66667, 188, 67.3, 112
m, 14.08333, 169, 62, 100
m, 14.33333, 172, 65, 112
m, 12.5, 150, 59.5, 84
m, 16.08333, 193, 67.8, 127.5
m, 13.08333, 157, 58, 80.5
m, 14, 168, 60, 93.5
m, 11.66667, 140, 58.5, 86.5
m, 13, 156, 58.3, 92.5
m, 13, 156, 61.5, 108.5
m, 13.16667, 158, 65, 121
m, 15.33333, 184, 66.5, 112
m, 13, 156, 68.5, 114
m, 12, 144, 57, 84
m, 14.66667, 176, 61.5, 81
m, 14, 168, 66.5, 111.5
m, 12.41667, 149, 52.5, 81
m, 11.83333, 142, 55, 70
m, 15.66667, 188, 71, 140
m, 16.91667, 203, 66.5, 117
m, 11.83333, 142, 58.8, 84
m, 15.75, 189, 66.3, 112
m, 15.66667, 188, 65.8, 150.5
m, 16.66667, 200, 71, 147
m, 12.66667, 152, 59.5, 105
m, 14.5, 174, 69.8, 119.5
m, 13.83333, 166, 62.5, 84
m, 12.08333, 145, 56.5, 91
m, 11.91667, 143, 57.5, 101
m, 13.58333, 163, 65.3, 117.5
m, 13.83333, 166, 67.3, 121
m, 15.16667, 182, 67, 133
m, 14.41667, 173, 66, 112
m, 12.91667, 155, 61.8, 91.5
m, 13.5, 162, 60, 105
m, 14.75, 177, 63, 111
m, 14.75, 177, 60.5, 112
m, 14.58333, 175, 65.5, 114
m, 13.83333, 166, 62, 91
m, 12.5, 150, 59, 98
m, 12.5, 150, 61.8, 118
m, 15.66667, 188, 63.3, 115.5
m, 13.58333, 163, 66, 112
m, 14.25, 171, 61.8, 112
m, 13.5, 162, 63, 91
m, 11.75, 141, 57.5, 85
m, 14.5, 174, 63, 112
m, 11.83333, 142, 56, 87.5
m, 12.33333, 148, 60.5, 118
m, 11.66667, 140, 56.8, 83.5
m, 13.33333, 160, 64, 116
m, 12, 144, 60, 89
m, 17.16667, 206, 69.5, 171.5
m, 13.25, 159, 63.3, 112
m, 12.41667, 149, 56.3, 72
m, 16.08333, 193, 72, 150
m, 16.16667, 194, 65.3, 134.5
m, 12.66667, 152, 60.8, 97
m, 12.16667, 146, 55, 71.5
m, 11.58333, 139, 55, 73.5
m, 15.5, 186, 66.5, 112
m, 13.41667, 161, 56.8, 75
m, 12.75, 153, 64.8, 128
m, 16.33333, 196, 64.5, 98
m, 13.66667, 164, 58, 84
m, 13.25, 159, 62.8, 99
m, 14.83333, 178, 63.8, 112
m, 12.75, 153, 57.8, 79.5
m, 12.91667, 155, 57.3, 80.5
m, 14.83333, 178, 63.5, 102.5
m, 11.83333, 142, 55, 76
m, 13.66667, 164, 66.5, 112
m, 15.75, 189, 65, 114
m, 13.66667, 164, 61.5, 140
m, 13.91667, 167, 62, 107.5
m, 12.58333, 151, 59.3, 87

When I ran the code, it took over 700,000 iterations to converge.  Additionally, I had to set the learning rate to 0.00018, as anything higher caused the parameters to diverge.

Finally, to my knowledge, gradient descent doesn’t work very well with deep, multi-layered neural networks.  When we use back-propagation, the gradient tends to shrink as it is propagated back to the early layers of the network (the vanishing gradient problem), so those layers train very slowly and inaccurately.

Conclusion

Gradient descent is a good training algorithm for simple models like linear regression and logistic regression.  In fact, the concept is so easy to understand that you can implement the algorithm from scratch.  Additionally, gradient descent scales well with many features.  However, once you get to more complex models, gradient descent doesn’t work very well.

Code Implementation

Here’s my implementation of gradient descent.

def gradientDesc(thetas, points, learn):
    # Helper to calculate the hypothesis h(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n
    def calcHypoth(x):
        res = thetas[0]
        for i in range(1, len(thetas)):
            res += thetas[i] * x[i - 1]
        return res
    # Outer main procedure: performs one batch update and returns the new thetas
    if len(thetas) != len(points[0]):
        print("Number of thetas and points dimension do not match!")
        return [-1] * len(thetas)
    grads = [0] * len(thetas)
    m = len(points)
    for point in points:
        # Error for this training example: h(x) - y
        hypoth = calcHypoth(point[:-1]) - point[-1]
        grads[0] += hypoth / m                       # bias term (x_0 = 1)
        for i in range(1, len(grads)):
            grads[i] += (hypoth * point[i - 1]) / m  # feature terms
    # Store the new thetas (simultaneous update)
    newThetas = [0] * len(thetas)
    for i in range(len(thetas)):
        newThetas[i] = thetas[i] - (learn * grads[i])
    return newThetas

The code is geared toward single and multiple linear regression, and each call performs a single update of the parameters (a short usage sketch follows the parameter list).  The following parameters are passed in:

  • thetas – The initial parameter values that we use for training.  Passed in as a list of numbers
  • points – The training examples from the dataset.  Passed in as a list of tuples in the form (x_1, x_2, \ldots, x_n, y); the bias input x_0 = 1 is handled implicitly
  • learn – The learning rate for gradient descent
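
Because gradientDesc returns the updated parameters after a single pass over the data, it has to be called in a loop to actually train a model.  Here’s a rough sketch of that outer loop; the dataset, starting values, learning rate, and iteration count below are made up for illustration and are not the settings from the run described above.

# Illustrative training loop using the gradientDesc function above.
points = [(1, 3), (2, 5), (3, 7), (4, 9)]   # (x_1, y) pairs sampled from y = 2x + 1
thetas = [0, 0]                              # theta_0 (bias) and theta_1
learn = 0.05

for _ in range(5000):
    thetas = gradientDesc(thetas, points, learn)

print(thetas)  # should end up close to [1, 2]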

For a basic implementation of gradient descent, Matt Nedrich provides some code on gradient descent and linear regression.  I based my implementation off of his code.

Have any questions, comments, or spotted an error?  Leave a comment down below.