In my previous post, you saw the derivative of the cost function for logistic regression as:

\frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{m}\displaystyle\sum_{i=1}^{m}(g(x_i) - y_i)x_{i,j}, \quad x_{i,0}=1

I bet several of you were thinking, “How on Earth do you turn a cost function like this:

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[(y_i)log(g(x_i)) + (1 - y_i)log(1-g(x_i))]

into a nice derivative like this:

\frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\ldots,\theta_n) = \frac{1}{m}\displaystyle\sum_{i=1}^{m}(g(x_i) - y_i)x_{i,j}?”

Well, this post is going to go through the math.  Even if you already know it, it’s a good algebra and calculus problem.

Before we begin, I want to make a few notes.  First, you would normally use multivariable calculus and differentiate with respect to each parameter \theta_j.  To make things a bit easier to follow, I'm going to write the derivative of h_\theta(x_i) as a single symbol, h_\theta'(x_i).  For those who want the multivariable version, the partial derivative of h_\theta(x_i) with respect to \theta_j is:

\frac{\partial}{\partial \theta_j}h_\theta(x_i) = x_{i,j}

We define h_\theta(x) as:

h_\theta(x) = \displaystyle\sum_{j=1}^{n}\theta_j x_j + b

(With the x_{i,0}=1 convention from above, the bias b is just \theta_0.)

We also use log to denote the natural logarithm (ln).

Finally, we define the function g(x) as follows:

g(x) = \frac{1}{1+e^{-h_\theta(x)}}
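If it helps to see these definitions as code, here is a minimal NumPy sketch. The function names hypothesis, sigmoid, and cost are my own, and X is assumed to be an m-by-n matrix of training examples; this is just an illustration of the formulas above, not part of the derivation.

```python
import numpy as np

def hypothesis(theta, X, b=0.0):
    # h_theta(x) = sum_j theta_j * x_j + b, computed for every row of X at once
    return X @ theta + b

def sigmoid(z):
    # g(x) = 1 / (1 + e^(-h_theta(x)))
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, b=0.0):
    # J(theta) = -(1/m) * sum[ y*log(g) + (1-y)*log(1-g) ], with log = natural log
    m = len(y)
    g = sigmoid(hypothesis(theta, X, b))
    return -(1.0 / m) * np.sum(y * np.log(g) + (1.0 - y) * np.log(1.0 - g))
```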

Now that the notes are out of the way, let's begin:

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[(y_i)log(g(x_i)) + (1 - y_i)log(1-g(x_i))]

The first thing we want to do is expand g(x_i):

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[(y_i)log(\frac{1}{1+e^{-h_\theta(x_i)}}) + (1 - y_i)log(1-\frac{1}{1+e^{-h_\theta(x_i)}})]

Putting the second term over a common denominator, 1-\frac{1}{1+e^{-h_\theta(x_i)}} = \frac{e^{-h_\theta(x_i)}}{1+e^{-h_\theta(x_i)}}, so:

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[(y_i)log(\frac{1}{1+e^{-h_\theta(x_i)}}) + (1 - y_i)log(\frac{e^{-h_\theta(x_i)}}{1+e^{-h_\theta(x_i)}})]

Now, using the property log(\frac{x}{y}) = log(x) - log(y), we get:

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[(y_i)(-log(1+e^{-h_\theta(x_i)})) + (1 - y_i)(log(e^{-h_\theta(x_i)}) - log(1+e^{-h_\theta(x_i)}))]

Expanding the products:

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[-(y_i)log(1+e^{-h_\theta(x_i)}) + log(e^{-h_\theta(x_i)}) - log(1+e^{-h_\theta(x_i)}) - y_ilog(e^{-h_\theta(x_i)}) + y_ilog(1+e^{-h_\theta(x_i)})]

The -y_ilog(1+e^{-h_\theta(x_i)}) and +y_ilog(1+e^{-h_\theta(x_i)}) terms cancel.  Then, using the property log(a^{-c}) = -clog(a) (and the fact that log(e^{h_\theta(x_i)}) = h_\theta(x_i), since log is the natural logarithm), we get:

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[-log(e^{h_\theta(x_i)}) - log(1+e^{-h_\theta(x_i)}) + y_ih_\theta(x_i)]

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[-[log(e^{h_\theta(x_i)}) + log(1+e^{-h_\theta(x_i)})] + y_ih_\theta(x_i)]

Using the property log(x) + log(y) = log(xy), and noting that e^{h_\theta(x_i)}(1+e^{-h_\theta(x_i)}) = e^{h_\theta(x_i)} + 1, we get:

J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[-log(1+e^{h_\theta(x_i)}) + y_ih_\theta(x_i)]
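Before differentiating, it's worth a quick sanity check that the algebra really didn't change the cost.  Here's a tiny NumPy snippet (purely illustrative, not part of the derivation) comparing the original bracketed term y_ilog(g(x_i)) + (1-y_i)log(1-g(x_i)) with the simplified bracket -log(1+e^{h_\theta(x_i)}) + y_ih_\theta(x_i) for a handful of values:

```python
import numpy as np

def brackets_match(h, y):
    # g(x) from the sigmoid definition above
    g = 1.0 / (1.0 + np.exp(-h))
    original = y * np.log(g) + (1 - y) * np.log(1 - g)
    simplified = -np.log(1.0 + np.exp(h)) + y * h
    return np.isclose(original, simplified)

# try a few values of h_theta(x) with labels y in {0, 1}
for h in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    for y in [0, 1]:
        assert brackets_match(h, y)
print("simplified form matches the original cost term")
```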

Now we differentiate with respect to \theta_j, remembering that h_\theta'(x_i) is shorthand for \frac{\partial}{\partial \theta_j}h_\theta(x_i):

\frac{\partial}{\partial \theta_j}J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[-\frac{h_\theta'(x_i)e^{h_\theta(x_i)}}{1+e^{h_\theta(x_i)}} + y_ih_\theta'(x_i)]

Factoring out h_\theta'(x_i):

\frac{\partial}{\partial \theta_j}J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[-\frac{e^{h_\theta(x_i)}}{1+e^{h_\theta(x_i)}} + y_i]h_\theta'(x_i)

Dividing the numerator and denominator of the fraction by e^{h_\theta(x_i)}:

\frac{\partial}{\partial \theta_j}J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[-\frac{1}{1+e^{-h_\theta(x_i)}} + y_i]h_\theta'(x_i)

Since we already know that \frac{1}{1+e^{-h_\theta(x_i)}} is just g(x_i), we get:

\frac{\partial}{\partial \theta_j}J(\theta_0,\ldots,\theta_n) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}[-g(x_i) + y_i]h_\theta'(x_i)

Distributing the leading minus sign:

\frac{\partial}{\partial \theta_j}J(\theta_0,\ldots,\theta_n) = \frac{1}{m}\displaystyle\sum_{i=1}^{m}[g(x_i) - y_i]h_\theta'(x_i)

If we instead substitute the multivariable derivative, \frac{\partial}{\partial \theta_j}h_\theta(x_i) = x_{i,j}, we get:

\frac{\partial}{\partial \theta_j}J(\theta_0,\ldots,\theta_n) = \frac{1}{m}\displaystyle\sum_{i=1}^{m}[g(x_i) - y_i]x_{i,j}
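And that's the formula from the top of the post.  As a final check, here's a small NumPy sketch (the names and the random test data are mine, purely for illustration) that computes the gradient with the derived formula and compares it against a finite-difference approximation of J:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(g) + (1-y)*log(1-g) ]
    g = sigmoid(X @ theta)
    return -np.mean(y * np.log(g) + (1 - y) * np.log(1 - g))

def gradient(theta, X, y):
    # derived formula: dJ/dtheta_j = (1/m) * sum_i (g(x_i) - y_i) * x_ij,
    # written in vectorized form as (1/m) * X^T (g - y)
    g = sigmoid(X @ theta)
    return X.T @ (g - y) / len(y)

# small made-up dataset; the column of ones is the x_0 = 1 bias feature
rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.normal(size=(20, 2))]
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=3)

# central finite differences of J, one parameter at a time
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(gradient(theta, X, y), numeric))  # should print True
```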

The lesson

When I was attempting to derive the gradient of this cost function, I learned that you have to know your logarithm rules very well.  They let you simplify the expression before differentiating.  Otherwise, if you start taking derivatives at the wrong step, you end up causing yourself a lot of grief down the road.  I eventually found this link to see where I was getting stuck, and from that point on I was able to work out the solution.

Have a question or comment, or spotted an error in the math?  Leave a comment down below.