How to implement a neural network (1)


Logistic classification function

This intermezzo will cover:

- the logistic classification function
- the derivative of the logistic function
- the cross-entropy cost function for the logistic function
- the derivative of the cross-entropy cost function

If we want to do classification with neural networks, we want to output a probability distribution over the classes from the output targets t. For the classification of 2 classes, t=1 or t=0, we can use the logistic function used in logistic regression. For multiclass classification there exists an extension of this logistic function, called the softmax function, which is used in multinomial logistic regression. The following section will explain the logistic function and how to optimize it; the next intermezzo will explain the softmax function and how to derive it.


Logistic function

The goal is to predict the target class t from an input z. The probability P(t=1|z) that input z is classified as class t=1 is represented by the output y of the logistic function, computed as y=σ(z). σ is the logistic function and is defined as:

σ(z) = 1 / (1 + e^{-z})

This logistic function, implemented below as logistic(z), maps the input z to an output between 0 and 1, as is illustrated in the figure below.
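The logistic function can be implemented in a couple of lines of NumPy. This is a minimal sketch; the original notebook cell (and its plot) is not reproduced here:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real z to the interval (0, 1)."""
    return 1. / (1. + np.exp(-z))

print(logistic(0))  # 0.5: an input of 0 sits exactly between the two classes
```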

We can write the probabilities that the class is t=1 or t=0 given input z as:

P(t=1|z) = σ(z) = 1 / (1 + e^{-z})

P(t=0|z) = 1 - σ(z) = e^{-z} / (1 + e^{-z})

Note that input z to the logistic function corresponds to the log odds ratio of P(t=1|z) over P(t=0|z):

log( P(t=1|z) / P(t=0|z) ) = log( (1/(1+e^{-z})) / (e^{-z}/(1+e^{-z})) ) = log(e^{z}) = z

This means that the log odds ratio log(P(t=1|z)/P(t=0|z)) changes linearly with z. And if z = x·w as in neural networks, this means that the log odds ratio changes linearly with the parameters w and input samples x.
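As a quick numerical check of this relationship, the log odds ratio computed from σ(z) recovers z itself (a sketch reusing the logistic definition from above):

```python
import numpy as np

def logistic(z):
    """Logistic function: sigma(z) = 1 / (1 + e^-z)."""
    return 1. / (1. + np.exp(-z))

# log(P(t=1|z) / P(t=0|z)) = log(y / (1 - y)) should equal z itself:
for z in [-2., 0.5, 3.]:
    y = logistic(z)
    print(z, np.log(y / (1. - y)))  # second value equals z up to float error
```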



Derivative of the logistic function

Since neural networks typically use gradient-based optimization techniques such as gradient descent, it is important to define the derivative of the output y of the logistic function with respect to its input z. ∂y/∂z can be calculated as:

∂y/∂z = ∂σ(z)/∂z = e^{-z} / (1 + e^{-z})^2

And since 1 - σ(z) = 1 - 1/(1+e^{-z}) = e^{-z}/(1+e^{-z}), this can be rewritten as:

∂y/∂z = (1/(1+e^{-z})) · (e^{-z}/(1+e^{-z})) = σ(z) · (1 - σ(z)) = y(1-y)


This derivative is implemented as logistic_derivative(z) and is plotted below.
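A minimal implementation of this derivative, assuming the logistic(z) definition from before (the plotting code of the original cell is omitted):

```python
import numpy as np

def logistic(z):
    """Logistic function: sigma(z) = 1 / (1 + e^-z)."""
    return 1. / (1. + np.exp(-z))

def logistic_derivative(z):
    """Derivative of the logistic function: dy/dz = y * (1 - y)."""
    y = logistic(z)
    return y * (1. - y)

print(logistic_derivative(0))  # 0.25, the maximum of the derivative
```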


Cross-entropy cost function for the logistic function

The output of the model y=σ(z) can be interpreted as a probability y that input z belongs to one class (t=1), or probability 1-y that z belongs to the other class (t=0), in a two-class classification problem. We note this down as:

P(t=1|z) = σ(z) = y

P(t=0|z) = 1 - σ(z) = 1 - y


The neural network model will be optimized by maximizing the likelihood that a given set of parameters θ of the model can result in a prediction of the correct class of each input sample. The parameters θ transform each input sample i into an input to the logistic function z_i. The likelihood maximization can be written as:

argmax_θ L(θ|t,z) = argmax_θ ∏_{i=1}^{n} L(θ|t_i,z_i)


The likelihood L(θ|t,z) can be rewritten as the joint probability of generating t and z given the parameters θ: P(t,z|θ). Since P(A,B) = P(A|B)P(B), this can be written as:

P(t,z|θ) = P(t|z,θ) P(z|θ)


Since we are not interested in the probability of z, we can reduce this to: L(θ|t,z) = P(t|z,θ) = ∏_{i=1}^{n} P(t_i|z_i,θ). Since t_i is a Bernoulli variable, and the probability P(t|z) = y is fixed for a given θ, we can rewrite this as:

P(t|z) = ∏_{i=1}^{n} P(t_i=1|z_i)^{t_i} · (1 - P(t_i=1|z_i))^{1-t_i} = ∏_{i=1}^{n} y_i^{t_i} · (1 - y_i)^{1-t_i}


Since the logarithmic function is a monotone increasing function, we can instead optimize the log-likelihood function argmax_θ log L(θ|t,z). This maximum will be the same as the maximum of the regular likelihood function. The log-likelihood function can be written as:

log L(θ|t,z) = log ∏_{i=1}^{n} y_i^{t_i} · (1 - y_i)^{1-t_i} = ∑_{i=1}^{n} [ t_i log(y_i) + (1-t_i) log(1-y_i) ]


Minimizing the negative of this function (minimizing the negative log-likelihood) corresponds to maximizing the likelihood. This error function ξ(t,y) is typically known as the cross-entropy error function (also known as log-loss):

ξ(t,y) = -log L(θ|t,z) = -∑_{i=1}^{n} [ t_i log(y_i) + (1-t_i) log(1-y_i) ]


This function looks complicated, but besides the previous derivation there are a couple of intuitions for why this function is used as a cost function for logistic regression. First of all it can be rewritten as:

ξ(t_i, y_i) = -log(y_i)      if t_i = 1
ξ(t_i, y_i) = -log(1 - y_i)  if t_i = 0

Which in the case of t_i=1 is 0 if y_i=1 (-log(1)=0) and goes to infinity as y_i → 0 (lim_{y→0} -log(y) = +∞). The reverse effect happens if t_i=0. So what we end up with is a cost function that is 0 if the probability to predict the correct class is 1, and goes to infinity as the probability to predict the correct class goes to 0.
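This behaviour is easy to verify numerically. A sketch; the helper name cross_entropy is illustrative, not from the original notebook:

```python
import numpy as np

def cross_entropy(y, t):
    """Cross-entropy cost for a single prediction y and target t (0 or 1)."""
    return -t * np.log(y) - (1. - t) * np.log(1. - y)

# For t = 1 the cost vanishes as y -> 1 and explodes as y -> 0:
print(cross_entropy(0.999, 1))  # close to 0
print(cross_entropy(0.001, 1))  # large
```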


Notice that the cost function ξ(t,y) is equal to the negative log probability that z is classified as its correct class:

ξ(t,y) = -log(P(t=1|z))  if t = 1
ξ(t,y) = -log(P(t=0|z))  if t = 0


By minimizing the negative log probability, we will maximize the log probability. And since t can only be 0 or 1, we can write ξ(t,y) as:

ξ(t,y) = -t · log(y) - (1-t) · log(1-y)


Which will give ξ(t,y) = -∑_{i=1}^{n} [ t_i log(y_i) + (1-t_i) log(1-y_i) ] if we sum over all n samples.
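The summed cost can be sketched in vectorized NumPy as follows (the function name cost is an assumption; y and t are illustrative arrays):

```python
import numpy as np

def cost(y, t):
    """Cross-entropy cost summed over all n samples (y, t are arrays)."""
    return -np.sum(t * np.log(y) + (1. - t) * np.log(1. - y))

t = np.array([1., 0., 1.])   # example targets
y = np.array([0.9, 0.2, 0.8])  # example model outputs
print(cost(y, t))  # -(log 0.9 + log 0.8 + log 0.8), roughly 0.55
```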


Another reason to use the cross-entropy function is that in simple logistic regression this results in a convex cost function, of which the global minimum will be easy to find. Note that this is not necessarily the case anymore in multilayer neural networks.


Derivative of the cross-entropy cost function for the logistic function

The derivative ∂ξ/∂y of the cost function with respect to its input can be calculated as:

∂ξ/∂y = ∂(-t·log(y) - (1-t)·log(1-y))/∂y = -t/y + (1-t)/(1-y) = (y-t) / (y(1-y))


This derivative will give a nice formula if it is used to calculate the derivative of the cost function with respect to the inputs of the classifier, ∂ξ/∂z, since the derivative of the logistic function is ∂y/∂z = y(1-y):

∂ξ/∂z = (∂y/∂z) · (∂ξ/∂y) = y(1-y) · (y-t)/(y(1-y)) = y - t
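The simple result ∂ξ/∂z = y - t can be sanity-checked against a central finite difference (a sketch; the variable names are illustrative):

```python
import numpy as np

def logistic(z):
    """Logistic function: sigma(z) = 1 / (1 + e^-z)."""
    return 1. / (1. + np.exp(-z))

def cost(y, t):
    """Cross-entropy cost for one sample."""
    return -t * np.log(y) - (1. - t) * np.log(1. - y)

z, t, eps = 0.7, 1., 1e-6
grad = logistic(z) - t  # analytic derivative: y - t
numeric = (cost(logistic(z + eps), t) - cost(logistic(z - eps), t)) / (2 * eps)
print(grad, numeric)  # the two values agree
```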



This post is generated from an IPython notebook file. Link to the full IPython notebook file


Logistic regression (classification)

This part will cover the 2-class logistic regression (classification) model.

While the previous tutorial described a very simple one-input-one-output linear regression model, this tutorial will describe a 2-class classification neural network with two input dimensions. This model is known in statistics as the logistic regression model. This network can be represented graphically as:

Image of the logistic model

The notebook starts out with importing the libraries we need:

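The original import cell is not reproduced here; a minimal sketch, assuming NumPy and matplotlib are the libraries in question:

```python
import numpy as np                 # numerical computations on arrays
import matplotlib.pyplot as plt    # plotting the samples and decision boundary
```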

Define the class distributions

In this example the target classes t will be generated from 2 class distributions: blue (t=1) and red (t=0). Samples from both classes are sampled from their respective distributions. These samples are plotted in the figure below. Note that X is a N×2 matrix of individual input samples x_i, and that t is a corresponding N×1 vector of target values t_i.
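A sketch of how such samples might be generated; the means, sample counts, and use of unit-variance Gaussians are illustrative assumptions, not the notebook's actual values:

```python
import numpy as np

np.random.seed(1)
n_per_class = 20  # assumption: number of samples per class

# Sample each class from a 2D Gaussian with a different mean:
x_red = np.random.randn(n_per_class, 2) + [-1., 0.]   # class t = 0
x_blue = np.random.randn(n_per_class, 2) + [1., 0.]   # class t = 1

X = np.vstack((x_red, x_blue))               # N x 2 matrix of input samples
t = np.vstack((np.zeros((n_per_class, 1)),   # N x 1 vector of targets
               np.ones((n_per_class, 1))))
print(X.shape, t.shape)  # (40, 2) (40, 1)
```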

