A Neural Network in 11 lines of Python

From: http://iamtrask.github.io/2015/07/12/basic-python-network/

Summary: I learn best with toy code that I can play with. This tutorial teaches backpropagation via a very simple toy example, a short python implementation.

Edit: Some folks have asked about a followup article, and I’m planning to write one. I’ll tweet it out when it’s complete at @iamtrask. Feel free to follow if you’d be interested in reading it and thanks for all the feedback!

Just Give Me The Code:

01.X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
02.y = np.array([[0,1,1,0]]).T
03.syn0 = 2*np.random.random((3,4)) - 1
04.syn1 = 2*np.random.random((4,1)) - 1
05.for j in xrange(60000):
06.    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))
07.    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))
08.    l2_delta = (y - l2)*(l2*(1-l2))
09.    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1))
10.    syn1 += l1.T.dot(l2_delta)
11.    syn0 += X.T.dot(l1_delta)

However, this is a bit terse… let’s break it apart into a few simple parts. (Note that this terse version assumes numpy has already been imported as np.)



Part 1: A Tiny Toy Network

A neural network trained with backpropagation is attempting to use input to predict output.

Inputs Output
0 0 1 0
1 1 1 1
1 0 1 1
0 1 1 0

Consider trying to predict the output column given the three input columns. We could solve this problem by simply measuring statistics between the input values and the output values. If we did so, we would see that the leftmost input column is perfectly correlated with the output. Backpropagation, in its simplest form, measures statistics like this to make a model. Let’s jump right in and use it to do this.

2 Layer Neural Network:

01.import numpy as np
02. 
03.# sigmoid function
04.def nonlin(x,deriv=False):
05.    if(deriv==True):
06.        return x*(1-x)
07.    return 1/(1+np.exp(-x))
08. 
09.# input dataset
10.X = np.array([ [0,0,1],
11.               [0,1,1],
12.               [1,0,1],
13.               [1,1,1] ])
14. 
15.# output dataset
16.y = np.array([[0,0,1,1]]).T
17. 
18.# seed random numbers to make calculation
19.# deterministic (just a good practice)
20.np.random.seed(1)
21. 
22.# initialize weights randomly with mean 0
23.syn0 = 2*np.random.random((3,1)) - 1
24. 
25.for iter in xrange(10000):
26. 
27.    # forward propagation
28.    l0 = X
29.    l1 = nonlin(np.dot(l0,syn0))
30. 
31.    # how much did we miss?
32.    l1_error = y - l1
33. 
34.    # multiply how much we missed by the
35.    # slope of the sigmoid at the values in l1
36.    l1_delta = l1_error * nonlin(l1,True)
37. 
38.    # update weights
39.    syn0 += np.dot(l0.T,l1_delta)
40. 
41.print "Output After Training:"
42.print l1


Variable   Definition
X          Input dataset matrix where each row is a training example
y          Output dataset matrix where each row is a training example
l0         First Layer of the Network, specified by the input data
l1         Second Layer of the Network, otherwise known as the hidden layer
syn0       First layer of weights, Synapse 0, connecting l0 to l1.
*          Elementwise multiplication, so two vectors of equal size are multiplying corresponding values 1-to-1 to generate a final vector of identical size.
-          Elementwise subtraction, so two vectors of equal size are subtracting corresponding values 1-to-1 to generate a final vector of identical size.
x.dot(y)   If x and y are vectors, this is a dot product. If both are matrices, it’s a matrix-matrix multiplication. If only one is a matrix, then it’s vector-matrix multiplication.

As you can see in the “Output After Training”, it works!!! Before I describe the process, I recommend playing around with the code to get an intuitive feel for how it works. You should be able to run it “as is” in an ipython notebook (or a script if you must, but I HIGHLY recommend the notebook). Here are some good places to look in the code:

• Compare l1 after the first iteration and after the last iteration.
• Check out the “nonlin” function. This is what gives us a probability as output.
• Check out how l1_error changes as you iterate.
• Take apart line 36. Most of the secret sauce is here.
• Check out line 39. Everything in the network prepares for this operation.

Let’s walk through the code line by line.

Recommendation: open this blog in two screens so you can see the code while you read it. That’s kinda what I did while I wrote it. 🙂

Line 01: This imports numpy, which is a linear algebra library. This is our only dependency.

Line 04: This is our “nonlinearity”. While it can be one of several kinds of functions, this nonlinearity is a “sigmoid”. A sigmoid function maps any value to a value between 0 and 1. We use it to convert numbers to probabilities. It also has several other desirable properties for training neural networks.

Line 05: Notice that this function can also generate the derivative of a sigmoid (when deriv=True). One of the desirable properties of a sigmoid function is that its output can be used to create its derivative. If the sigmoid’s output is a variable “out”, then the derivative is simply out * (1-out). This is very efficient.

If you’re unfamiliar with derivatives, just think of the derivative as the slope of the sigmoid function at a given point (different points have different slopes). For more on derivatives, check out this derivatives tutorial from Khan Academy.
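
To make that concrete, here’s a quick sanity check (my own sketch, assuming only numpy and the nonlin function defined above): it compares out*(1-out) against a numerical estimate of the sigmoid’s slope.

import numpy as np

def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))

x = np.array([-1.0, 0.0, 2.0])
out = nonlin(x)

# slope computed from the sigmoid's own output
analytical = nonlin(out,deriv=True)

# slope estimated numerically: (f(x+h) - f(x-h)) / 2h
h = 1e-5
numerical = (nonlin(x+h) - nonlin(x-h)) / (2*h)

print analytical   # roughly [ 0.197  0.25   0.105 ]
print numerical    # matches to several decimal places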

Line 10: This initializes our input dataset as a numpy matrix. Each row is a single “training example”. Each column corresponds to one of our input nodes. Thus, we have 3 input nodes to the network and 4 training examples.

Line 16: This initializes our output dataset. In this case, I generated the dataset horizontally (with a single row and 4 columns) for space. “.T” is the transpose function. After the transpose, this y matrix has 4 rows with one column. Just like our input, each row is a training example, and each column (only one) is an output node. So, our network has 3 inputs and 1 output.

Line 20: It’s good practice to seed your random numbers. Your numbers will still be randomly distributed, but they’ll be randomly distributed in exactly the same way each time you train. This makes it easier to see how your changes affect the network.

Line 23: This is our weight matrix for this neural network. It’s called “syn0” to imply “synapse zero”. Since we only have 2 layers (input and output), we only need one matrix of weights to connect them. Its dimension is (3,1) because we have 3 inputs and 1 output. Another way of looking at it is that l0 is of size 3 and l1 is of size 1. Thus, we want to connect every node in l0 to every node in l1, which requires a matrix of dimensionality (3,1). 🙂

Also notice that it is initialized randomly with a mean of zero. There is quite a bit of theory that goes into weight initialization. For now, just take it as a best practice that it’s a good idea to have a mean of zero in weight initialization.

Another note is that the “neural network” is really just this matrix. We have “layers” l0 and l1 but they are transient values based on the dataset. We don’t save them. All of the learning is stored in the syn0 matrix.

Line 25: This begins our actual network training code. This for loop “iterates” multiple times over the training code to optimize our network to the dataset.

Line 28: Since our first layer, l0, is simply our data, we explicitly describe it as such at this point. Remember that X contains 4 training examples (rows). We’re going to process all of them at the same time in this implementation. This is known as “full batch” training. Thus, we have 4 different l0 rows, but you can think of it as a single training example if you want. It makes no difference at this point. (We could load in 1,000 or 10,000 examples without changing any of the code.)

Line 29: This is our prediction step. Basically, we first let the network “try” to predict the output given the input. We will then study how it performs so that we can adjust it to do a bit better for each iteration.

This line contains 2 steps. The first matrix-multiplies l0 by syn0. The second passes our output through the sigmoid function. Consider the dimensions of each:

(4 x 3) dot (3 x 1) = (4 x 1)

Matrix multiplication is ordered, such that the dimensions in the middle of the equation must be the same. The final matrix thus has the number of rows of the first matrix and the number of columns of the second matrix.

Since we loaded in 4 training examples, we ended up with 4 guesses for the correct answer, a (4 x 1) matrix. Each output corresponds with the network’s guess for a given input. Perhaps it becomes intuitive why we could have “loaded in” an arbitrary number of training examples. The matrix multiplication would still work out. 🙂
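
If you want to convince yourself of this shape rule, here’s a tiny sketch (my own example, numpy only) that checks it directly:

import numpy as np

A = np.random.random((4,3))   # 4 training examples, 3 input nodes
B = np.random.random((3,1))   # 3 input nodes connected to 1 output node

print np.dot(A,B).shape       # (4, 1): rows of A, columns of B
# np.dot(B,A) would raise a ValueError because the
# inner dimensions (1 and 4) don't line up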

Line 32: So, given that l1 has a “guess” for each input, we can now compare how well it did by subtracting the guess (l1) from the true answer (y). l1_error is just a vector of positive and negative numbers reflecting how much the network missed.

Line 36: Now we’re getting to the good stuff! This is the secret sauce! There’s a lot going on in this line, so let’s further break it into two parts.

First Part: The Derivative

1.nonlin(l1,True)

Picture three dots on the sigmoid curve: one at a very high value such as x=2.0, one at a very low value such as x=-1.0, and one at x=0. The code above generates the slope of the sigmoid at points like these. Notice that very high and very low values have rather shallow slopes, while the highest slope you can have is at x=0. This plays an important role. Also notice that all derivatives are between 0 and 1.

Entire Statement: The Error Weighted Derivative

1.l1_delta = l1_error * nonlin(l1,True)

There are more “mathematically precise” names for this than “The Error Weighted Derivative”, but I think this one captures the intuition. l1_error is a (4,1) matrix. nonlin(l1,True) returns a (4,1) matrix. What we’re doing is multiplying them “elementwise”. This returns a (4,1) matrix, l1_delta, with the multiplied values.

When we multiply the “slopes” by the error, we are reducing the error of high confidence predictions. Think about the sigmoid’s shape again! If the slope was really shallow (close to 0), then the network either had a very high value or a very low value, which means that it was quite confident one way or the other. However, if the network guessed something close to (x=0, y=0.5), then it isn’t very confident. We update these “wishy-washy” predictions most heavily, and we tend to leave the confident ones alone by multiplying them by a number close to 0.
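
Here’s a tiny worked example of that muting effect (the prediction numbers are made up for illustration):

import numpy as np

l1 = np.array([[0.99],[0.50],[0.01],[0.60]])   # hypothetical predictions
y  = np.array([[1],[1],[0],[1]])               # targets

l1_error = y - l1          # [[ 0.01], [ 0.5 ], [-0.01], [ 0.4 ]]
slope    = l1*(1-l1)       # shallow for 0.99 and 0.01, steep for 0.50

print l1_error * slope
# the confident rows (0.99 and 0.01) get near-zero updates, while
# the wishy-washy 0.50 row gets the largest update by far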

Line 39: We are now ready to update our network! Let’s take a look at a single training example. In this training example, we’re all set up to update our weights. Let’s update the far-left weight, supposing its current value is 9.5.

weight_update = input_value * l1_delta

For the far-left weight, this would multiply 1.0 * the l1_delta. Presumably, this would increment 9.5 ever so slightly. Why only a small amount? Well, the prediction was already very confident, and the prediction was largely correct. A small error and a small slope means a VERY small update. Consider all the weights. It would ever so slightly increase all three.

However, because we’re using a “full batch” configuration, we’re doing the above step on all four training examples at once. So, what does line 39 do? It computes the weight updates for each weight for each training example, sums them, and updates the weights, all in a simple line. Play around with the matrix multiplication and you’ll see it do this!
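
To see that line 39 really is summing one update per training example, here’s a sketch (with made-up delta values) comparing the matrix form against an explicit loop:

import numpy as np

l0       = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
l1_delta = np.array([[0.1],[-0.2],[0.3],[0.05]])   # hypothetical deltas

# matrix form, exactly like line 39
update_matrix = np.dot(l0.T,l1_delta)

# per-example form: one weight update per training example, summed
update_loop = np.zeros((3,1))
for i in xrange(4):
    update_loop += l0[i:i+1].T * l1_delta[i]

print np.allclose(update_matrix,update_loop)   # True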

Takeaways:

So, now that we’ve looked at how the network updates, let’s look back at our training data and reflect. When both an input and an output are 1, we increase the weight between them. When an input is 1 and an output is 0, we decrease the weight between them.

Inputs Output
0 0 1 0
1 1 1 1
1 0 1 1
0 1 1 0

Thus, in the four training examples above, the weight from the first input to the output would consistently increment or remain unchanged, whereas the other two weights would find themselves both increasing and decreasing across training examples (cancelling out progress). This phenomenon is what causes our network to learn based on correlations between the input and output.
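
You can verify this correlation story directly (a quick sketch using numpy’s corrcoef on the table above):

import numpy as np

X = np.array([[0,0,1],[1,1,1],[1,0,1],[0,1,1]])
y = np.array([0,1,1,0])

for col in xrange(3):
    print col, np.corrcoef(X[:,col],y)[0,1]
# column 0 prints 1.0 (perfectly correlated with the output);
# column 1 prints 0.0; column 2 is constant, so its correlation
# is undefined (numpy prints nan and warns about zero variance)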



Part 2: A Slightly Harder Problem

 

Inputs Output
0 0 1 0
0 1 1 1
1 0 1 1
1 1 1 0

Consider trying to predict the output column given the two varying input columns. A key takeaway should be that neither column, taken alone, has any correlation with the output. Each column has a 50% chance of predicting a 1 and a 50% chance of predicting a 0.

So, what’s the pattern? It appears to be completely unrelated to column three, which is always 1. However, columns 1 and 2 give more clarity. If either column 1 or 2 are a 1 (but not both!) then the output is a 1. This is our pattern.

This is considered a “nonlinear” pattern because there isn’t a direct one-to-one relationship between the input and output. Instead, there is a one-to-one relationship between a combination of inputs, namely columns 1 and 2.
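
The same correlation check from Part 1 makes the point numerically (my own sketch): neither input column alone correlates with the output, but the “one but not both” combination matches it perfectly.

import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([0,1,1,0])

print np.corrcoef(X[:,0],y)[0,1]   # 0.0: column 1 alone tells you nothing
print np.corrcoef(X[:,1],y)[0,1]   # 0.0: column 2 alone tells you nothing

combo = (X[:,0] + X[:,1]) % 2      # 1 if exactly one of the two columns is 1
print np.corrcoef(combo,y)[0,1]    # 1.0: the combination IS the pattern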

Believe it or not, image recognition is a similar problem. If one had 100 identically sized images of pipes and bicycles, no individual pixel position would directly correlate with the presence of a bicycle or pipe. The pixels might as well be random from a purely statistical point of view. However, certain combinations of pixels are not random, namely the combinations that form the image of a bicycle or a pipe.

Our Strategy

In order to first combine pixels into something that can then have a one-to-one relationship with the output, we need to add another layer. Our first layer will combine the inputs, and our second layer will then map them to the output using the output of the first layer as input. Before we jump into an implementation though, take a look at this table.

Inputs (l0)    Hidden Values (l1)    Output (l2)
0 0 1          0.1 0.2 0.5 0.2       0
0 1 1          0.2 0.6 0.7 0.1       1
1 0 1          0.3 0.2 0.3 0.9       1
1 1 1          0.2 0.1 0.3 0.8       0

If we randomly initialize our weights, we will get hidden state values for layer 1. Notice anything? The second column (second hidden node), has a slight correlation with the output already! It’s not perfect, but it’s there. Believe it or not, this is a huge part of how neural networks train. (Arguably, it’s the only way that neural networks train.) What the training below is going to do is amplify that correlation. It’s both going to update syn1 to map it to the output, and update syn0 to be better at producing it from the input!
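
You can reproduce that observation with a few lines (a sketch: the exact correlations depend on the random seed):

import numpy as np

def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))

np.random.seed(1)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([0,1,1,0])

syn0 = 2*np.random.random((3,4)) - 1   # untrained random weights
l1 = nonlin(np.dot(X,syn0))            # hidden values before any training

for node in xrange(4):
    print node, np.corrcoef(l1[:,node],y)[0,1]
# even with zero training, some hidden nodes already show a weak
# correlation with the output purely by chance; training amplifies
# whichever correlations happen to exist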

Note: The field of adding more layers to model more combinations of relationships such as this is known as “deep learning” because of the increasingly deep layers being modeled.

3 Layer Neural Network:

01.import numpy as np
02. 
03.def nonlin(x,deriv=False):
04.    if(deriv==True):
05.        return x*(1-x)
06. 
07.    return 1/(1+np.exp(-x))
08. 
09.X = np.array([[0,0,1],
10.              [0,1,1],
11.              [1,0,1],
12.              [1,1,1]])
13. 
14.y = np.array([[0],
15.              [1],
16.              [1],
17.              [0]])
18. 
19.np.random.seed(1)
20. 
21.# randomly initialize our weights with mean 0
22.syn0 = 2*np.random.random((3,4)) - 1
23.syn1 = 2*np.random.random((4,1)) - 1
24. 
25.for j in xrange(60000):
26. 
27.    # Feed forward through layers 0, 1, and 2
28.    l0 = X
29.    l1 = nonlin(np.dot(l0,syn0))
30.    l2 = nonlin(np.dot(l1,syn1))
31. 
32.    # how much did we miss the target value?
33.    l2_error = y - l2
34. 
35.    if (j % 10000) == 0:
36.        print "Error:" + str(np.mean(np.abs(l2_error)))
37. 
38.    # in what direction is the target value?
39.    # were we really sure? if so, don't change too much.
40.    l2_delta = l2_error*nonlin(l2,deriv=True)
41. 
42.    # how much did each l1 value contribute to the l2 error (according to the weights)?
43.    l1_error = l2_delta.dot(syn1.T)
44. 
45.    # in what direction is the target l1?
46.    # were we really sure? if so, don't change too much.
47.    l1_delta = l1_error * nonlin(l1,deriv=True)
48. 
49.    syn1 += l1.T.dot(l2_delta)
50.    syn0 += l0.T.dot(l1_delta)

Variable   Definition
X          Input dataset matrix where each row is a training example
y          Output dataset matrix where each row is a training example
l0         First Layer of the Network, specified by the input data
l1         Second Layer of the Network, otherwise known as the hidden layer
l2         Final Layer of the Network, which is our hypothesis, and should approximate the correct answer as we train
syn0       First layer of weights, Synapse 0, connecting l0 to l1.
syn1       Second layer of weights, Synapse 1, connecting l1 to l2.
l2_error   This is the amount that the neural network “missed”.
l2_delta   This is the error of the network scaled by the confidence. It’s almost identical to the error except that very confident errors are muted.
l1_error   Weighting l2_delta by the weights in syn1, we can calculate the error in the middle/hidden layer.
l1_delta   This is the l1 error of the network scaled by the confidence. Again, it’s almost identical to l1_error except that confident errors are muted.

Recommendation: open this blog in two screens so you can see the code while you read it. That’s kinda what I did while I wrote it. 🙂

Everything should look very familiar! It’s really just two of the previous implementations stacked on top of each other. The output of the first layer (l1) is the input to the second layer. The only new thing happening here is on line 43.

Line 43 uses the “confidence weighted error” from l2 to establish an error for l1. To do this, it simply sends the error across the weights from l2 to l1. This gives what you could call a “contribution weighted error” because we learn how much each node value in l1 “contributed” to the error in l2. This step is called “backpropagating” and is the namesake of the algorithm. We then update syn0 using the same steps we did in the 2 layer implementation.
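
If the shapes on line 43 feel mysterious, here’s a small sketch (with random placeholder values) that walks through them:

import numpy as np

l2_delta = np.random.random((4,1))   # one confidence-weighted error per example
syn1     = np.random.random((4,1))   # weights from 4 hidden nodes to 1 output

l1_error = l2_delta.dot(syn1.T)      # (4,1) dot (1,4) = (4,4)
print l1_error.shape
# one error value per hidden node, per training example: each hidden
# node's error is the output error scaled by the weight through which
# that node reached the output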



Part 3: Conclusion and Future Work

 

My Recommendation:

If you’re serious about neural networks, I have one recommendation. Try to rebuild this network from memory. I know that might sound a bit crazy, but it seriously helps. If you want to be able to create arbitrary architectures based on new academic papers or read and understand sample code for these different architectures, I think that it’s a killer exercise. I think it’s useful even if you’re using frameworks like Torch, Caffe, or Theano. I worked with neural networks for a couple years before performing this exercise, and it was the best investment of time I’ve made in the field (and it didn’t take long).

Future Work

This toy example still needs quite a few bells and whistles to really approach the state-of-the-art architectures. Here’s a few things you can look into if you want to further improve your network. (Perhaps I will in a followup post.)

• Alpha
• Bias Units
• Mini-Batches
• Delta Trimming
• Parameterized Layer Sizes
• Regularization
• Dropout
• Momentum
• Batch Normalization
• GPU Compatibility
• Other Awesomeness You Implement


Want to Work in Machine Learning?

One of the best things you can do to learn Machine Learning is to have a job where you’re practicing Machine Learning professionally. I’d encourage you to check out the positions at Digital Reasoning in your job hunt. If you have questions about any of the positions or about life at Digital Reasoning, feel free to send me a message on my LinkedIn. I’m happy to hear about where you want to go in life, and help you evaluate whether Digital Reasoning could be a good fit.

If none of the positions above feels like a good fit, continue your search! Machine Learning expertise is one of the most valuable skills in the job market today, and there are many firms looking for practitioners.

 

Summary: I learn best with toy code that I can play with. This tutorial teaches gradient descent via a very simple toy example, a short python implementation. 

Followup Post: I intend to write a followup post to this one adding popular features leveraged by state-of-the-art approaches (likely Dropout, DropConnect, and Momentum). I’ll tweet it out when it’s complete @iamtrask. Feel free to follow if you’d be interested in reading more and thanks for all the feedback!

Just Give Me The Code:

01.import numpy as np
02.X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
03.y = np.array([[0,1,1,0]]).T
04.alpha,hidden_dim = (0.5,4)
05.synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
06.synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
07.for j in xrange(60000):
08.    layer_1 = 1/(1+np.exp(-(np.dot(X,synapse_0))))
09.    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
10.    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
11.    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1*(1-layer_1))
12.    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
13.    synapse_0 -= (alpha * X.T.dot(layer_1_delta))


Part 1: Optimization

In Part 1, I laid out the basis for backpropagation in a simple neural network. Backpropagation allowed us to measure how each weight in the network contributed to the overall error. This ultimately allowed us to change these weights using a different algorithm, Gradient Descent.

The takeaway here is that backpropagation doesn’t optimize! It moves the error information from the end of the network to all the weights inside the network so that a different algorithm can optimize those weights to fit our data. We actually have a plethora of different nonlinear optimization methods that we could use with backpropagation:

A Few Optimization Methods:
• Annealing
• Stochastic Gradient Descent
• AW-SGD (new!)
• Momentum (SGD)
• Nesterov Momentum (SGD)
• AdaGrad
• AdaDelta
• ADAM
• BFGS
• LBFGS 

Visualizing the Difference:
• ConvNet.js
• RobertsDionne

Many of these optimizations are good for different purposes, and in some cases several can be used together. In this tutorial, we will walk through Gradient Descent, which is arguably the simplest and most widely used neural network optimization algorithm. By learning about Gradient Descent, we will then be able to improve our toy neural network through parameterization and tuning, and ultimately make it a lot more powerful.



Part 2: Gradient Descent

Imagine that you have a red ball inside of a rounded bucket, and that the red ball is trying to find the bottom of the bucket. This is optimization. In our case, the ball is optimizing its position (from left to right) to find the lowest point in the bucket.

(pause here…. make sure you got that last sentence…. got it?)

So, to gamify this a bit: the ball has two options, left or right. It has one goal: get as low as possible. So, it needs to press the left and right buttons correctly to find the lowest spot.

So, what information does the ball use to adjust its position to find the lowest point? The only information it has is the slope of the side of the bucket at its current position. Notice that when the slope is negative (downward from left to right), the ball should move to the right. However, when the slope is positive, the ball should move to the left. As you can see, this is more than enough information to find the bottom of the bucket in a few iterations. This is a sub-field of optimization called gradient optimization. (“Gradient” is just a fancy word for slope or steepness.)

Oversimplified Gradient Descent:

  • Calculate slope at current position
  • If slope is negative, move right
  • If slope is positive, move left
  • (Repeat until slope == 0)

The question is, however, how much should the ball move at each time step? Look at the bucket again. The steeper the slope, the farther the ball is from the bottom. That’s helpful! Let’s improve our algorithm to leverage this new information. Also, let’s assume that the bucket is on an (x,y) plane, so its location is x (along the bottom). Increasing the ball’s “x” position moves it to the right. Decreasing the ball’s “x” position moves it to the left.

Naive Gradient Descent:

  • Calculate “slope” at current “x” position
  • Change x by the negative of the slope. (x = x - slope)
  • (Repeat until slope == 0)

Make sure you can picture this process in your head before moving on. This is a considerable improvement to our algorithm. For very positive slopes, we move left by a lot. For only slightly positive slopes, we move left by only a little. As it gets closer and closer to the bottom, it takes smaller and smaller steps until the slope equals zero, at which point it stops. This stopping point is called convergence.
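
Here’s that pseudocode as runnable Python (a toy sketch: the “bucket” is my own shallow parabola f(x) = (x**2)/4, whose slope is x/2):

# naive gradient descent on the bucket f(x) = (x**2)/4
x = 3.0
for i in xrange(10):
    slope = x/2        # calculate "slope" at current "x" position
    x = x - slope      # change x by the negative of the slope
    print x            # 1.5, 0.75, 0.375, ... converging toward 0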



Part 3: Sometimes It Breaks

Gradient Descent isn’t perfect. Let’s take a look at its issues and how people get around them. This will allow us to improve our network to overcome these issues.

Problem 1: When slopes are too big

How big is too big? Remember, our step size is based on the steepness of the slope. Sometimes the slope is so steep that we overshoot by a lot. Overshooting by a little is ok, but sometimes we overshoot by so much that we’re even farther away than we started!

What makes this problem so destructive is that overshooting this far means we land at an EVEN STEEPER slope in the opposite direction. This causes us to overshoot again EVEN FARTHER. This vicious cycle of overshooting leading to more overshooting is called divergence.

Solution 1: Make Slopes Smaller

Lol. This may seem too simple to be true, but it’s used in pretty much every neural network. If our gradients are too big, we make them smaller! We do this by multiplying them (all of them) by a single number between 0 and 1 (such as 0.01). This fraction is typically a single float called alpha. When we do this, we don’t overshoot and our network converges.

Improved Gradient Descent:

alpha = 0.1 (or some number between 0 and 1)

  • Calculate “slope” at current “x” position
  • x = x - (alpha*slope)
  • (Repeat until slope == 0)
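
A sketch of the difference alpha makes (again on a toy bucket of my own choosing, this time a steep one: f(x) = 10*x**2, whose slope is 20*x):

# the same descent with and without a small alpha
for alpha in [1.0, 0.01]:
    x = 1.0
    for i in xrange(10):
        slope = 20*x
        x = x - alpha*slope
    print alpha, x
# alpha = 1.0  leaves |x| enormous (each step overshoots farther: divergence)
# alpha = 0.01 shrinks x toward 0 (convergence)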

Problem 2: Local Minimums

Sometimes your bucket has a funny shape, and following the slope doesn’t take you to the absolute lowest point.

This is by far the most difficult problem with gradient descent. There are a myriad of options to try to overcome this. Generally speaking, they all involve an element of random searching to try lots of different parts of the bucket. 

Solution 2: Multiple Random Starting States

There are a myriad of ways in which randomness is used to overcome getting stuck in a local minimum. It begs the question: if we have to use randomness to find the global minimum, why are we still optimizing in the first place? Why not just try randomly? The answer lies in the following thought experiment.

Imagine a line with five local minima, and that we randomly placed 100 balls on it and started optimizing all of them. If we did so, they would all end up in only 5 positions: the five minima. Each minimum has a “domain”: any ball that randomly falls within a minimum’s domain will converge to that minimum. This means that to search the entire space, we only have to randomly find 5 domains! This is far better than pure random searching, which has to randomly try EVERY space (which could easily be millions of places on the line depending on the granularity).
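
Here’s a sketch of that idea (the bumpy function is my own stand-in for that line): drop several balls at random starting points, run gradient descent on each, and see how few distinct minima they land in.

import numpy as np

def slope(x):                    # derivative of the bumpy bucket
    return 3*np.cos(3*x) + 0.2*x # f(x) = sin(3x) + 0.1*x**2

np.random.seed(1)
finishes = []
for start in np.random.uniform(-5,5,10):   # 10 randomly placed balls
    x = start
    for i in xrange(200):
        x = x - 0.01*slope(x)              # gradient descent per ball
    finishes.append(round(x,2))

print sorted(set(finishes))
# the 10 balls collapse onto just a handful of distinct local minima;
# keeping the best finisher is "multiple random starting states"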

In Neural Networks: One way that neural networks accomplish this is by having very large hidden layers. You see, each hidden node in a layer starts out in a different random starting state. This allows each hidden node to converge to different patterns in the network. Parameterizing this size allows the neural network user to potentially try thousands (or tens of billions) of different local minima in a single neural network.

Sidenote 1: This is why neural networks are so powerful! They have the ability to search far more of the space than they actually compute! We can search the entire black line above with (in theory) only 5 balls and a handful of iterations. Searching that same space in a brute force fashion could easily take orders of magnitude more computation.

Sidenote 2: A close eye might ask, “Well, why would we allow a lot of nodes to converge to the same spot? That’s actually wasting computational power!” That’s an excellent point. The current state-of-the-art approaches to avoiding hidden nodes coming up with the same answer (by searching the same space) are Dropout and Drop-Connect, which I intend to cover in a later post.

Problem 3: When Slopes are Too Small

Neural networks sometimes suffer from the slopes being too small. The solution is also obvious, but I wanted to mention it here to expand on its symptoms. Imagine a ball that starts in a tiny dip on the side of a much larger bucket.

The little red ball is just stuck! If your alpha is too small, this can happen. The ball drops right into an instant local minimum and ignores the big picture. It doesn’t have the umph to get out of the rut.

And perhaps the more obvious symptom of deltas that are too small is that convergence will just take a very, very long time.

Solution 3: Increase the Alpha

As you might expect, the solution to both of these symptoms is to increase the alpha. We might even multiply our deltas by a weight higher than 1. This is very rare, but it does sometimes happen.

Part 4: SGD in Neural Networks

So at this point you might be wondering, how does this relate to neural networks and backpropagation? This is the hardest part, so get ready to hold on tight and take things slow. It’s also quite important.

Picture a big, nasty curve of error. In a neural network, we’re trying to minimize the error with respect to the weights. So, what that curve represents is the network’s error relative to the position of a single weight. If we computed the network’s error for every possible value of a single weight, the result would be exactly such a curve. We would then pick the value of the single weight that has the lowest error (the lowest part of the curve). I say single weight because it’s a two-dimensional plot: the x dimension is the value of the weight, and the y dimension is the neural network’s error when the weight is at that position.

Stop and make sure you got that last paragraph. It’s key. 
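
To make this concrete, here’s a sketch that traces such a curve for the tiniest possible network (one input, one weight, my own toy data):

import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

X = np.array([[0.5],[1.0]])   # two training examples, one input each
y = np.array([[1],[1]])       # both targets are 1

# sweep the single weight and print the network error at each value
for w in np.arange(-10,11,2.5):
    error = np.mean(np.abs(y - sigmoid(X*w)))
    print w, error
# this printout is the 2-D curve: x is the weight's value,
# y is the network's error at that value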

Let’s take a look at what this process looks like in a simple 2 layer neural network.

2 Layer Neural Network:

01.import numpy as np
02. 
03.# compute sigmoid nonlinearity
04.def sigmoid(x):
05.    output = 1/(1+np.exp(-x))
06.    return output
07. 
08.# convert output of sigmoid function to its derivative
09.def sigmoid_output_to_derivative(output):
10.    return output*(1-output)
11. 
12.# input dataset
13.X = np.array([ [0,1],
14.               [0,1],
15.               [1,0],
16.               [1,0] ])
17. 
18.# output dataset
19.y = np.array([[0,0,1,1]]).T
20. 
21.# seed random numbers to make calculation
22.# deterministic (just a good practice)
23.np.random.seed(1)
24. 
25.# initialize weights randomly with mean 0
26.synapse_0 = 2*np.random.random((2,1)) - 1
27. 
28.for iter in xrange(10000):
29. 
30.    # forward propagation
31.    layer_0 = X
32.    layer_1 = sigmoid(np.dot(layer_0,synapse_0))
33. 
34.    # how much did we miss?
35.    layer_1_error = layer_1 - y
36. 
37.    # multiply how much we missed by the
38.    # slope of the sigmoid at the values in l1
39.    layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
40.    synapse_0_derivative = np.dot(layer_0.T,layer_1_delta)
41. 
42.    # update weights
43.    synapse_0 -= synapse_0_derivative
44. 
45.print "Output After Training:"
46.print layer_1

So, in this case, we have a single error at the output (single value), which is computed on line 35. Since we have 2 weights, the output “error plane” is a 3 dimensional space. We can think of this as an (x,y,z) space, where vertical (z) is the error, and x and y are the values of our two weights in synapse_0.

Let’s try to work out what the error plane looks like for the network/dataset above. So, how do we compute the error for a given set of weights? Lines 31, 32, and 35 show us that. If we take that logic and compute the overall error (a single scalar representing the network error over the entire dataset) for every possible set of weights (from -10 to 10 for x and y), we can trace out the whole error plane.

Don’t be intimidated by this. It really is as simple as computing every possible set of weights, and the error that the network generates at each set. x is the first synapse_0 weight and y is the second synapse_0 weight. z is the overall error. As you can see, our output data is positively correlated with the first input column. Thus, the error is minimized when x (the first synapse_0 weight) is high. What about the second synapse_0 weight? What value makes it optimal?
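
A sketch of how you’d compute that plane yourself (coarse grid, text output instead of a 3-D plot):

import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

X = np.array([[0,1],[0,1],[1,0],[1,0]])
y = np.array([[0],[0],[1],[1]])

# overall error at every (x, y) weight pair on a coarse grid
for w0 in np.arange(-10,11,5):
    row = []
    for w1 in np.arange(-10,11,5):
        weights = np.array([[w0],[w1]])
        row.append(round(np.mean(np.abs(y - sigmoid(np.dot(X,weights)))),2))
    print row
# each printed row is one slice of the error plane; the error bottoms
# out where w0 (the first weight) is large and w1 is very negative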

How Our 2 Layer Neural Network Optimizes

So, given that lines 31, 32, and 35 compute the error, it’s natural to see that lines 39, 40, and 43 optimize to reduce the error. This is where Gradient Descent is happening! Remember our pseudocode?

Naive Gradient Descent:

  • Lines 39 and 40: Calculate “slope” at current “x” position
  • Line 43: Change x by the negative of the slope. (x = x - slope)
  • Line 28: (Repeat until slope == 0)

It’s exactly the same thing! The only thing that has changed is that we have 2 weights that we’re optimizing instead of just 1. The logic, however, is identical.

Part 5: Improving our Neural Network

Remember that Gradient Descent had some weaknesses. Now that we have seen how our neural network leverages Gradient Descent, we can improve our network to overcome these weaknesses in the same way that we improved Gradient Descent in Part 3 (the 3 problems and solutions).

Improvement 1: Adding and Tuning the Alpha Parameter

What is Alpha? As described above, the alpha parameter reduces the size of each iteration’s update in the simplest way possible. At the very last minute, right before we update the weights, we multiply the weight update by alpha (usually a number between 0 and 1, thus reducing the size of the weight update). This tiny change to the code has an absolutely massive impact on its ability to train.

We’re going to jump back to our 3 layer neural network from the first post and add in an alpha parameter at the appropriate place. Then, we’re going to run a series of experiments to align all the intuition we developed around alpha with its behavior in live code.

Improved Gradient Descent:

  • Calculate “slope” at current “x” position
  • Lines 56 and 57: Change x by the negative of the slope scaled by alpha. (x = x - (alpha*slope))
  • (Repeat until slope == 0)
01.import numpy as np
02. 
03.alphas = [0.001,0.01,0.1,1,10,100,1000]
04. 
05.# compute sigmoid nonlinearity
06.def sigmoid(x):
07.    output = 1/(1+np.exp(-x))
08.    return output
09. 
10.# convert output of sigmoid function to its derivative
11.def sigmoid_output_to_derivative(output):
12.    return output*(1-output)
13. 
14.X = np.array([[0,0,1],
15.              [0,1,1],
16.              [1,0,1],
17.              [1,1,1]])
18. 
19.y = np.array([[0],
20.              [1],
21.              [1],
22.              [0]])
23. 
24.for alpha in alphas:
25.    print "\nTraining With Alpha:" + str(alpha)
26.    np.random.seed(1)
27. 
28.    # randomly initialize our weights with mean 0
29.    synapse_0 = 2*np.random.random((3,4)) - 1
30.    synapse_1 = 2*np.random.random((4,1)) - 1
31. 
32.    for j in xrange(60000):
33. 
34.        # Feed forward through layers 0, 1, and 2
35.        layer_0 = X
36.        layer_1 = sigmoid(np.dot(layer_0,synapse_0))
37.        layer_2 = sigmoid(np.dot(layer_1,synapse_1))
38. 
39.        # how much did we miss the target value?
40.        layer_2_error = layer_2 - y
41. 
42.        if (j % 10000) == 0:
43.            print "Error after "+str(j)+" iterations:" + str(np.mean(np.abs(layer_2_error)))
44. 
45.        # in what direction is the target value?
46.        # were we really sure? if so, don't change too much.
47.        layer_2_delta = layer_2_error*sigmoid_output_to_derivative(layer_2)
48. 
49.        # how much did each l1 value contribute to the l2 error (according to the weights)?
50.        layer_1_error = layer_2_delta.dot(synapse_1.T)
51. 
52.        # in what direction is the target l1?
53.        # were we really sure? if so, don't change too much.
54.        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
55. 
56.        synapse_1 -= alpha * (layer_1.T.dot(layer_2_delta))
57.        synapse_0 -= alpha * (layer_0.T.dot(layer_1_delta))

So, what did we observe with the different alpha sizes?

Alpha = 0.001
The network with a crazy small alpha hardly converged at all! This is because we made the weight updates so small that they hardly changed anything, even after 60,000 iterations! This is textbook Problem 3: When Slopes Are Too Small.

Alpha = 0.01
This alpha made a rather pretty convergence. It was quite smooth over the course of the 60,000 iterations but ultimately didn’t converge as far as some of the others. This is still textbook Problem 3: When Slopes Are Too Small.

Alpha = 0.1
This alpha made some progress very quickly but then slowed down a bit. This is still Problem 3. We need to increase alpha some more.

Alpha = 1
As a clever eye might suspect, this had exactly the same convergence as if we had no alpha at all! Multiplying our weight updates by 1 doesn’t change anything. 🙂

Alpha = 10
Perhaps you were surprised that an alpha that was greater than 1 achieved the best score after only 10,000 iterations! This tells us that our weight updates were being too conservative with smaller alphas. With the smaller alpha parameters (less than 10), the network’s weights were generally headed in the right direction; they just needed to hurry up and get there!

Alpha = 100
Now we can see that taking steps that are too large can be very counterproductive. The network’s steps are so large that it can’t find a reasonable low point in the error plane. This is textbook Problem 1. The alpha is too big, so the network just jumps around on the error plane and never “settles” into a local minimum.

Alpha = 1000
And with an extremely large alpha, we see a textbook example of divergence, with the error increasing instead of decreasing… hardlining at 0.5. This is a more extreme version of Problem 1, where the network overcorrects wildly and ends up very far away from any local minimum.

Let’s Take a Closer Look

01.import numpy as np
02. 
03.alphas = [0.001,0.01,0.1,1,10,100,1000]
04. 
05.# compute sigmoid nonlinearity
06.def sigmoid(x):
07.    output = 1/(1+np.exp(-x))
08.    return output
09. 
10.# convert output of sigmoid function to its derivative
11.def sigmoid_output_to_derivative(output):
12.    return output*(1-output)
13. 
14.X = np.array([[0,0,1],
15.              [0,1,1],
16.              [1,0,1],
17.              [1,1,1]])
18. 
19.y = np.array([[0],
20.              [1],
21.              [1],
22.              [0]])
23. 
24. 
25. 
26.for alpha in alphas:
27.    print "\nTraining With Alpha:" + str(alpha)
28.    np.random.seed(1)
29. 
30.    # randomly initialize our weights with mean 0
31.    synapse_0 = 2*np.random.random((3,4)) - 1
32.    synapse_1 = 2*np.random.random((4,1)) - 1
33. 
34.    prev_synapse_0_weight_update = np.zeros_like(synapse_0)
35.    prev_synapse_1_weight_update = np.zeros_like(synapse_1)
36. 
37.    synapse_0_direction_count = np.zeros_like(synapse_0)
38.    synapse_1_direction_count = np.zeros_like(synapse_1)
39. 
40.    for j in xrange(60000):
41. 
42.        # Feed forward through layers 0, 1, and 2
43.        layer_0 = X
44.        layer_1 = sigmoid(np.dot(layer_0,synapse_0))
45.        layer_2 = sigmoid(np.dot(layer_1,synapse_1))
46. 
47.        # how much did we miss the target value?
48.        layer_2_error = y - layer_2
49. 
50.        if (j % 10000) == 0:
51.            print "Error:" + str(np.mean(np.abs(layer_2_error)))
52. 
53.        # in what direction is the target value?
54.        # were we really sure? if so, don't change too much.
55.        layer_2_delta = layer_2_error*sigmoid_output_to_derivative(layer_2)
56. 
57.        # how much did each l1 value contribute to the l2 error (according to the weights)?
58.        layer_1_error = layer_2_delta.dot(synapse_1.T)
59. 
60.        # in what direction is the target l1?
61.        # were we really sure? if so, don't change too much.
62.        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
63. 
64.        synapse_1_weight_update = (layer_1.T.dot(layer_2_delta))
65.        synapse_0_weight_update = (layer_0.T.dot(layer_1_delta))
66. 
67.        if(j > 0):
68.            synapse_0_direction_count += np.abs(((synapse_0_weight_update > 0)+0) - ((prev_synapse_0_weight_update > 0)+0))
69.            synapse_1_direction_count += np.abs(((synapse_1_weight_update > 0)+0) - ((prev_synapse_1_weight_update > 0)+0))
70. 
71.        synapse_1 += alpha * synapse_1_weight_update
72.        synapse_0 += alpha * synapse_0_weight_update
73. 
74.        prev_synapse_0_weight_update = synapse_0_weight_update
75.        prev_synapse_1_weight_update = synapse_1_weight_update
76. 
77.    print "Synapse 0"
78.    print synapse_0
79. 
80.    print "Synapse 0 Update Direction Changes"
81.    print synapse_0_direction_count
82. 
83.    print "Synapse 1"
84.    print synapse_1
85. 
86.    print "Synapse 1 Update Direction Changes"
87.    print synapse_1_direction_count

What I did in the above code was count the number of times a derivative changed direction. That’s the “Update Direction Changes” readout at the end of training. If a slope (derivative) changes direction, it means that it passed OVER the local minimum and needs to go back. If it never changes direction, it means that it probably didn’t go far enough.

A Few Takeaways:

  • When the alpha was tiny, the derivatives almost never changed direction.
  • When the alpha was optimal, the derivative changed directions a TON.
  • When the alpha was huge, the derivative changed directions a medium amount.
  • When the alpha was tiny, the weights ended up being reasonably small too.
  • When the alpha was huge, the weights got huge too!

Improvement 2: Parameterizing the Size of the Hidden Layer

Being able to increase the size of the hidden layer increases the amount of the search space that we explore in each iteration. Consider the following network and its output.

01.import numpy as np
02. 
03.alphas = [0.001,0.01,0.1,1,10,100,1000]
04.hiddenSize = 32
05. 
06.# compute sigmoid nonlinearity
07.def sigmoid(x):
08.    output = 1/(1+np.exp(-x))
09.    return output
10. 
11.# convert output of sigmoid function to its derivative
12.def sigmoid_output_to_derivative(output):
13.    return output*(1-output)
14. 
15.X = np.array([[0,0,1],
16.              [0,1,1],
17.              [1,0,1],
18.              [1,1,1]])
19. 
20.y = np.array([[0],
21.              [1],
22.              [1],
23.              [0]])
24. 
25.for alpha in alphas:
26.    print "\nTraining With Alpha:" + str(alpha)
27.    np.random.seed(1)
28. 
29.    # randomly initialize our weights with mean 0
30.    synapse_0 = 2*np.random.random((3,hiddenSize)) - 1
31.    synapse_1 = 2*np.random.random((hiddenSize,1)) - 1
32. 
33.    for j in xrange(60000):
34. 
35.        # Feed forward through layers 0, 1, and 2
36.        layer_0 = X
37.        layer_1 = sigmoid(np.dot(layer_0,synapse_0))
38.        layer_2 = sigmoid(np.dot(layer_1,synapse_1))
39. 
40.        # how much did we miss the target value?
41.        layer_2_error = layer_2 - y
42. 
43.        if (j % 10000) == 0:
44.            print "Error after "+str(j)+" iterations:" + str(np.mean(np.abs(layer_2_error)))
45. 
46.        # in what direction is the target value?
47.        # were we really sure? if so, don't change too much.
48.        layer_2_delta = layer_2_error*sigmoid_output_to_derivative(layer_2)
49. 
50.        # how much did each l1 value contribute to the l2 error (according to the weights)?
51.        layer_1_error = layer_2_delta.dot(synapse_1.T)
52. 
53.        # in what direction is the target l1?
54.        # were we really sure? if so, don't change too much.
55.        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
56. 
57.        synapse_1 -= alpha * (layer_1.T.dot(layer_2_delta))
58.        synapse_0 -= alpha * (layer_0.T.dot(layer_1_delta))

Notice that the best error with 32 hidden nodes is 0.0009, whereas the best error with 4 hidden nodes was 0.0013. This might not seem like much, but it’s an important lesson. We do not need any more than 3 nodes to represent this dataset. However, because we had more nodes when we started, we searched more of the space in each iteration and ultimately converged faster. Even though the difference is marginal in this toy problem, this effect plays a huge role when modeling very complex datasets.

Part 6: Conclusion and Future Work

 

My Recommendation:

If you’re serious about neural networks, I have one recommendation. Try to rebuild this network from memory. I know that might sound a bit crazy, but it seriously helps. If you want to be able to create arbitrary architectures based on new academic papers or read and understand sample code for these different architectures, I think that it’s a killer exercise. I think it’s useful even if you’re using frameworks like Torch, Caffe, or Theano. I worked with neural networks for a couple years before performing this exercise, and it was the best investment of time I’ve made in the field (and it didn’t take long).

Future Work

This toy example still needs quite a few bells and whistles to really approach the state-of-the-art architectures. Here’s a few things you can look into if you want to further improve your network. (Perhaps I will in a followup post.)

• Bias Units
• Mini-Batches
• Delta Trimming 
• Parameterized Layer Sizes
• Regularization
• Dropout
• Momentum
• Batch Normalization 
• GPU Compatibility
• Other Awesomeness You Implement

