Neural network

I've created this page to help me figure out how artificial neural networks and "deep belief networks" work - starting with a "neural network for dummies" (simple explanation), then getting a tiny bit more detailed, and finishing with some links to more advanced resources on these topics. I'm very new to neural networks (far, far from being any type of expert), so you should probably take all of this information with a big grain of salt - these are just my personal notes as I try to figure it all out!

Neural network: biological versus artificial

Technically speaking there are two types of neural networks:

1. Biological neural networks - made up of real biological neurons connected via synapses in a nervous system, and
2. Artificial neural networks - composed of artificial neurons or nodes (programming constructs that try to mimic the properties of biological neurons) connected via the edges of a directed graph.

This article focuses on artificial neural networks (ANNs), a term usually shortened to just neural networks (NNs).

The Goals of Neural Networks

The typical reason for building and "training" a neural network is to help solve "artificial intelligence problems" such as image analysis, speech recognition, adaptive control and so on. A larger goal of neural networks, however, is to try to simulate the way the human brain works and learns. Since no one fully knows how the brain works, this requires a joint effort between engineering insights/guesses and scientific discovery, which many hope will help us (a) gain an understanding of how our own brain works and (b) create an artificial intelligence that can come close to or even surpass our own level of intelligence.

A Simple Example of Learning: "Grass Wet"

Before trying to understand neural networks, it's best to first consider a simple example. Determining whether we have "wet grass" is a commonly used example in computer science: we want to guess the probability of real world events by reporting observations to a machine and then asking it to predict a probability for a new set of input values. Let's pretend we want to know if a small patch of grass is wet, and we have two binary inputs: (is raining) and (sprinkler on). Our one binary output is: (grass wet). A user will enter many sets of inputs (let's say 365 time points, one per day over a year) with their corresponding output and, as you'd expect, the grass will usually be wet when either the sprinkler or rain is on, and in our ideal world the grass immediately becomes dry when both our inputs are off. How then can our computer learn this?

In such a simple example (two inputs, one output) the simplest solution is to connect the two inputs to our output and give them both a weighting of 1, such that if either is on the grass is wet (see: Figure 1a).
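A minimal sketch of this two-input setup in Python. The function name and the firing threshold of 1 are assumptions made for illustration; they are not taken from Figure 1a.

```python
def grass_wet(is_raining, sprinkler_on):
    """Two inputs wired straight to one output, both with a weight of 1."""
    weights = (1, 1)  # one weight per input, as described in the text
    total = weights[0] * is_raining + weights[1] * sprinkler_on
    # Assumed rule: the output switches on once the weighted sum reaches 1.
    return 1 if total >= 1 else 0


for rain in (0, 1):
    for sprinkler in (0, 1):
        print("rain:", rain, "sprinkler:", sprinkler, "-> wet:", grass_wet(rain, sprinkler))
```

With these weights the output is off only when both inputs are off.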

Let's, however, add another input "(umbrella up)" and pretend this umbrella usually stops the rain hitting our patch of grass, but not our ground level sprinklers. With only three weight values feeding straight into the output, it is hard to learn/observe the relationship that the umbrella nullifies the rain, so one way to solve this is to add at least one extra "layer" of two nodes; with the right weights we can adjust our system to develop weightings where the umbrella cancels the rain (see Figure 1b). The question is though, how can our computer (which starts with zero knowledge of these relationships) work out a good combination of values? How can a neural network "learn"?
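To make the extra layer concrete, here is a sketch with hand-picked weights. The weights, the threshold and the node names are hypothetical (they are not read off Figure 1b); the point is only that a middle layer of two nodes is enough to let the umbrella cancel the rain but not the sprinkler.

```python
def step(total, threshold=1):
    """Switch on (1) when the weighted sum reaches the threshold, else off (0)."""
    return 1 if total >= threshold else 0


def grass_wet(is_raining, sprinkler_on, umbrella_up):
    # Hidden node 1: rain that actually reaches the grass.
    # The negative umbrella weight (hypothetical value) cancels the rain.
    rain_hits_grass = step(1 * is_raining + (-1) * umbrella_up)
    # Hidden node 2: simply passes the sprinkler state along.
    sprinkler_hits_grass = step(1 * sprinkler_on)
    # Output node: wet if either hidden node is on.
    return step(1 * rain_hits_grass + 1 * sprinkler_hits_grass)
```

For example, `grass_wet(1, 0, 1)` (rain with the umbrella up) gives 0, while `grass_wet(0, 1, 1)` (sprinkler with the umbrella up) gives 1.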

How Neural Networks Work

Stochastic Nodes

While a brain consists of neurons (brain cells) and synapses (connections between cells), most neural networks contain a group of nodes with connections between them in the form of a directed graph. Each node typically has a single stochastic value - meaning its value can sporadically set itself to on or off (or perhaps some value in between) on each run, but the likelihood of either state is biased by the weight the node assigns to each of its inputs (Figure 2a). As the system "learns", the set of weights assigned to each input will be modified based on the accuracy of the output. The formula for calculating the probability that a node is on or off can take many forms, but the most common form is a "falloff gradient" shown in Figure 2b, where the weights of the active inputs get summed together and the lower this total weight, the less likely the node is to be on.
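A sketch of one such stochastic node. It assumes the "falloff gradient" is the logistic (sigmoid) curve, which is a common choice but may not be exactly what Figure 2b shows; the function names and the optional bias term are also assumptions.

```python
import math
import random


def on_probability(inputs, weights, bias=0.0):
    """Probability that the node switches on, given its inputs and weights."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Logistic curve: the lower the total, the closer the probability is to 0.
    return 1.0 / (1.0 + math.exp(-total))


def sample_node(inputs, weights, bias=0.0, rng=random):
    """One stochastic 'run': returns 1 (on) or 0 (off)."""
    return 1 if rng.random() < on_probability(inputs, weights, bias) else 0
```

A total of zero gives a 50/50 chance, so `on_probability([0, 0], [1.0, 1.0])` is 0.5, and larger totals push the node towards being on.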

Setting it up

The setup of a neural net can vary tremendously between different types of neural networks, but in most neural networks, nodes can be split up into three main "layers":

• Input Layer - the top layer where each node represents an input. (eg: rain, sprinkler, umbrella)
• Hidden Layer - the middle layer(s) where the real "work" is done, but we generally don't know what's going on (hence the word "hidden"). The hidden layer itself often consists of multiple layers feeding into each other. More layers typically mean a better result (once the system is fully trained), but also tend to increase training time dramatically.
• Output Layer - the bottom layer where each node represents an output. (eg: grass_wet)

These layers are pictured below in Figure 3. The number of input nodes and output nodes is typically determined by the number of inputs we have and the number of discrete output values we need to measure, but the number of hidden layers, the number of hidden nodes and how they are connected and interact is usually up to the computer scientist and how much processing time they have available. In the simplest implementation we may have a "feedforward artificial network" where nodes only have directed edges towards lower layers, and no edges within the same layer. At the start, all our weights can be set to the same value (say 0.5 for each connection if their destination node has two inputs) or to random values - for these notes it doesn't matter much at this stage (in practice random starting values are the more common choice). We now have a neural network set up, but in order to make our network learn we must first feed it "training data".
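A sketch of setting up a small feedforward network and passing one set of inputs down through it. The layer sizes, the random starting range and the sigmoid node formula are assumptions for illustration, and for simplicity each node here outputs its "on" probability rather than a sampled on/off value.

```python
import math
import random


def make_network(layer_sizes, seed=0):
    """Build one weight list per node, for every layer after the input layer."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def forward(network, inputs):
    """Feed values from the input layer down to the output layer."""
    values = list(inputs)
    for layer in network:
        values = [
            1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(node_weights, values))))
            for node_weights in layer
        ]
    return values


# 3 inputs (rain, sprinkler, umbrella), 2 hidden nodes, 1 output (grass_wet)
network = make_network([3, 2, 1])
print(forward(network, [1, 0, 1]))
```

Edges only ever run from one layer to the next one down, which is what makes this "feedforward".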

Training the Network

For a neural network to learn it must be given a (typically large) number of "training sets". Each training set typically consists of one set of input values and one set of desired outputs (for this input). Following our "grass wet" example, we have 365 training sets where each set may look like this (inputs:(sprinkler:1, rain:0), outputs:(grassWet:1)).
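As a sketch, the training sets could be stored as a plain list of input/output pairs. The observations below are randomly generated stand-ins (not real data) that follow the "ideal world" rule that the grass is wet whenever the sprinkler or the rain is on; the function name and the seed are assumptions.

```python
import random


def make_training_sets(count=365, seed=0):
    """Generate made-up training sets for the two-input wet grass example."""
    rng = random.Random(seed)
    training_sets = []
    for _ in range(count):
        sprinkler = rng.randint(0, 1)
        rain = rng.randint(0, 1)
        training_sets.append({
            "inputs": {"sprinkler": sprinkler, "rain": rain},
            # Ideal-world rule: wet if either input is on.
            "outputs": {"grassWet": 1 if (sprinkler or rain) else 0},
        })
    return training_sets


training_sets = make_training_sets()
print(len(training_sets), training_sets[0])
```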

For our first input set and desired output, several "runs" are conducted to test how often the desired output is randomly generated. If we (by luck) generate the right output most of the time, we might move onto the next input set... but if we generate the wrong output too often it's time to apply "back propagation". In this step our values at the bottom-most hidden layer may be randomly changed, one by one, using a derivative of the result to see if this improves the result to a satisfactory level. Changes that lead to an overall improvement will be kept. If the result is still not satisfactory, the values another level up will be changed, and this will keep repeating until we get something satisfactory.

Now that we have weightings that satisfy our first case, we add another case and there's a good chance that the output may be undesirable. What this means is that we have to readjust the weights again (as before), but what makes neural networks particularly expensive is that they also need to check any change against all previous input sets! Only if a weight change improves the overall accuracy does it get kept - if it only improves the result for the current set, but worsens the result for our other twenty input sets, then it's not worth keeping!
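The "keep a change only if it helps overall" loop described above can be sketched as below. Note this is a simple trial-and-error search that follows the wording of these notes; textbook back propagation instead computes the derivative of the error for every weight and nudges all weights along it. The single sigmoid node, the squared-error measure, the step size and all names are assumptions for illustration.

```python
import math
import random

# (rain, sprinkler) -> grass wet, for the two-input example
TRAINING_SETS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def predict(weights, inputs):
    """Probability of 'wet'; weights[0] is a bias, the rest are input weights."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1.0 / (1.0 + math.exp(-total))


def overall_error(weights, training_sets):
    """Mean squared error across ALL training sets, not just the newest one."""
    return sum((predict(weights, x) - y) ** 2 for x, y in training_sets) / len(training_sets)


def train(training_sets, steps=500, seed=0):
    rng = random.Random(seed)
    weights = [0.0, 0.0, 0.0]
    best = overall_error(weights, training_sets)
    for _ in range(steps):
        candidate = list(weights)
        # Randomly nudge one weight at a time.
        candidate[rng.randrange(len(candidate))] += rng.gauss(0.0, 0.5)
        error = overall_error(candidate, training_sets)
        if error < best:  # keep the change only if the overall result improves
            weights, best = candidate, error
    return weights, best


weights, error = train(TRAINING_SETS)
print(weights, error)
```

Because every candidate change is scored against every training set, the cost of each step grows with the amount of training data, which is the expense the paragraph above is pointing at.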

Given enough time, we're likely to generate a set of weightings similar to that shown in Figure 1b, but not necessarily the same! Interestingly there may be more than one good solution: here we could swap the two middle nodes, but in large examples with hundreds of hidden nodes and several hidden layers there can be a huge number of permutations which represent "good solutions". Because of its stochastic nature, the same neural network run again on the same input data will almost certainly yield a different set of weights inside our hidden layer each time.

Testing the Network on New Data

Once the neural network is trained on a fixed number of training sets, the real test comes by seeing how it performs on "new input data" without showing it the correct output. Measuring performance is then a pretty easy case of feeding in many new inputs and calculating the percentage of the time it gives the correct output for those values.
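A sketch of that measurement. Here `predict` stands for any trained network wrapped as a function from inputs to an output; the two toy predictors and the tiny data set are made up purely to show the calculation.

```python
def percent_correct(predict, test_sets):
    """Percentage of test inputs for which the prediction matches the known answer."""
    correct = sum(1 for inputs, expected in test_sets if predict(inputs) == expected)
    return 100.0 * correct / len(test_sets)


# Made-up "new data": (rain, sprinkler) -> grass wet
NEW_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def always_wet(inputs):
    return 1


def either_on(inputs):
    return 1 if (inputs[0] or inputs[1]) else 0


print(percent_correct(always_wet, NEW_DATA))  # 75.0
print(percent_correct(either_on, NEW_DATA))   # 100.0
```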

In a way each new training input is a test the moment it's first run. If you feed the neural network one of the training set inputs a second time, it can (and should) yield a near perfect result (since it has already been adjusted for that input/output combination)... hence the real test comes from seeing how it performs on data it has never seen before and testing how versatile it is. For our network to do "real work" we want to use it to predict the outcome for inputs we haven't already labelled with the correct solution. In our basic "wet grass" example there are only four unique input sets (2^2), or eight unique input sets (2^3) if we include the umbrella - meaning we can pretty easily train it for each unique combination... but if suddenly we have 40 binary inputs we have over 1 trillion (2^40) unique input combinations. Let's consider some more complex examples where we might have many inputs.
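The number of unique input combinations is just 2 raised to the number of binary inputs, which is quick to check (the function name is only illustrative):

```python
def unique_input_count(binary_inputs):
    """Each binary input doubles the number of possible input combinations."""
    return 2 ** binary_inputs


for n in (2, 3, 40):
    print(n, "binary inputs ->", unique_input_count(n), "combinations")
# 2 -> 4, 3 -> 8, 40 -> 1099511627776 (a little over one trillion)
```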

A Few Realistic Examples

Real World Observation: Revisiting our Wet Grass

In our first "wet grass" example (two inputs, one output) we did away with the hidden layer completely, and after enough time you'd expect the two weights (Figure 1a) to heavily bias the output towards the grass being wet if either of our inputs is on. If we then feed it two inputs and average a few runs (let's say 100), it should tell us a pretty accurate probability of wet grass based on the input states.
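Averaging over runs might look like the sketch below. The "trained" weights and bias are hypothetical values chosen so that either input strongly favours wet grass, and the sigmoid node formula is an assumption as well.

```python
import math
import random


def estimate_wet_probability(is_raining, sprinkler_on, runs=100, seed=0):
    """Run the stochastic output node many times and average the on/off results."""
    weights = (8.0, 8.0)  # hypothetical trained weights
    bias = -4.0           # hypothetical bias: keeps the node off when both inputs are off
    rng = random.Random(seed)
    total = weights[0] * is_raining + weights[1] * sprinkler_on + bias
    p_on = 1.0 / (1.0 + math.exp(-total))
    hits = sum(1 for _ in range(runs) if rng.random() < p_on)
    return hits / runs


print(estimate_wet_probability(0, 0))  # close to 0
print(estimate_wet_probability(1, 0))  # close to 1
```

More runs give a smoother estimate, at the cost of more computation.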

Notice that in the real world, however, outputs are rarely 100% predictable like this. We can easily think of rare events which may cause wet grass when our inputs are off (eg: a flood, or a dog peed on it), or dry grass despite sprinklers and/or rain (it may be very light rain, or the same random dog is lying over our grass protecting it). The real world is full of noise. More importantly: rainwater doesn't evaporate instantly in the real world, so you'd want many, many more inputs for a better result. Just a few examples: (sun up), (rained in last hour) and even (grass wet one minute ago). For each extra input you will usually want extra hidden nodes and/or layers to observe these exponentially increasing extra relationships.

In the real world, "wet grass" is not as simple as it first appeared! Given enough historical input (eg: monthly rainfall over the last twenty years, plus the last few days), a neural network can perhaps guess the weather (rain or shine) for any given day... but weather is a particularly complex system, and so even the best neural networks and meteorologists are pretty lousy at predicting the "week's weather ahead".

Image Classification Example: "Circle or Line"

Figure 4: An example of image processing to generate a label

A big task on the internet is classifying and labeling images. This can be done by humans very quickly, but there are more images on the internet than people who wish to label them.

To make this bit simpler, let's pretend we have a bunch of greyscale images which are 16x16 pixels and show pictures of either circles, lines or nothing at all. These images may be fairly noisy, but given enough examples and hidden nodes our neural network should be able to differentiate between these relatively distinct shapes.

In image processing, the input can merely be each pixel, and rather than a binary value (like our rain etc) we probably want floating point values to represent the intensity. The order in which pixels are mapped to input nodes doesn't matter - so long as it's consistent for each 16x16 image - and the NN should eventually work out how pixels are related. When a pixel is on, its neighbors are probably more likely to be on, so it learns this way. No matter how many dimensions our image has (1D, 2D, 3D, 4D, etc), the order doesn't matter... although if we have a huge number of pixels and higher resolution than we need, it can be helpful to reduce the image in size so we have fewer pixels.
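A sketch of turning a 16x16 greyscale image into 256 floating point inputs. It assumes the image is a list of rows with intensities from 0 to 255 and uses row-by-row order; any other order would do as long as every image uses the same one.

```python
def image_to_inputs(image, max_value=255):
    """Flatten a 2D image row by row and scale each pixel to the range 0.0 - 1.0."""
    return [pixel / max_value for row in image for pixel in row]


# A blank 16x16 image with a single fully bright pixel at row 2, column 3
image = [[0] * 16 for _ in range(16)]
image[2][3] = 255

inputs = image_to_inputs(image)
print(len(inputs))         # 256 input values, one per pixel
print(inputs[2 * 16 + 3])  # 1.0
```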

Since we have two distinct labels, a logical choice for our output is one binary node where we might decide:

```
0 = circle
1 = line
```

So long as we're consistent it shouldn't matter. If we have n distinct labels we can encode these in ceil(log2(n)) binary output values. If, say, we're looking at cell images and have 4 classifications we'd need 2 binary values where:

```
0,0 -> (1) cytoplasm     (empty space in a cell)
0,1 -> (2) vesicle       (a small spherical compartment)
1,0 -> (3) microtubule   (a narrow straight tube)
1,1 -> (4) mitochondria  (a fat bendy tube with stripes)
```
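The same encoding can be sketched in Python. The function names are illustrative, and `(n - 1).bit_length()` is used as an exact integer way of computing ceil(log2(n)) for n of 2 or more.

```python
def bits_needed(n_labels):
    """Number of binary output nodes needed for n distinct labels (n >= 2)."""
    return (n_labels - 1).bit_length()


def encode_label(index, n_bits):
    """Binary code for a label index, most significant bit first."""
    return tuple((index >> shift) & 1 for shift in reversed(range(n_bits)))


labels = ["cytoplasm", "vesicle", "microtubule", "mitochondria"]
n_bits = bits_needed(len(labels))
for index, name in enumerate(labels):
    print(encode_label(index, n_bits), "->", name)
```

With four labels this needs two output nodes and reproduces the table above, from (0, 0) for cytoplasm up to (1, 1) for mitochondria.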

Image Segmentation Example: "Segmenting Compartments in Cell Images"

Image labeling is pretty simple: given one input, we just want one answer (eg: circle or line). Neural networks can also be used to "segment images", breaking them up into compartments by isolating which pixels might represent the circle. The "training sets" in this case might consist of one big input image, and another image of the same size where all pixels labelled "00" represent empty space, "01" vesicle, and so on. After training the NN, we then might be given arbitrarily large images (way bigger than our training data sets) and asked to try and delimit our compartments of interest.

To achieve this, one method is to keep a relatively small, fixed size input (let's say 16x16x16 = 4096 values), center this on each pixel in our input image, and make sure we get the desired value for that pixel. Using this method, it doesn't matter how big the input images are: we can always analyze just 4096 values at a time, since more would be incredibly time consuming.
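A sketch of this fixed-size window idea, simplified to a 2D image and a small 3x3 window so it is easy to follow. The function name, the choice to pad with zeros outside the image edge and the window size are all assumptions; the text's 16x16x16 window would work the same way with one more dimension.

```python
def window_inputs(image, row, col, size=3, pad=0.0):
    """Collect the size x size block of pixels centered on (row, col) as a flat list."""
    half = size // 2
    height, width = len(image), len(image[0])
    values = []
    for r in range(row - half, row - half + size):
        for c in range(col - half, col - half + size):
            inside = 0 <= r < height and 0 <= c < width
            # Pixels outside the image are replaced by the padding value.
            values.append(image[r][c] if inside else pad)
    return values


# A tiny 4x4 image whose pixel values are simply 0..15
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(window_inputs(image, 1, 1))  # [0, 1, 2, 4, 5, 6, 8, 9, 10]
```

The network always sees the same number of values per pixel, however large the full image is.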

Challenges in Neural Networks

The Problem of Oversampling

In all neural networks "oversampling" can be a big problem: if you show too many examples of one type of result, your neural network will become too heavily biased towards that answer. For instance, if we continually see images of lines and very few circles, our NN might end up answering "line" no matter what we show it.

The Need for Optimization

Because there can be many layers and many nodes, neural networks can be incredibly slow to train and it becomes very important to use optimization. One of the slowest steps is "back propagation" - changing the weights of nodes from the bottom up.

Types of Neural Networks

There are many types of NN, so the list below is by no means exhaustive.

Perceptron

A perceptron is the simplest possible type of NN. A perceptron is a "linear classifier" which has no hidden layer (eg: Fig 1a). In the case of a "single-layer perceptron" we may have just one set of inputs mapping to a single output, and each incoming connection has an associated weight.
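A sketch of a single-layer perceptron trained on the two-input wet grass example. The training method used here is the classic perceptron learning rule (nudge each weight by the error times its input), which is not described in these notes; the learning rate, the number of passes and the names are assumptions.

```python
# (rain, sprinkler) -> grass wet
TRAINING_SETS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def predict(weights, bias, inputs):
    """Linear classifier: on if the weighted sum plus bias is above zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(training_sets, passes=10, learning_rate=1.0):
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(passes):
        for inputs, target in training_sets:
            error = target - predict(weights, bias, inputs)
            # Perceptron rule: only inputs that were on get their weight adjusted.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train_perceptron(TRAINING_SETS)
for inputs, target in TRAINING_SETS:
    print(inputs, "->", predict(weights, bias, inputs), "(expected", target, ")")
```

Because "either input on" can be separated by a straight line, this rule settles on weights that classify all four cases correctly; relationships that are not linearly separable are out of reach without a hidden layer.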

Restricted Boltzmann Machine

A Restricted Boltzmann Machine is another type of (stochastic) NN with one layer of visible neurons (representing both input and output) and one layer of hidden neurons. Connections between the neurons are bidirectional and symmetric, meaning information flows in both directions during the training and usage of the network.

Deep belief network

A "deep belief network" is a probabilistic model composed of multiple layers of stochastic (sporadic), latent (hidden) variables. The latent variables typically have binary values and are typically called hidden units or feature detectors.

Nodes on each layer have directed connections going to the layer below. The states of the bottom-most layer represent a "data vector". The two most significant properties of deep belief nets are:

• An efficient, layer-by-layer procedure for learning generative weights that determine how the variables in one layer depend on the variables in the layer above.
• After learning, the values of the latent variables in every layer can be inferred by a single, bottom-up pass that starts with an observed data vector in the bottom layer and uses the generative weights in the reverse direction.

What's fancy about deep belief networks is that while most "neural networks" require data and a set of labels (eg: an image and a segmented version of the image) to verify nodes by working out the probability of a label given an image, the deep belief network doesn't necessarily require labels - it's capable of working out the "probability of the image". What it tries to do is develop a set of rules for generating an output (let's say an image) from scratch - almost like us "imagining" what different types of cats might look like, and thus when we see an image of a cat we've never seen before we can decide if it fits our rules of what a cat looks like.

Deep belief networks are actually the reason I decided to write this article and I still don't quite understand how they work, but there are a couple of great resources below to get you started.