In this assignment we will introduce Radford Neal's software for Bayesian learning in neural networks through some hands-on experiments. This software package is fairly extensive, and explaining the full functionality of all the programs is not possible in the time available; instead we will start from a common script, try to understand what is going on, and see what happens when various interesting options are altered. The documentation is most easily read with a browser; try "netscape file:/usr/share/fbm/doc/index.html &" to get started.

--

Part I:

The basic script which we will use to train the network contains the following 8 lines of code:

  net-spec log 2 8 1 / - x0.2:0.5:1 0.1:0.5 - x0.05:0.5 x0.2:0.5:1 1
  model-spec log real 0.05:0.5
  data-spec log 2 1 / trains1 . tests1 . / +@x@ +@x@ / +@x@
  net-gen log fix 0.5
  mc-spec log repeat 10 sample-noise heatbath hybrid 100:10 0.4
  net-mc log 1
  mc-spec log sample-sigmas heatbath hybrid 1000:10 0.4
  net-mc log 100

You may wish to copy these lines to a file for easy reference (you may need to make this script file executable using "chmod 755 file-name"). The script trains a network with 8 hidden units, using the data in the file "trains1" for training and the data in the file "tests1" for testing. To get more information on what the individual commands do, refer to their documentation. (An annotated copy of the script, with a brief comment on each command, is given at the end of this walkthrough.)

Execute the script. The last command is what actually does the work, and it may take about 30 seconds to execute. You can monitor its progress using "log-last log". You can also view the network parameters at any time using "net-display log". (Note that you can kill the net-mc command and resume it later, simply by reissuing the command.)

Our primary concern is to figure out whether the Markov chain has sampled adequately from the posterior. To do this, we use the net-plt program to display various quantities from the log file. The first thing to check is that the rejection rate hasn't been too high; to plot it, do "net-plt t r log | pt". We can look at the error on the training set using the command "net-plt t b log | pt -lny". We can also plot the value of individual network weights; for example, the weight from the first input to the first hidden unit may be plotted using "net-plt t w1@1 log | pt". You may be very surprised by how much the value of the weight varies during the run! This, however, is not a bad sign; it simply indicates that the sampling procedure is able to move around and explore weight space (although there is no guarantee that we are not trapped in a low-dimensional manifold). Perhaps more usefully, we may also plot the values of the hyper-parameters. Since the hyper-parameters govern the behaviour of many weights, they tend to move less rapidly than the weights, and it is sometimes valuable to view them in order to figure out whether we have sampled for long enough. The ARD parameters are plotted using "net-plt t h1@ log | pt -lny".

To make predictions for the test cases we use the program called net-pred. To use 10 networks from iteration 21 onwards to make predictions based on the mean of the predictive distribution, use "net-pred ran log 21:+10". To view the function that the network implements, use "net-pred rbin log 21:+10 > res" to dump some numbers to a file. Then, inside Matlab, load the numbers using "load -ascii res". Look at the network function using "surf(reshape(res(:,3),65,65))", or the squared residuals using "surf(reshape(res(:,4),65,65))".
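To tie the walkthrough together, here is the script again, annotated with a brief comment on each command. The comments are a reading aid only; they paraphrase our understanding of the specifications (see the net-spec, model-spec, data-spec, net-gen, mc-spec and net-mc documentation for the authoritative details):

  # 2 inputs, 8 hidden units, 1 output; the specifications after "/" give
  # the priors for the groups of weights and biases. The x...:...:1 forms
  # specify hierarchical priors with a separate hyper-parameter per input,
  # which is what implements ARD.
  net-spec log 2 8 1 / - x0.2:0.5:1 0.1:0.5 - x0.05:0.5 x0.2:0.5:1 1
  # regression model: real-valued targets with Gaussian noise;
  # 0.05:0.5 is the prior specification for the noise level
  model-spec log real 0.05:0.5
  # 2 inputs and 1 target per case; trains1/tests1 hold the training and
  # test cases; the trailing +@x@ forms shift and rescale the inputs and
  # the target by data-dependent amounts (i.e. normalize them)
  data-spec log 2 1 / trains1 . tests1 . / +@x@ +@x@ / +@x@
  # generate an initial state, with the hyper-parameters fixed at 0.5
  net-gen log fix 0.5
  # short initial phase: each iteration repeats 10 times a noise
  # resampling followed by hybrid Monte Carlo with trajectories of
  # 100 leapfrog steps and stepsize factor 0.4; hyper-parameters stay fixed
  mc-spec log repeat 10 sample-noise heatbath hybrid 100:10 0.4
  net-mc log 1
  # main phase: now also resample the hyper-parameters (the "sigmas"),
  # using longer trajectories of 1000 leapfrog steps
  mc-spec log sample-sigmas heatbath hybrid 1000:10 0.4
  # run 100 iterations, saving the state in the log file after each
  net-mc log 100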
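Once the run has finished (or while it is still going), the convergence checks discussed above can be repeated in one batch; the commands below are exactly as introduced in the text, only grouped for convenience:

  net-plt t r log | pt            # rejection rate (should not be too high)
  net-plt t b log | pt -lny       # training-set error, on a log scale
  net-plt t w1@1 log | pt         # weight from input 1 to hidden unit 1
  net-plt t h1@ log | pt -lny     # the ARD hyper-parameters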
Using the test data to assess the adequacy of the samples obtained so far is naturally not an option when modelling real data, but it may help to give some insight into what is going on during sampling. Explain what may cause the differences between the answers obtained from the following pairs of commands:

  net-pred ran log 21:+25     and   net-pred ran log 21:+50
  net-pred ran log 21:60      and   net-pred ran log 61:100
  net-pred ran log 21:100+10  and   net-pred ran log 91:100+10

Have we reached equilibrium? Have we sampled sufficiently from the equilibrium distribution? Are consecutive samples independent? How many samples do we need for predictions? Some of these questions are difficult!

You may also plot the test error for each of the networks in the log file using "net-plt t B log | pt -lny". How does the performance of the individual nets compare to the performance of the ensemble?

Try other sizes of networks, e.g. with 4 and 2 and perhaps 16 hidden units. At what size do we start seeing the effect of under-fitting? At what size do we see over-fitting?

--

Part II:

Make some experiments with the 10-dimensional dataset, using train50, train100, train200 and test10000. You will need to modify the script in a few places to accommodate the larger input dimension; a sketch of the necessary changes is given below. Compare the performance of net-mc to one of the contestants from the previous days' exercises. Check the ARD parameters (see above). Is ARD able to recover the "active" inputs? Does the use of ARD facilitate the interpretation of what the model is doing? Is net-mc with ARD a black-box method?
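As a starting point, here is a minimal sketch of the modified script, assuming train100 is used for training and writing to a new log file "log10" (both choices are ours). The prior specifications are simply carried over unchanged from the 2-input script, which you may well want to reconsider; the real changes are the input dimension, the data files, and the normalization specification repeated for each of the 10 inputs:

  net-spec log10 10 8 1 / - x0.2:0.5:1 0.1:0.5 - x0.05:0.5 x0.2:0.5:1 1
  model-spec log10 real 0.05:0.5
  data-spec log10 10 1 / train100 . test10000 . / +@x@ +@x@ +@x@ +@x@ +@x@ +@x@ +@x@ +@x@ +@x@ +@x@ / +@x@
  net-gen log10 fix 0.5
  mc-spec log10 repeat 10 sample-noise heatbath hybrid 100:10 0.4
  net-mc log10 1
  mc-spec log10 sample-sigmas heatbath hybrid 1000:10 0.4
  net-mc log10 100
  net-plt t h1@ log10 | pt -lny   # one ARD hyper-parameter per input

With 10 inputs the ARD plot now shows ten curves; inputs that the model finds irrelevant should get small hyper-parameter values, effectively switching those inputs off.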