Second attempt with unsupervised pretraining

After the previous failure, I changed my strategy: I trained each layer separately and tried to find the best model parameters for each.

Unsupervised Layers

For the first layer, I tried four candidates: a gRBM with 2304 units, a gRBM with 576 units, a DAE with 2304 units, and a DAE with 576 units.

The DAEs generally performed worse than the gRBMs: their training got stuck at a reconstruction error around 6000. For the gRBM with 2304 units, after 4 days of training the reconstruction error dropped to 686.753 and stopped decreasing. For the gRBM with 576 hidden units, the minimum reconstruction error was 1161.23. I decided to use the gRBM with 2304 hidden units as my first pretrained layer.

For the second layer, I used a gRBM with 576 hidden units. The minimum reconstruction error was 195.316.

For the third layer, I tried 100 and 200 hidden units. Both had similar reconstruction errors: 19.655 and 20.1508 respectively.

As other people’s results didn’t report reconstruction errors (until I did the following experiments), I couldn’t make a comparison. But judging from the supervised training results, I think these numbers can probably be optimized further.
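The greedy layer-wise procedure described above can be summarized in a small numpy sketch. This is my own illustration of the idea, not the pylearn2 implementation: the layer sizes, noise level, learning rate, and function name are all made up, and a real run would use far more data and epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_dae_layer(X, n_hidden, lr=0.01, noise=0.1, epochs=200):
    """Greedily train one denoising-autoencoder layer (tied weights) on X.
    Returns (weights, hidden code, final mean squared reconstruction error)."""
    n, n_vis = X.shape
    W = rng.normal(0.0, 0.01, (n_vis, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_vis)
    err = float("inf")
    for _ in range(epochs):
        Xn = X + noise * rng.normal(size=X.shape)   # corrupt the input
        H = np.tanh(Xn @ W + b_h)                   # encode
        R = H @ W.T + b_v                           # decode
        D = R - X                                   # reconstruct the *clean* input
        err = float((D ** 2).sum(axis=1).mean())    # reconstruction error
        dR = 2.0 * D / n
        dH = (dR @ W) * (1.0 - H ** 2)              # backprop through tanh
        W -= lr * (dR.T @ H + Xn.T @ dH)            # tied-weight gradient
        b_v -= lr * dR.sum(axis=0)
        b_h -= lr * dH.sum(axis=0)
    return W, np.tanh(X @ W + b_h), err

# stack layers greedily: the code of layer 1 is the input of layer 2
X = rng.normal(size=(64, 32))
W1, H1, err1 = train_dae_layer(X, 16)
W2, H2, err2 = train_dae_layer(H1, 8)
```

The reconstruction errors quoted above are exactly this kind of quantity, computed layer by layer on each layer’s own input.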

Supervised Training

I tried several configurations to find the best model:

  1. 2304-hidden-unit gRBM layer followed directly by the linear regression layer
  2. 2304-hidden-unit gRBM layer followed by one sigmoid layer with 500 hidden units, then the linear regression layer
  3. 2304-hidden-unit gRBM layer followed by one sigmoid layer with 200 hidden units, then the linear regression layer
  4. 2304-hidden-unit gRBM layer with another 576-hidden-unit gRBM stacked on it, followed by one sigmoid layer with 200 hidden units, then the linear regression layer
  5. 2304-hidden-unit gRBM layer with another 576-hidden-unit gRBM stacked on it, followed by the linear regression layer
  6. 3 gRBMs stacked together, with 2304, 576, and 200 hidden units, followed by the linear regression layer
  7. 3 gRBMs stacked together, with 2304, 576, and 100 hidden units, followed by the linear regression layer

For each configuration I tried two learning rates, 0.001 and 0.005. In most cases, 0.001 was better.

After comparing all the results, I found that the two-layer structures have the best validation errors, and that adding a sigmoid layer after the pretrained layers is not a good idea. The goal of the supervised phase is to fine-tune the parameters learned during unsupervised training, but the parameters of the new sigmoid layer are randomly initialized and need more tuning than the other layers.

Among the models above, the best is model 5, which achieved a 90.0467 validation error and a 3.15001 Kaggle score, my best so far.


First Attempt with Unsupervised Pretraining

As I had no experience with unsupervised pretraining, I decided to begin my first attempt with the tutorial script “deep_trainer” in pylearn2.

The tutorial stacks a gRBM, an autoencoder, and a DenoisingAutoencoder as unsupervised layers, with a softmax regression layer as the final supervised layer. First I replaced the softmax layer with one sigmoid layer and a linear regression layer.

I tried different combinations of AE, DAE, and gRBM:

  • gRBM – gRBM – gRBM
  • AE –  AE – AE
  • DAE – DAE – DAE
  • gRBM – gRBM – DAE
  • DAE – DAE – gRBM

To isolate the differences between the unsupervised models, I used the same number of hidden units in each layer: 500-300-100. The learning rate was 0.8.

Unfortunately, the results were all very bad: the validation error was over 1000.

Two reasons may explain this failure:

  1. In these models the training was layer by layer: during the training of the supervised layer, the parameters of the previous unsupervised layers were frozen, so there was no actual ‘fine-tuning’.
  2. The training was not run long enough for the reconstruction errors to be minimized.

Optimize MLP

After the starter experiment, I decided to optimize the MLP models first.

As a first step, I tried to find the best unit types and hyper-parameters.

I tested three types of units: RectifiedLinear, Sigmoid, and Tanh.

I tested the following network structures (each with a linear regression layer as the output layer):

  1. 500 RectifiedLinear – 300 RectifiedLinear – 100 RectifiedLinear
  2. 500 RectifiedLinear – 300 RectifiedLinear
  3. 500 Sigmoid – 300 Sigmoid – 100 Sigmoid
  4. 500 Sigmoid – 300 Sigmoid
  5. 500 Tanh – 300 Tanh – 100 Tanh
  6. 500 Tanh – 300 Tanh

I also tried different learning rates varying from 0.01 to 0.001.

After comparing the results, I found that the sigmoid and tanh units outperform the RectifiedLinear units, with sigmoid slightly better.

The results also show that the 3-hidden-layer structures do not improve on the 2-hidden-layer structures, at least not significantly.

I also tried the momentum technique: the initial momentum is set to a small value (e.g. 0.0 to 0.2), then slowly increased during training, saturating after 30 epochs at a value of 0.6. This slightly improved the results.
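The schedule can be written down concretely. This is a plain-Python sketch with my own function name and defaults; in practice pylearn2’s momentum adjuster applies the equivalent schedule inside the training loop.

```python
def momentum_at(epoch, start=0.05, final=0.6, saturate=30):
    """Momentum schedule: begin at a small value, grow linearly
    during training, and saturate at `final` after `saturate` epochs."""
    if epoch >= saturate:
        return final
    return start + (final - start) * epoch / saturate

# the momentum enters the usual SGD velocity update:
#   v <- momentum * v - learning_rate * gradient
#   w <- w + v
```

With these defaults the momentum starts at 0.05, reaches 0.6 at epoch 30, and stays there for the rest of training.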

The best result was achieved by a 500-sigmoid-unit layer followed by a 300-sigmoid-unit layer, with the learning rate at 0.005 and momentum varying from 0.05 to 0.6. I got a 130.607 validation error and a 3.72225 score on the Kaggle public test set.

Next, I tried to increase the size of the network: 2304, 5000, and 6000 sigmoid units in the first hidden layer, with 500 sigmoid units in the second. The best model has 2304 sigmoid units in the first hidden layer and 500 in the second. Using the hyper-parameters mentioned above, I got a 95.8553 validation error and a 3.33532 Kaggle public test score.

A follow-up:

In Nicholas Leonard’s git wiki, he mentions that Caglar got a very good score with a single hidden layer of 6000 units. I tried but failed to reproduce this result, and I couldn’t find the corresponding blog or GitHub repo for that experiment to figure out why I failed.

Getting Started with the Keypoint Data Set

Unlike the previous data set, the keypoint task is a regression instead of a classification.

The raw data is 7500 images of human faces, each 96×96 pixels, with each pixel stored as a grayscale value between 0 and 255. The expected output is 15 keypoints on the face, each represented by x and y coordinates. Only a few images are labeled (i.e. have correct keypoint coordinates).
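In terms of shapes, the task maps a flattened 96×96 image to a 30-dimensional target vector. The sketch below uses dummy zero arrays just to pin down the dimensions; the actual loading is done by the starter file, and missing labels are not handled here.

```python
import numpy as np

n_images, side, n_keypoints = 7500, 96, 15

# inputs: grayscale pixels in [0, 255], flattened to 9216 values per image
X = np.zeros((n_images, side * side), dtype=np.uint8)

# targets: (x, y) coordinates of the 15 keypoints, flattened to 30 values
Y = np.zeros((n_images, n_keypoints * 2), dtype=np.float32)

# common preprocessing before regression: rescale pixels to [0, 1]
X_scaled = X.astype(np.float32) / 255.0

# a single image back in 2-D form, and its keypoints as (x, y) pairs
img = X[0].reshape(side, side)
points = Y[0].reshape(n_keypoints, 2)
```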

To begin with, I decided to try the basic algorithms. Thanks to Vincent, the provided starter file contains a simple MLP model with one hidden layer of 500 tanh units.

I got the same result as others reported: around 140 validation error and 3.97878 on the Kaggle public test set.

I optimized some hyper-parameters, such as the learning rate, and increased the number of hidden units. I was able to slightly improve the validation error to 138-139, with a 3.96335 score on the Kaggle public test set.

Exp4: Maxout Networks (Jobs Killed :( ) and Conclusions

In the Kaggle forums, Ian mentioned that the Maxout networks from his paper could be applied to the contest.

In the paper, Ian proposes a new activation function: each unit returns the maximum of the outputs computed by its k affine feature maps.
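In code, the activation looks roughly like this. This is my own numpy sketch of the idea, not Ian’s pylearn2 implementation; in particular, the way the k feature maps are grouped in the weight matrix is an assumption of this sketch.

```python
import numpy as np

def maxout(x, W, b, k):
    """Maxout: compute n_out * k affine feature maps, then each output
    unit returns the max over its group of k maps."""
    z = x @ W + b                                    # shape (..., n_out * k)
    n_out = z.shape[-1] // k
    return z.reshape(z.shape[:-1] + (n_out, k)).max(axis=-1)

# 2 inputs, 2 output units, k = 2 feature maps per unit
x = np.array([[1.0, -2.0]])
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
b = np.zeros(4)
# maps for unit 0: (x0, x1) -> max is 1.0; maps for unit 1: (0, 0) -> 0.0
out = maxout(x, W, b, 2)
```

Note that there is no elementwise nonlinearity at all: the max over linear pieces is what makes the unit nonlinear.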

As a first step, I picked some models which had relatively good results in the previous experiments:

  1. Convolutional layer + MLP + softmax
  2. Two convolutional layer + MLP + softmax

For the hyper-parameters, I decided to use the examples created by Ian in the scripts folder of pylearn2, and I tried several learning rates varying from 0.1 to 0.001.

(As I haven’t figured out how to use the GPUs in the LISA lab, this will take long…)


The training process is extremely long without a GPU, so my experiments were killed.

The partial results show that the two-convolutional-layer + MLP + softmax structure, with dropout applied only on the MLP layer, has the best validation error: 0.23.
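Dropout on the MLP layer randomly zeroes hidden units during training so they cannot co-adapt. A minimal sketch of the mechanism, assuming the ‘inverted’ variant that rescales survivors at train time (pylearn2 instead rescales at prediction time; the function name and keep probability here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.5, train=True):
    """Zero each unit with probability 1 - p_keep during training,
    rescaling the survivors so the expected activation is unchanged."""
    if not train:
        return h
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep
```

At test time the layer is simply the identity, so no extra computation is needed.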



The results show that the deeper architectures (multiple conv layers plus an MLP) do help improve the learning result, but at the cost of considerably longer training times. From my observations, the best structure for this task is one or two convolutional layers followed by one MLP layer and a softmax.

In these experiments, I tried the momentum adjuster used in Ian’s maxout paper, with 0.6 as the final momentum instead of 0.99; it does give slightly faster convergence in number of epochs.

Exp3: Applying the transformation

In his post, Xavier Bouthillier offers a Python module that populates the training set by applying transformations.

Using this module, I retried the different configurations from the previous experiments.
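I have not reproduced Xavier’s module here; the sketch below only illustrates the idea of populating the training set with label-preserving transformations. The specific transformations (a horizontal flip and one-pixel shifts) and the function name are my own choices for illustration.

```python
import numpy as np

def augment(images):
    """Expand a batch of images (n, h, w) with simple transformations:
    a mirror image plus one-pixel shifts. Returns 4x as many images."""
    out = [images,
           images[:, :, ::-1],            # horizontal flip
           np.roll(images, 1, axis=2),    # shift one pixel right
           np.roll(images, -1, axis=1)]   # shift one pixel up
    return np.concatenate(out, axis=0)
```

For a classification task, the label array is simply repeated to match; a real augmentation module typically also applies rotations and scalings.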


The best validation error dropped slightly, to 14% (but the test error on Kaggle dropped by 9%). The best model is similar to the previous one: convolutional layer + MLP + softmax. The hyper-parameters are:

layers: [ !obj:pylearn2.models.mlp.ConvRectifiedLinear {
    layer_name: 'h2',
    output_channels: 64,
    irange: .05,
    kernel_shape: [8, 8],
    pool_shape: [4, 4],
    pool_stride: [2, 2],
    max_kernel_norm: 1.9365
}, !obj:pylearn2.models.mlp.RectifiedLinear {
    layer_name: 'h0',
    dim: 800,
    sparse_init: 15,
}, !obj:pylearn2.models.mlp.Softmax {
    max_col_norm: 1.9365,
    layer_name: 'y',
    n_classes: 7,
    istdev: .05
} ]

Some observations:

  1. The training error drops extremely rapidly, usually below 0.01 in the first epoch. Why?
  2. Has anyone successfully used the cluster for computing? My job has been waiting in the queue for 3 days.
  3. Why did the two-convolutional-layer models not work as well as the one-convolutional-layer models? (I expected them to work better.)


I added a new model applying the dropout technique to see if it makes any improvements. Still running.

Exp2: Convolutional Neural Networks


Convolutional neural networks have been successfully applied to pattern recognition tasks such as MNIST (see for example Yann LeCun’s papers). The pylearn2 tutorial about CNNs is a perfect starting point.


  1. First, I tried several learning rates varying from 0.1 to 0.0005. Among them, 0.1 and 0.0005 were abandoned: 0.1 is generally too large for training to converge, and 0.0005 is too inefficient. The interval between 0.01 and 0.001 is acceptable (validation error below 25%), depending on the structure and the other hyper-parameters.
  2. Momentum: I compared the results of the same model trained with and without momentum, which confirms that momentum helps improve the training result. But the current momentum adjustor in pylearn2 keeps increasing the momentum during training. I think it might be helpful to try a momentum adjustor that increases the momentum at the beginning of training but decreases it at the end (after a certain epoch, or once the training or validation error drops below a threshold).
  3. Filter size: initially I tried a [10, 10] filter for the first convolutional layer and a [5, 5] filter for the second (if it exists). Later I also tried an [8, 8] filter, which seems to give a better validation error rate.
  4. Pool shape and pool stride: for the pool shape I used [4, 4]; I also tried [2, 2], which didn’t give better results. For the pool stride, I compared [1, 1] and [2, 2], and [2, 2] seems to give better results.
  5. Max kernel norm: initially I used the default value 1.9; then, after reading Pierre-Luc’s implementation, I also tried 0.9, which slightly improves the results (by 1%~5% on the validation set, depending on the model).
  6. Output channels: I tried 32 and 64; they generally give similar results.
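These shape choices interact, and a quick way to sanity-check a configuration is to compute the feature-map size after each conv + pool stage. The helper below is my own, assuming 48×48 input images, valid convolution, and the usual pooling formula; pylearn2’s exact edge handling may differ.

```python
def conv_pool_size(in_size, kernel, pool, stride):
    """Spatial size after a valid convolution (kernel x kernel)
    followed by pooling with the given pool shape and stride."""
    conv = in_size - kernel + 1
    return (conv - pool) // stride + 1

# [8, 8] kernel, [4, 4] pool, [2, 2] stride on a 48x48 image:
# conv map 41x41, pooled map 19x19
first = conv_pool_size(48, 8, 4, 2)

# a second layer with a [5, 5] kernel and the same pooling shrinks it further
second = conv_pool_size(first, 5, 4, 2)
```

Keeping these sizes in mind helps avoid configurations where a second convolutional layer leaves almost no spatial resolution for the MLP.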


I also tried different combinations of layers to find the best result:

  1. Convolutional layer + softmax
  2. Two convolutional layer + softmax
  3. Convolutional layer + MLP + softmax
  4. Two convolutional layer + MLP + softmax

For most models I tried different combinations of hyper-parameters, but due to time limits I didn’t explore all possible configurations.


The best result has 0.0% training error and 14.5% validation error, achieved by the convolutional layer + MLP + softmax model. The hyper-parameters are:

layers: [ !obj:pylearn2.models.mlp.ConvRectifiedLinear {
    layer_name: 'h2',
    output_channels: 64,
    irange: .05,
    kernel_shape: [8, 8],
    pool_shape: [4, 4],
    pool_stride: [2, 2],
    max_kernel_norm: 0.9
}, !obj:pylearn2.models.mlp.RectifiedLinear {
    layer_name: 'h1',
    dim: 1000,
    sparse_init: 15,
}, !obj:pylearn2.models.mlp.Softmax {
    max_col_norm: 1.9365,
    layer_name: 'y',
    n_classes: 7,
    istdev: .05
} ]