After the previous failure, I changed my strategy: I trained each layer separately, trying to find the best parameters for each model.

**Unsupervised Layers**

For the first layer, I tried four models: a gRBM with 2304 units, a gRBM with 576 units, a DAE with 2304 units, and a DAE with 576 units.

The DAEs generally performed worse than the gRBMs: their training got stuck at a reconstruction error of about 6000. For the gRBM with 2304 hidden units, after 4 days of training, the reconstruction error dropped to 686.753 and stopped decreasing. For the gRBM with 576 hidden units, the minimum reconstruction error was 1161.23. I decided to use the gRBM with 2304 hidden units as my first pretrained layer.

For the second layer, I used a gRBM with 576 hidden units. Its minimum reconstruction error was 195.316.

For the third layer, I tried 100 and 200 hidden units. Both gave similar reconstruction errors: 19.655 and 20.1508, respectively.
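As a sketch of what this layer-wise pretraining involves, here is a minimal Gaussian-Bernoulli RBM trained with one step of contrastive divergence (CD-1) in NumPy. The class name, the mean-field reconstruction, and all hyperparameters here are illustrative assumptions, not the actual setup used in the experiments above.

```python
import numpy as np

class GaussianRBM:
    """Gaussian-Bernoulli RBM: real-valued visible units (unit variance
    assumed, so inputs should be standardized) and binary hidden units."""

    def __init__(self, n_visible, n_hidden, lr=0.001, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        # P(h = 1 | v) for the binary hidden units
        return self._sigmoid(v @ self.W + self.b_h)

    def cd1_step(self, v0):
        """One CD-1 update on a batch; returns the per-example
        squared reconstruction error."""
        h0 = self.hidden_probs(v0)
        # mean-field reconstruction: Gaussian visibles have mean h @ W.T + b_v
        v1 = h0 @ self.W.T + self.b_v
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return float(np.sum((v0 - v1) ** 2) / n)
```

Pretraining a layer is then just a loop of `cd1_step` calls while the reconstruction error keeps decreasing; the second-layer gRBM would be trained the same way on the first layer's `hidden_probs` outputs, and the third on the second's.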

Since other people's results did not report reconstruction errors (until I ran the following experiments), I could not make direct comparisons. But judging from the supervised training results, I believe these numbers could be further improved.

**Supervised Training**

I tried several configurations to find the best model:

1. a 2304-unit gRBM layer followed directly by the linear regression layer;
2. a 2304-unit gRBM layer, followed by one sigmoid layer with 500 hidden units, then the linear regression layer;
3. a 2304-unit gRBM layer, followed by one sigmoid layer with 200 hidden units, then the linear regression layer;
4. a 2304-unit gRBM layer with a 576-unit gRBM stacked on it, followed by one sigmoid layer with 200 hidden units, then the linear regression layer;
5. a 2304-unit gRBM layer with a 576-unit gRBM stacked on it, followed directly by the linear regression layer;
6. three gRBMs stacked together, with 2304, 576, and 200 hidden units, followed by the linear regression layer;
7. three gRBMs stacked together, with 2304, 576, and 100 hidden units, followed by the linear regression layer.

For each configuration I tried two learning rates, 0.001 and 0.005. In most cases, 0.001 was better.
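A minimal sketch of how such a configuration can be fine-tuned: pretrained sigmoid layers topped by a linear regression output, trained end-to-end on squared error with plain NumPy backprop. The class name, layer sizes, and training loop are illustrative assumptions; in practice the `init_weights` would come from the pretrained gRBMs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FineTuneNet:
    """Sigmoid hidden layers plus a linear regression output layer,
    fine-tuned end-to-end on squared error (full-batch gradient descent)."""

    def __init__(self, layer_sizes, lr=0.001, init_weights=None, seed=0):
        # layer_sizes e.g. [2304, 576, 1]: every layer is sigmoid except
        # the last, which is linear. init_weights: optional list of (W, b)
        # pairs, e.g. taken from pretrained gRBMs.
        rng = np.random.default_rng(seed)
        self.W, self.b = [], []
        for i in range(len(layer_sizes) - 1):
            if init_weights is not None and i < len(init_weights):
                W, b = init_weights[i]
                self.W.append(W.copy())
                self.b.append(b.copy())
            else:
                self.W.append(rng.normal(0.0, 0.01,
                                         (layer_sizes[i], layer_sizes[i + 1])))
                self.b.append(np.zeros(layer_sizes[i + 1]))
        self.lr = lr

    def forward(self, x):
        acts = [x]
        for i in range(len(self.W) - 1):
            acts.append(sigmoid(acts[-1] @ self.W[i] + self.b[i]))
        acts.append(acts[-1] @ self.W[-1] + self.b[-1])  # linear output layer
        return acts

    def step(self, x, y):
        """One full-batch gradient step; returns the MSE before the update."""
        acts = self.forward(x)
        n = x.shape[0]
        delta = (acts[-1] - y) / n  # grad of squared error wrt linear output
        for i in reversed(range(len(self.W))):
            gW = acts[i].T @ delta
            gb = delta.sum(axis=0)
            if i > 0:  # backpropagate through the sigmoid below
                delta = (delta @ self.W[i].T) * acts[i] * (1 - acts[i])
            self.W[i] -= self.lr * gW
            self.b[i] -= self.lr * gb
        return float(np.mean((acts[-1] - y) ** 2))
```

With pretrained weights passed as `init_weights`, every parameter starts near a good solution, so a small learning rate suffices; a freshly inserted sigmoid layer instead starts from random weights and needs much more adjustment than the pretrained ones.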

After comparing all the results, I found that the two-layer structures had the best validation errors, and that adding a sigmoid layer after the pretrained layers is not a good idea. The goal of supervised training is to fine-tune the parameters learned during unsupervised pretraining; the parameters of a new sigmoid layer are randomly initialized and need much more tuning than the other layers.

Among the models above, the best was model 5, which achieved a validation error of 90.0467 and a Kaggle score of 3.15001, my best so far.