Thursday, December 6, 2018

A One-Page Review of Recent Improvements in Deep Neural Networks for Optical Object Recognition and Detection.



Background :


In spite of the tremendous success of neural networks on virtually every front of the machine-learning battleground since the early 2000s, let us not forget that not so long ago the field went through a winter of some twenty years after being sentenced to death by Marvin Minsky, following the publication of his decisive book Perceptrons in 1969. In this monumental book, Minsky showed mathematically that the perceptron, proposed by Frank Rosenblatt in 1957, although promising in some domains, cannot perform the XOR operation. Although the claim is only true of linear, single-layer perceptrons, Minsky’s preeminent status in AI at the time led to a significant decline in interest and funding for neural network research over the following twenty years, until around 2000.

Architectures :


LeNet

The second wave of popularity of neural networks dates back to the success of LeNet-5 on MNIST handwritten digit recognition. In his 1998 review paper, Yann LeCun claimed that a gradient-based learning algorithm, coupled with an appropriate architecture, can handle image recognition efficiently and accurately. Convolutional neural networks, which in essence exploit the strong local correlation in 2D space and use weight sharing and sub-sampling to obtain shift invariance, serve as a feature-extraction scheme for images; the extracted feature vectors are then classified by a multi-layer perceptron. This design laid the foundation for the renaissance of neural networks.
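For concreteness, here is a minimal LeNet-5-style sketch. PyTorch is used purely as an illustration vehicle (the original work predates it), and the class name LeNet5, the tanh activations, and plain average pooling are my own simplification of the published architecture.

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Minimal LeNet-5-style sketch: convolution + sub-sampling for feature
    extraction, then a multi-layer perceptron on the flattened features."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutions exploit local 2D correlation with shared weights;
        # average pooling (sub-sampling) gives a degree of shift invariance.
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )
    def forward(self, x):
        x = self.features(x)                      # (N, 16, 5, 5) for a 32x32 input
        return self.classifier(torch.flatten(x, 1))

logits = LeNet5()(torch.randn(1, 1, 32, 32))      # MNIST digits padded to 32x32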

AlexNet

In 2012, Alex Krizhevsky’s AlexNet set a new record for the ILSVRC classification challenge with a top-5 error of only 15.3%; the second-best entry’s top-5 error was 26.2%. The network consists of five convolutional layers alternating with max-pooling layers, followed by three fully connected layers. AlexNet proved that wider and deeper networks, given sufficient training data, can significantly outperform non-gradient-based classifiers on image classification. Several techniques were introduced to achieve this improvement, such as training on two GPUs, data augmentation, ReLU activation, and dropout regularization.
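To inspect the layer layout directly, torchvision’s reference AlexNet can be instantiated as below; this sketch assumes a reasonably recent torchvision, where the keyword is weights (older releases call it pretrained).

import torch
import torchvision.models as models

# torchvision's reference AlexNet; weights=None builds the architecture
# without downloading pretrained parameters.
model = models.alexnet(weights=None)
print(model.features)               # 5 conv layers interleaved with ReLU and max-pooling
print(model.classifier)             # dropout + fully connected layers
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])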

VGG

Karen Simonyan and Andrew Zisserman's VGG net was the runner-up in the 2014 ILSVRC. Its architecture is very similar to AlexNet and LeNet, but with only one pipeline, thanks to improvements in GPUs. The network consists of five feature-extraction blocks; each block consists of multiple convolution layers and ends with one max-pooling layer. The feature-extraction module is followed by a classification module of three fully connected layers. The advantage of VGG is probably its simple-to-implement structure and its open-sourced pretrained weights, which make it easy to understand and to deploy for various goals. However, it suffers from a massive number of parameters, adding up to about 550 MB, so training it from scratch is slow.
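A rough PyTorch sketch of the block pattern described above; vgg_block is a hypothetical helper name, and the repetition counts and channel widths follow the commonly used VGG-16 configuration.

import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """One VGG feature-extraction block: stacked 3x3 convs, then 2x2 max-pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Five blocks halve the spatial resolution from 224x224 down to 7x7.
features = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)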

GoogLeNet

GoogLeNet, developed by Google, was the champion of the 2014 ILSVRC. VGG cannot simply go deeper in the number of layers, because with that kind of architecture accuracy drops as the network gets deeper (the so-called degradation problem). GoogLeNet first introduced the idea of the ‘inception’ module, which breaks a fixed-size convolutional kernel apart into multiple convolutional branches with various depths and kernel sizes. This extracts features better and more easily, because a hand-picked, fixed kernel size is not optimal for feature extraction and the size itself is not trainable; with multiple feature extractors of different designs, this inductive bias in the architecture design is avoided. In addition, inception effectively creates an ensemble of modules with different inductive biases, and combining them can find better features. As a result, the network generalizes better and becomes easier to train. GoogLeNet also introduced the 1x1 ‘bottleneck’ convolution, and replaced fully connected layers with average pooling to reduce parameters. To conclude, this factorization allows the architecture to go deeper. GoogLeNet comes in several upgraded versions that combine tricks from other networks, such as residual learning and batch normalization.
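The sketch below is a simplified inception module in PyTorch, meant only to illustrate the parallel branches and the 1x1 bottlenecks; the class and argument names are mine, and the example channel counts are the commonly cited ones for GoogLeNet’s first inception block.

import torch
import torch.nn as nn

class Inception(nn.Module):
    """Simplified inception module: parallel branches with different kernel sizes,
    each preceded by a 1x1 'bottleneck' to cut channels, concatenated along depth."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))
    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

y = Inception(192, 64, 96, 128, 16, 32, 32)(torch.randn(1, 192, 28, 28))   # -> (1, 256, 28, 28)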

ResNet

As demonstrated by GoogLeNet, going deeper can improve accuracy given 1) the right architecture and 2) sufficient gradient flow. ResNet allows training networks as deep as 152 layers. The contribution of the paper is the observation that learning an identity mapping with a standard convolutional or dense block is not trivial; a residual connection from the input to the output of each block solves the problem. When the identity mapping is optimal, the model can simply drive the block’s output toward zero so that the skip connection passes the input through unchanged, while otherwise the block only has to learn the residual correlations in the data. Early in training, the gradient vanishes across layers; the residual-connection ‘highway’ allows the signal to travel from the last layer back to the first few layers.
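A minimal residual block in PyTorch, showing the skip connection; this is the ‘basic block’ flavor for the case where input and output have the same shape (the projection shortcut needed when the channel count changes is omitted).

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x, so the stacked layers only have to
    learn the residual F(x); identity is recovered by driving F toward zero."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)     # skip connection: gradients flow through '+ x' unchanged

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))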

DenseNet

DenseNet can be seen as a variation of ResNet. It places skip connections not only between adjacent blocks but between every pair of blocks, which further improves training efficiency. In addition, the classifier uses features from multiple blocks, not just the last one. The added complexity of these ‘skip connections’ can be seen as a form of ensemble learning.
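A toy dense block in PyTorch illustrating the concatenation-style skip connections, where every layer sees all earlier feature maps; the growth rate and the number of layers are arbitrary illustrative values.

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Toy dense block: each layer receives the concatenation of all earlier
    feature maps, so its input width grows by `growth` channels per layer."""
    def __init__(self, in_ch, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1, bias=False),
            ))
    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)     # all feature maps, old and new

y = DenseBlock(64)(torch.randn(1, 64, 28, 28))    # -> (1, 64 + 4*32, 28, 28)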

Algorithms :


The Rectified Linear Unit (ReLU) activation function is used to solve the problem of vanishing gradients that results from the saturation of outputs passed through the sigmoid function. ReLU allows training deeper networks (more than about 8 layers) effectively, without having to normalize the input to the activation function.
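A quick PyTorch illustration of the saturation argument: the sigmoid gradient collapses toward zero at large input magnitudes, whereas the ReLU gradient is exactly 1 for every positive input.

import torch

x = torch.linspace(-6, 6, 5, requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)     # tiny at both ends: the saturated sigmoid kills the gradient

x = torch.linspace(-6, 6, 5, requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)     # 0 for negative inputs, exactly 1 for positive ones: no saturation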

Local response normalization (LRN) enhances local contrast in the feature map, while normalizing local responses that are relatively homogeneous. The effect is roughly equivalent to emphasizing high-frequency differences while damping low-frequency ones.
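In PyTorch this is available as nn.LocalResponseNorm; the hyperparameters below are the ones reported for AlexNet.

import torch
import torch.nn as nn

# Each activation is divided by a term computed from its neighbors across channels.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)
out = lrn(torch.randn(1, 96, 55, 55))     # same shape, locally normalized responses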

Overlapping max pooling is a sub-sampling technique that replaces the average pooling of LeNet-5 and also outperforms non-overlapping max pooling. The authors claim that overlapping windows help prevent loss of information and therefore reduce error.
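In code, the difference is simply that the pooling kernel is larger than its stride; a small PyTorch comparison (AlexNet uses 3x3 windows with stride 2):

import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)(x)   # 3x3 windows, stride 2: neighbors overlap by one pixel
disjoint    = nn.MaxPool2d(kernel_size=2, stride=2)(x)   # 2x2 windows, stride 2: no overlap
print(overlapping.shape, disjoint.shape)                 # both (1, 96, 27, 27) for this input size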

Dropout is a technique to prevent overfitting by randomly deactivating neurons with a fixed probability, so that the network cannot rely too heavily on a subset of active neurons; it forces all subsets of neurons to get an equal chance to learn, in a stochastic fashion. The benefit is threefold. First, it prevents some neurons from dying, so learning is more distributed and less likely to fall into a local minimum. Second, it effectively reduces variance in the same fashion as ensemble learning. Finally, it mitigates the bias of the initial weights by ‘shuffling’ which sub-network of the ensemble is trained at each training iteration.
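A minimal PyTorch illustration of dropout’s behavior in training versus evaluation mode:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)    # each unit is zeroed with probability 0.5 during training
x = torch.ones(1, 8)

drop.train()
print(drop(x))              # roughly half the units are zeroed; survivors are scaled by 1/(1-p)
drop.eval()
print(drop(x))              # at test time dropout is a no-op, implicitly averaging the sub-networks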

Xavier initialization is a technique to prevent bad initialization of the weights. Bad initialization can lead to dead neurons and weak gradients when the biases and weights drive the activation functions into saturation (either end of the sigmoid, or the negative side of ReLU). It is important in training very deep networks, because gradients and activations tend to attenuate across layers. Its success also demonstrates that good initialization does matter.
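In PyTorch, this scheme is exposed as nn.init.xavier_uniform_ (or xavier_normal_); a minimal usage sketch:

import torch.nn as nn

layer = nn.Linear(512, 256)
# Xavier/Glorot initialization scales the weight variance by fan-in and fan-out
# so that activations and gradients keep roughly constant variance across layers.
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)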

Batch normalization. The authors argue that standard optimization suffers from internal covariate shift. Normalizing the activations, removing the mean and unitizing the variance, prevents them from saturating the non-linear activation functions, which would otherwise make neurons insensitive and cause gradients to vanish. The benefit is more pronounced when the network is deep.
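A small PyTorch sketch of the normalization effect on one batch of activations (BatchNorm also learns a per-feature scale and shift, not shown in the printout):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(256)                  # normalizes each feature over the batch
x = 5.0 + 3.0 * torch.randn(64, 256)      # activations with non-zero mean and large variance
y = bn(x)                                 # in training mode: batch statistics are used
print(round(y.mean().item(), 3), round(y.std().item(), 3))   # ~0.0 and ~1.0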

The Adam optimizer has become the standard optimizer regardless of the model being trained. It combines the properties that proved powerful in previous optimizers, such as an adaptive learning rate, exponential annealing, and rescaling of the gradient so that the update stays invariant to rescaling of the objective function. In addition, it tends to prefer flat minimum basins.
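A minimal PyTorch training step with Adam; the learning rate and the exponential-decay rates (betas) below are simply the library defaults.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# Adam keeps exponential moving averages of the gradient (first moment) and its
# square (second moment) to give each parameter its own adaptive step size.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()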
