Thursday, December 6, 2018

Book Review: How to Create a Mind, Ray Kurzweil


The author Ray Kurzweil is known for his audacious predictions about the near future. His predictions usually come from his understanding of how things work, reasoned the way the Greek philosophers did. He extrapolates theories derived from thought experiments as far as they will go, then reins them in until they fit empirical reality. He has written many books that predict the future, and one of my favorites is The Singularity Is Near. That book concludes that artificial intelligence is going to outsmart humans at an exponential rate. The prediction appears to be right in retrospect. Recent news from Google shows that NLP is already good enough to fool a real person, and object recognition has outperformed humans in some competitions. Some questions considered the most difficult a decade ago now look basic to most engineers. The rate of improvement in AI indeed accelerates exponentially. Kurzweil calls this the Law of Accelerating Returns (LOAR).

He claims all information technology follows the same pattern, with Moore's Law happening to be the most successful example. After all, our modern world is built on computing power per unit cost, which essentially relies on the miniaturization of transistors improving at an exponential rate.

The book How to Create a Mind tries to address an even more complicated question, one that humans have pondered since the start of history: how does the mind work? Or, from an engineering perspective, how do we create a mind?

The book leads us through a series of thought experiments to deduce some basic properties of psychophysics. Some of these properties have been known to philosophers for a long time, but the author puts them in the light of artificial intelligence.

Chapter 1:

A mind that can reflect on its own existence is a uniquely human ability. Descartes's 'I think, therefore I am' may be self-evident, but it has never landed on solid scientific ground. Consciousness, free will, and mind were just alternative words for the soul in the religious era. Their essence is still elusive, and that is why having a theory of mind is important. The evidence from neuroscience is like words, but words need a language -- a theory -- to put them into sentences. We need a unified model of our minds in order to understand how minds work.

However, is it achievable? The author believes it is, based on how scientists successfully arrived at the theory of evolution and the theory of relativity. He reminds us that most people are limited by a lack of ambition to see beyond their peers' beliefs.

Chapter 2:

The book is gigantic and worth reading time and again, so here I just excerpt some quotations as footnotes to the chapter.
Q: Does the brain work the way computers do?
No. We don't store every detail of every episode of our lives in our brains. Our memories are stored in a hierarchical manner. We only perceive and store what is important to us. This selective filtering is achieved by hierarchically organized receptive fields. Each layer reduces the dimensionality (complexity) of the features, pushing the neural network to learn representations in the most efficient way by minimizing prediction errors. In retrospect, such hierarchical representation is effective because our world consists of objects that can be modularized into attributes at hierarchical levels of abstraction. We recall our memories by reconstructing the input from the stored coding of the neuronal network through the hierarchical layers (an inverse of the forward pass). This explains why our memories can be faulty in detail while the emotional tags remain accurate. Not only are our explicit memories stored in hierarchical patterns; the procedural routines of actions are too, which is why we are able to generalize our motor controls by reusing these patterns. The same type of hierarchy is also present in our ability to recognize contexts and situations.

Our memories are stored in sequential patterns. We can't easily recall a series of numbers in reverse order, yet we can easily access them in the order they were remembered; we are unable to directly reverse the sequence of a memory. This is consistent with the fact that associative learning of declarative memory (episodic and semantic) is encoded by LTP and LTD of synaptic connections. The presynaptic terminal releases neurotransmitters and the postsynaptic dendrites carry receptors to receive them, so the signaling pathway is directional.
We can recognize a pattern even if only part of it is perceived (seen, heard, felt), and even if it contains alterations.
Our recognition ability is apparently able to detect invariant features of a pattern -- characteristics that survive real-world variations.
Our conscious experience of our perceptions is actually changed by our interpretations. To solve a given problem we need a certain set of policies to plan with. However, real-world problems are ambiguously defined, so different interpretations arise because each individual has different policies and priors. We are constantly predicting the future and hypothesizing about what we will experience, and this expectation influences what we actually perceive.

Chapter 3:
The neocortex is a pattern-recognition machine.

A One-Page Review of Recent Improvements in Deep Neural Networks for Optical Object Recognition and Detection



Background:


In spite of the tremendous success of neural networks on virtually every front of the machine learning battleground since the early 2000s, let us not forget that not so long ago the field endured a twenty-year winter after being sentenced to death by Marvin Minsky, following the publication of his decisive book Perceptrons in 1969. In that monumental book, Minsky showed mathematically that the perceptron, proposed by Frank Rosenblatt in 1957, although promising in some domains, cannot perform the XOR operation. Although the claim is true only for linear, single-layer perceptrons, Minsky's preeminent status in AI at the time led to a significant decline in interest and funding for neural network research for the next twenty years, until around 2000.
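To make the XOR limitation concrete, here is a minimal sketch (assuming PyTorch; the hyperparameters are illustrative): a single linear unit cannot fit XOR, while adding one hidden layer can.

```python
import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])  # XOR labels

def train(model, steps=3000):
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

single_layer = nn.Linear(2, 1)  # a Rosenblatt-style linear unit
mlp = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))  # one hidden layer

print("single layer loss:", train(single_layer))  # stuck near 0.69: XOR is not linearly separable
print("hidden layer loss:", train(mlp))           # typically approaches zero (may need another seed)
```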

Architectures:


LeNet

The second wave of popularity of neural networks can be dated back to the success of LeNet-5 in solving MNIST handwritten digit recognition. In his 1998 review paper, Yann LeCun claimed that a gradient-based learning algorithm coupled with an appropriate architecture can handle image recognition with efficiency and accuracy. Convolutional neural networks, which in essence exploit the strong local correlation in 2D space and use weight sharing and sub-sampling to achieve shift invariance, serve as a feature-extraction scheme for images; classifying the resulting feature vectors with a multi-layer perceptron laid the foundation for the renaissance of neural networks.
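For concreteness, a rough LeNet-5-style stack might look like the following (a sketch assuming PyTorch; the layer sizes follow the 1998 paper approximately, and details such as the RBF output layer are simplified to a plain linear layer).

```python
import torch.nn as nn

# Expects 1x32x32 inputs (MNIST digits padded from 28x28), as in the original paper.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # C1: 32x32 -> 28x28
    nn.AvgPool2d(2),                             # S2: sub-sampling to 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # C3: 14x14 -> 10x10
    nn.AvgPool2d(2),                             # S4: sub-sampling to 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),       # C5 (written here as a dense layer)
    nn.Linear(120, 84), nn.Tanh(),               # F6
    nn.Linear(84, 10),                           # 10 digit classes
)
```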

AlexNet

In 2012, Alex Krizhevsky's AlexNet set a new record for the ILSVRC classification challenge with a top-5 error of only 15.3%; the second-best model's top-5 error was 26.2%. It consists of five convolutional layers alternating with max-pooling layers, followed by three fully connected layers. AlexNet proved that wider networks with sufficient training data can significantly outperform non-gradient-based classifiers on image classification. Several techniques were introduced to achieve this improvement, such as training on two GPUs, data augmentation, ReLU activation, and dropout regularization.
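One easy way to inspect this layout is through torchvision's stock implementation (assumed here; it differs slightly from the original two-GPU version, e.g. it omits local response normalization).

```python
import torchvision.models as models

alexnet = models.alexnet()   # pretrained ImageNet weights can also be loaded
print(alexnet.features)      # five conv layers interleaved with ReLU and max pooling
print(alexnet.classifier)    # dropout + fully connected layers ending in 1000 classes

n_params = sum(p.numel() for p in alexnet.parameters())
print(f"{n_params / 1e6:.1f}M parameters")   # roughly 61M for the stock model
```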

VGG

Karen Simonyan and Andrew Zisserman's VGG net was the runner-up in the 2014 ILSVRC. Its architecture is very similar to AlexNet and LeNet, but with only one pipeline thanks to improvements in GPUs. The network consists of five feature-extraction blocks; each block consists of multiple convolutional layers and ends with one max-pooling layer. The feature-extraction module is followed by a classification module consisting of three fully connected layers. The advantage of VGG is probably its simple-to-implement structure and open-sourced pretrained weights, which make it easy to understand and deploy for various goals. However, it suffers from a massive number of parameters, adding up to about 550 MB, so training from scratch is slow.
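The repetitive structure is part of why VGG is so easy to reimplement; a sketch of the building pattern (assuming PyTorch, with the block configuration loosely following VGG-16) could look like this.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    # A stack of 3x3 convolutions followed by one 2x2 max-pooling layer.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

features = nn.Sequential(                       # five blocks: 224x224 input -> 7x7 maps
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
classifier = nn.Sequential(                     # three fully connected layers
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(4096, 1000),
)
```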

GoogLeNet

The champion of the 2014 ILSVRC, developed by Google. VGG cannot go much deeper in number of layers, because accuracy drops as that kind of architecture gets deeper (the degradation problem). GoogLeNet first introduced the idea of 'Inception': breaking a fixed-size convolutional kernel apart into multiple convolutional branches with various depths and kernel sizes. This helps extract features better and more easily, because a hand-picked, fixed-size kernel is not optimal for feature extraction and its size is not trainable. With multiple feature extractors of different designs, we avoid this inductive bias in the architecture design. Besides, Inception effectively creates an ensemble of modules with different inductive biases, and combining them can find better features. As a result, the network generalizes better and becomes easier to train. It also introduced the bottleneck 1x1 convolution, and used average pooling in place of fully connected layers to reduce parameters. To conclude, factorization allows the architecture to go deeper. GoogLeNet comes with several upgraded versions that combine tricks from other networks, such as residual learning and batch normalization.
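A sketch of a basic Inception module (assuming PyTorch; the branch widths below follow the '3a' block of the paper, but any widths would illustrate the idea).

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Parallel 1x1, 3x3, 5x5 and pooling branches, with 1x1 bottlenecks."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = Inception(192, 64, 96, 128, 16, 32, 32)   # the "3a" block sizes from the paper
```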

ResNet

As demonstrated by GoogLeNet, going deeper can improve accuracy given 1) the right architecture and 2) sufficient gradient flow. ResNet allows training networks as deep as 152 layers. The contribution of the paper is the observation that learning an identity mapping with a standard convolutional or dense block is not trivial; a residual connection from the input to the output of a block solves the problem. When the identity mapping is optimal, the model can simply drive the block's weights toward zero so that the shortcut passes the input through unchanged, while in general the blocks only have to learn the residual structure of the data. It is easy to understand that at the beginning of training the gradient vanishes across layers; the residual-connection 'highway' allows signals to flow from the last layer back to the first few layers.
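A minimal residual block sketch (assuming PyTorch; this is the basic two-layer variant used in the shallower ResNets, with projection shortcuts for changing dimensions omitted).

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # the shortcut: the block only learns the residual F(x)
```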

DenseNet

DenseNet can be seen as a variation of ResNet. It places shortcut connections not only between adjacent layers but from each layer to every later layer within a block, concatenating their feature maps, which further improves training efficiency. Besides, the classifier uses features from multiple blocks, not just the last one. The increased density of these 'skip connections' can be seen as a form of ensemble learning.
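A sketch of a dense block (assuming PyTorch; simplified relative to the paper, which also uses 1x1 bottleneck convolutions and transition layers between blocks).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer consumes all previous feature maps and adds `growth_rate` new channels."""
    def __init__(self, in_ch, growth_rate, n_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1, bias=False),
            )
            for i in range(n_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Every layer sees the concatenation of all earlier feature maps.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```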

Algorithms:


The Rectified Linear Unit (ReLU) activation function is used to solve the problem of vanishing gradients caused by the saturation of outputs passing through the sigmoid function. ReLU allows training deeper networks (more than eight layers) effectively without having to normalize the input to the activation function.
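A tiny illustration (assuming PyTorch) of why this matters: the sigmoid's gradient shrinks toward zero at both ends, while ReLU passes a gradient of one for any positive input.

```python
import torch

x = torch.tensor([-6.0, -2.0, 0.5, 6.0], requires_grad=True)

torch.sigmoid(x).sum().backward()
print(x.grad)    # ~[0.002, 0.105, 0.235, 0.002]: saturates at both ends

x.grad = None
torch.relu(x).sum().backward()
print(x.grad)    # [0., 0., 1., 1.]: no saturation on the positive side
```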

Local response normalization (LRN) enhances local differences in the feature map while normalizing local responses that are relatively homogeneous. The effect is equivalent to amplifying high-frequency differences while damping low-frequency ones.
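In code this is a one-liner (assuming PyTorch; the constants below are the ones reported for AlexNet, and modern networks mostly use batch normalization instead).

```python
import torch
import torch.nn as nn

lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)  # AlexNet's constants
x = torch.randn(8, 96, 55, 55)   # a batch of conv feature maps
y = lrn(x)                       # each activation is divided by a term summed over
                                 # neighboring channels, damping uniformly strong responses
```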

Overlapping max pooling is a sub-sampling technique that replaces the average pooling of LeNet-5 and also outperforms non-overlapping max pooling. The authors claim that overlapping helps prevent loss of information and therefore reduces error.
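The difference is only in how the pooling windows are laid out, as in this sketch (assuming PyTorch; AlexNet used a 3x3 window with stride 2, so neighbouring windows overlap by one pixel).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)(x)       # 3x3 windows, stride 2: windows overlap
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)(x)   # 2x2 windows, stride 2: windows tile exactly
print(overlapping.shape, non_overlapping.shape)  # both happen to be 27x27 here; only the overlap differs
```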

Dropout is a technique to prevent overfitting by randomly deactivating neurons with a fixed probability, preventing the network from relying too heavily on a subset of active neurons. It forces all subsets of neurons to have an equal chance to learn, in a stochastic fashion. The benefit is threefold. First, it prevents some neurons from dying, so learning is more distributed and less likely to fall into a local minimum. Second, it effectively reduces variance in the same fashion as ensemble learning. Finally, it mitigates the bias of the initial weights by 'shuffling' which sub-model of the ensemble is trained at each iteration.
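A quick sketch of the behaviour (assuming PyTorch): dropout is active only in training mode, where surviving activations are rescaled by 1/(1-p), and it becomes the identity at evaluation time.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the entries are zeroed, survivors scaled to 2.0

drop.eval()
print(drop(x))   # identity: all ones, no neurons dropped at inference
```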

Xavier initialization is a technique to prevent bad initialization of weights. Bad initialization can lead to dead neurons and weak gradients when the biases and weights saturate the activation functions (either end of the sigmoid, or the negative side of ReLU). It is important when training very deep networks because gradients and activations tend to attenuate across layers. It also demonstrates that good initialization does matter.
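In practice this is a single call (assuming PyTorch); Xavier/Glorot scaling keeps the variance of activations roughly constant across layers, and He initialization is the analogous choice for ReLU layers.

```python
import torch.nn as nn

layer = nn.Linear(512, 256)
nn.init.xavier_uniform_(layer.weight)   # U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
nn.init.zeros_(layer.bias)
```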

Batch normalization. The authors argue that standard optimization suffers from internal covariate shift. Normalizing activations to zero mean and unit variance keeps the non-linear activation functions out of their saturated regions, which would otherwise lead to insensitive neurons and vanishing gradients. The benefit is most pronounced when the network is deep.
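The typical usage pattern is conv, then batch norm, then the non-linearity (a sketch assuming PyTorch).

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1, bias=False),  # bias is redundant before batch norm
    nn.BatchNorm2d(128),                           # normalize, then apply a learned scale/shift per channel
    nn.ReLU(inplace=True),
)
```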

The ADAM optimizer has become the default optimizer for whatever model you are training. It combines the useful properties demonstrated by previous optimizers, such as adaptive learning rates, exponential decay of the gradient statistics, and rescaling of the gradient so that updates stay invariant to rescaling of the objective function. Besides, it has the nice property of preferring flat minimum basins.
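Typical usage (assuming PyTorch; the model and data below are placeholders, and the hyperparameters are the defaults from the original paper).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in for any of the networks above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
optimizer.zero_grad()
loss_fn(model(x), y).backward()
optimizer.step()           # per-parameter step sizes adapt to the running gradient moments
```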