NYAI#5: Neural Nets (Jason Yosinski) & #ML For Production (Ken Sanford)
Posted on August 24th, 2016
08/24/2016 @Rise, 43 West 23rd Street, NY, 2nd floor
Jason Yosinski @GeometricTechnology spoke about his work using #NeuralNets to generate pictures. He started by talking about machine learning with feedback: training a robot to move more quickly, and using human feedback to computer-generate pictures that people find appealing.
Jason next talked about AlexNet, based on work by Krizhevsky et al. 2012, which classifies images using a neural net with 5 convolutional layers (interleaved with max pooling and contrast layers) plus 3 fully connected layers at the end. The net, with 60 million parameters, was trained on ImageNet, which contains over 1 million images. His image classification code is available at http://Yosinski.com.
Jason talked about how the classifier thinks about categories it is not being trained to identify. For instance, the network may learn about faces even though there is no human category, since faces help the system detect things such as hats (above a face) by giving it context. It also identifies text, which gives it context for other shapes it is trying to identify.
He next talked about generating images by inputting random noise and randomly changing pixels. Some changes cause the confidence in the goal class (such as 'lion') to increase. Over many random moves, the confidence level for the goal rises. Jason showed many random images that elicited high levels of confidence, but the images often looked like purple-green slime. This is probably because the network, while learning, immediately discards the overall color of the image and is therefore insensitive to aberrations from normal colors. (See Erhan et al. 2009.)
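The random-perturbation loop he described can be sketched as a simple hill climb. Here `confidence` is only a stand-in scoring function (a real experiment would query the trained network's class score), and the image size and step parameters are illustrative:

```python
import random

def confidence(img):
    # Stand-in for the network's class score: rewards bright pixels in
    # the top-left quadrant. A real setup would query the trained CNN.
    return sum(img[r][c] for r in range(2) for c in range(2))

def hill_climb(img, steps=500, seed=0):
    rng = random.Random(seed)
    best = confidence(img)
    for _ in range(steps):
        r, c = rng.randrange(4), rng.randrange(4)
        old = img[r][c]
        # perturb one pixel, clamped to [0, 1]
        img[r][c] = min(1.0, max(0.0, old + rng.uniform(-0.1, 0.1)))
        new = confidence(img)
        if new >= best:
            best = new        # keep the mutation
        else:
            img[r][c] = old   # revert it

    return img, best

img = [[0.5] * 4 for _ in range(4)]
img, score = hill_climb(img)
```

Over many iterations the score only ratchets upward, which is why the resulting images can reach high confidence while looking nothing like natural photographs.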
[This also raises the question of how computer vision is different from human vision. If presented with a blue colored lion, the first reaction of a human might be to note how the color mismatches objects in the ‘lion’ category. One experiment would be to present the computer model with the picture of a blue lion and see how it is classified. Unlike computers, humans encode information beyond their list of items they have learned and this encoding includes extraneous information such as color or location. Maybe the difference is that humans incorporate a semantic layer that considers not only the category of the items, but other characteristics that define ‘lion-ness’. Color may be more central to human image processing as it has been conjectured that we have color vision so we can distinguish between ripe and rotten fruits. Our vision also taps into our expectation to see certain objects within the world and we are primed to see those objects in specific contexts, so we have contextual information beyond what is available to the computer when classifying images.]
To improve the generated pictures of 'lions', he next used a generator to create pictures and change them until one has a high confidence of being a 'lion'. The generator is designed to create identifiable images, and it can even produce pictures of objects that it has not been trained to paint. (Regularization needs to be applied to get better pictures of the target.)
Slides at http://s.yosinski.com/nyai.pdf
In the second talk, Ken Sanford @Ekenomics of H2O.ai talked about the H2O open source project. H2O is a machine learning engine that can run in R, Python, Java, etc.
Ken emphasized how H2O (a multilayer feed-forward neural network) provides a platform that uses the Java Score Code engine. This eases the transition from the model developed in training to the model used to score inputs in a production environment.
He also talked about the Deep Water project, which aims to allow other open source tools, such as MXNet, Caffe, TensorFlow, … (CNN, RNN, … models) to run in the H2O environment.
Advanced #DeepLearning #NeuralNets: #TimeSeries
Posted on June 16th, 2016
06/15/2016 @Qplum, 185 Hudson Street, Jersey City, NJ, suite 1620
Sumit then broke the learning process into two steps: feature extraction and classification. Starting with raw data, the feature extractor is the deep learning model that prepares the data for the classifier, which may be a simple linear model or random forest. In supervised training, errors in the predictions output by the classifier are fed back into the system using back propagation to tune the parameters of the feature extractor and the classifier.
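The two-step pipeline can be illustrated with a tiny numpy network: a nonlinear feature extractor feeding a linear classifier, with the same prediction error backpropagated to tune both. The data, layer sizes, and learning rate are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # raw data
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

W1 = rng.normal(scale=0.5, size=(3, 4))            # feature extractor
W2 = rng.normal(scale=0.5, size=(4, 1))            # linear classifier

def forward(X):
    h = np.tanh(X @ W1)                            # extracted features
    return h, 1.0 / (1.0 + np.exp(-h @ W2))        # class probability

losses = []
for _ in range(200):
    h, p = forward(X)
    err = (p - y) / len(X)                         # prediction error
    # Backpropagation tunes the classifier AND the extractor
    gW2 = h.T @ err
    gW1 = X.T @ ((err @ W2.T) * (1 - h ** 2))      # tanh' = 1 - h^2
    W2 -= 0.5 * gW2
    W1 -= 0.5 * gW1
    losses.append(float(np.mean((p - y) ** 2)))    # monitor squared error
```

The monitored error falls as both components are tuned jointly, which is the point of the two-step decomposition: the classifier can stay simple because the extractor learns the representation.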
In the remainder of the talk Sumit concentrated on how to improve the performance of the feature extractor.
In general text classification (unlike image or speech recognition), the input can be very long and variable in length. In addition, analysis of text by general deep learning models:
- does not capture order of words or predictions in time series
- can handle only small sized windows or the number of parameters explodes
- cannot capture long term dependencies
So the feature extractor is cast as a time delay neural network (#TDNN). In a TDNN, the text is viewed as a string of words. Kernel matrices (usually 3 to 5 words long) are defined which compute dot products of their weights with the words in a contiguous block of text. The kernel matrix is shifted one word at a time and the process is repeated until all words are processed. A second kernel matrix creates another set of features, and so forth for a third kernel, etc.
These features are then pooled using the mean or max of the features. This process is repeated to get additional features. Finally a point-wise non-linear transformation is applied to get the final set of features.
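The kernel-shifting, pooling, and nonlinearity steps above can be sketched with scalar "embeddings" standing in for word vectors (real systems slide kernel matrices over embedding vectors; the numbers here are illustrative):

```python
import math

def conv1d(seq, kernel):
    # Slide the kernel one word at a time, taking a dot product with
    # each contiguous block of the sequence.
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, seq[i:i + k]))
            for i in range(len(seq) - k + 1)]

# toy scalar "embeddings" for a 7-word sentence
embeds = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4]
kernels = [[1.0, -1.0, 1.0], [0.5, 0.5, 0.5]]   # two 3-word kernels

feature_maps = [conv1d(embeds, k) for k in kernels]
pooled = [max(fm) for fm in feature_maps]        # max pooling per map
features = [math.tanh(v) for v in pooled]        # point-wise non-linearity
```

Note that however long the input sentence is, pooling collapses each feature map to one number, which is how this architecture copes with variable-length text.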
Because these methods are newer than traditional neural network structures, no one has yet studied what is revealed in the first layer, second layer, etc. Theoretical work is also lacking on the optimal number of layers for a text sample of a given size.
Historically, #TDNN has struggled with a series of problems, including convergence issues, so recurrent neural networks (#RNN) were developed, in which the encoder looks at the latest data point along with its own previous output. One example is the Elman network, in which each feature is the weighted sum of the kernel function output (one encoder is used for all points in the time series) and the previously computed feature value. Training is conducted as in a standard #NN, using back propagation through time with the gradient accumulated over time before the encoder is re-parameterized. But RNNs have several issues:
1. exploding or vanishing gradients – depending on the largest eigenvalue
2. cannot capture long-term dependencies
3. training is somewhat brittle
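An Elman-style recurrence, and the eigenvalue intuition behind issue 1, can be sketched in a few lines (the weights are illustrative, not from the talk):

```python
import math

def elman_step(x, h_prev, w_x=0.5, w_h=0.9):
    # One shared encoder for all time steps: the new feature depends on
    # the current input and the previously computed feature value.
    return math.tanh(w_x * x + w_h * h_prev)

h = 0.0
for x in [1.0, 0.2, -0.5, 0.7]:   # a toy time series
    h = elman_step(x, h)

# Vanishing-gradient intuition: the gradient through T steps scales
# roughly like w_h**T, which decays when |w_h| < 1 and explodes when
# |w_h| > 1 -- hence the dependence on the largest eigenvalue.
grad_scale = 0.9 ** 50
```

With `w_h = 0.9`, after 50 steps the gradient contribution of the earliest input has shrunk by a factor of over a hundred, which is why plain RNNs cannot capture long-term dependencies.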
The fix is called Long Short-Term Memory (#LSTM), which has additional memory "cells" to store short-term activations, plus additional gates to alleviate the vanishing gradient problem (see Hochreiter et al. 1997). Now each encoder is made up of several parts, as shown in his slides. It can also have a forget gate that turns off all the inputs, and can peek back at the previous values of the memory cell. At Facebook, NLP, speech, and vision recognition all use LSTM models.
LSTM models, however, still don't have a long-term memory. Sumit talked about creating memory networks, which take a story and store its key features in a memory cell. A query runs against the memory cell, and its output vector is concatenated with the text. A second query then retrieves the memory.
He also talked about using the dropout method to fight overfitting. Here, cells randomly determine whether a signal is transmitted to the next layer.
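A minimal sketch of the dropout idea follows; the drop probability and the "inverted" rescaling of survivors are standard choices, not specifics from the talk:

```python
import random

def dropout(features, p=0.5, rng=random.Random(0), train=True):
    if not train:
        return list(features)          # no dropout at inference time
    # Each cell randomly decides whether its signal passes to the next
    # layer; survivors are scaled up so the expected activation is
    # unchanged between training and inference.
    return [x / (1 - p) if rng.random() > p else 0.0 for x in features]

out = dropout([1.0, 2.0, 3.0, 4.0])
```

Because a different random subset of cells fires on every pass, no single cell can be relied on, which is what fights overfitting.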
Autoencoders can be used to pretrain the weights within the NN to avoid creating solutions that are only locally optimal instead of globally optimal.
[Many of these methods are similar in spirit to existing methods. For instance, kernel functions in RNN are very similar to moving average models in technical trading. The different features correspond to averages over different time periods and higher level features correspond to crossovers of the moving averages.
The dropout method is similar to the techniques used in random forests to avoid overfitting.]
#DeepLearning and #NeuralNets
Posted on May 16th, 2016
05/16/2016 @Qplum, 185 Hudson Street, Suite 1620 Plaza 5, Jersey City, NJ
#Raghavendra Boomaraju @Columbia described the math behind neural nets and how back propagation is used to fit models.
Observations on deep learning include:
- The universal approximation theorem says you can fit any model with one hidden layer, provided the layer has a sufficient number of levels. But multiple hidden layers work better: the more layers, the fewer levels you need in each layer to fit the data.
- To optimize the weights, back-propagate the loss function. But one does not need to optimize the g() function since g()’s are designed to have a very general shape (such as the logistic)
- Traditionally, fitting has been done by changing all inputs simultaneously (deterministic) or changing one input at a time during optimization (stochastic). More recently, researchers change subsets of the inputs (minibatches).
- Convolution operators are used to standardize inputs by size and orientation by rotating and scaling.
- There is a way to visualize within a neural network – see http://colah.github.io/
- The gradient method used to optimize weights needs to be tuned so it is neither too aggressive nor too slow. Adaptive learning (Adam algorithm) is used to determine the optimal step size.
- Deep learning libraries include Theano, Caffe, Google TensorFlow, and Torch.
- To do unsupervised deep learning, take inputs through a series of layers that at some point have fewer levels than the number of inputs. The ensuing layers expand so the number of points on the output layer matches that of the input layer. Optimize the net so that the inputs match the outputs. The layer with the smallest number of points describes the features in the data set.
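The bottleneck scheme in the last bullet is an autoencoder; a linear version can be sketched in numpy (6 inputs compressed to 2 features, all sizes and learning rates illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))               # 6 input levels per example

# Bottleneck layer with fewer levels (2) than inputs (6)
W_enc = rng.normal(scale=0.1, size=(6, 2))
W_dec = rng.normal(scale=0.1, size=(2, 6))

losses = []
for _ in range(300):
    Z = X @ W_enc                          # compressed features
    X_hat = Z @ W_dec                      # expanded back to input size
    err = X_hat - X                        # optimize inputs == outputs
    losses.append(float(np.mean(err ** 2)))
    gW_dec = Z.T @ err / len(X)
    gW_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= 0.05 * gW_dec
    W_enc -= 0.05 * gW_enc
```

After training, the 2-column matrix `Z` is the compact description of the data set that the bullet refers to.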
CodeDrivenNYC: #Web #Annotation, #NeuralNets #DeepLearning, #WebGL #Anatomy
Posted on December 17th, 2015
12/16/2015 @FirstMarkCap, 100 5th Ave, NY
Three speakers talked about challenging programming problems and how they solved them.
Matt Brown @Genius talked about how they implemented their product, which allows users to annotate text on web pages. The challenge is locating the annotated text on a page that may have been modified after the annotation was added: the location of the text in the DOM may have changed, and the fragment itself may have been edited, but the annotation should still point to the same part of the text.
To restore the annotation they use fuzzy matching in the following steps:
- Identify regions that may hold the text
- Conduct a fuzzy search to find possible starting and ending points for the matching text
- Highlight the text that is the closest match from the candidates in the fuzzy search
The user highlights text in the original web page and the program stores the highlighted fragment along with text showing the context both before and after the fragment.
When the user loads the web page, the following steps are performed to locate the fragment
- Use jQuery body.text to extract all text from the web site
- Build a list of infrequently used words and locate these words in the web site text
- Reverse the order of characters in both the fragment and the text. Repeat the previous step to determine possible ending points of the fragment in the text.
- Extract candidate locations for the fragment and pick the location which has the minimum Levenshtein distance (fewest character substitutions/inserts/removals).
- Highlight the text in that location. Repeat this process for each stored fragment.
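The final matching step can be sketched with a small Levenshtein function and a sliding window; Genius's production code surely differs, and the page text below is invented, but this illustrates picking the candidate with the minimum edit distance:

```python
def levenshtein(a, b):
    # Fewest character substitutions/inserts/removals to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # removal
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def best_match(fragment, page_text):
    # Slide a fragment-sized window over the page and keep the window
    # with the smallest edit distance to the stored fragment.
    n = len(fragment)
    best = min(range(len(page_text) - n + 1),
               key=lambda i: levenshtein(fragment, page_text[i:i + n]))
    return best, page_text[best:best + n]

pos, text = best_match("annotated text",
                       "the page now contains the anotated text, edited")
```

In practice the candidate regions from the fuzzy word search would bound the window positions, so the quadratic edit-distance cost is only paid on short spans.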
Next, Peter Brodsky @HyperScience spoke about how his company is making the training of neural nets more efficient. HyperScience trains neural nets (containing up to 6 layers) on a variety of tasks (e.g. looking for abnormal employee behavior, reassembling shredded documents, eliminating porn from web sites).
The problems they want to overcome are
- Local minimum solutions are obtained instead of a global minimum
- Expensive to train
- Poor reuse
To overcome these problems they do the following. Once the nets are trained, they examine the nets and extract subnets that have similar patterns of weights. They test whether these subnets are performing common functions by swapping subnets across neural networks. If the performance does not change then they assume that the subnets are performing a common task. Over time they create libraries of subnets.
They can then describe the internal structure of the net in terms of the functions of subnets instead of in terms of nodes. This improves their ability to understand the processing within the net.
This has several advantages.
- They can create larger and more complex networks
- They can start with a weight vector and guide the net away from local minima and toward the global minimum.
- Their networks should learn faster since the standard building blocks are already in place and do not need to be reinvented.
In the third presentation, Tarek Sherif @BioDigital talked about how BioDigital is implementing anatomical content for the web. The challenge is to create 3d, interactive pictures showing human bodies in motion or in sections, layers, etc.
The content can be static, animated or a series of animations. The challenge is to keep the size down for quick downloads, but have the user experience the beauty of the images.
Displaying anatomical content is challenging since it can be
- Deeply nested – e.g. brain inside skull
- Hierarchical – is the click on the hand or the arm?
- Spanning many scales – from cells to the whole body
User interactions can include –highlighting, dissection, isolation, transparency, annotation, rotation,…
Mobile is even more challenging
- Limited memory and GPU
- Variety of devices
- GL variable limits
- Shader precision
- Available extensions
To allow their images to be plugged into web sites, they created an API:
- Create an iframe to embed into a page
- Allows basic interactions
- 3d terminology and concepts
- 3d navigation
- Anatomical concepts
- Architecture of the Human
Examples can be seen at https://developer.biodigital.com
The artists primarily use Maya and Zbrush as their creative tools.
Models can be customized for specific patients.
Investing using #DeepLearning, #MacroTrading and #Chatbots
Posted on June 2nd, 2017
Qplum, 185 Hudson Street , Jersey City, suite 1620
Mansi Singhal and Gaurav Chakravorty @Qplum gave two presentations on how Qplum uses machine learning within a systematic macro investment strategy. Mansi talked about how a macro economic world view is used to focus the ML team on target markets. She walked the audience through an economic analysis of the factors driving the U.S. residential housing market and how an understanding of the drivers (interest rates, GDP, demographics,…) and anticipation of future economic trends (e.g. higher interest rates) would lead them to focus on (or not consider) that market for further analysis by the ML group.
Gaurav (http://slides.com/gchak/deep-learning-making-trading-a-science#/) talked about how they use an AutoEncoder to better understand the factors driving a statistical arbitrage strategy. Here, instead of using a method like principal components analysis, they use a deep learning algorithm to determine the factors driving the prices of a group of stocks. The model uses a relatively shallow neural net. To understand the underlying factors, they look at which factors are the largest driver of current market moves and determine the historical time periods when this factor has been active. One distinction between their factor models and classic CAPM models is that non-linearities are introduced by the activation functions within each layer of the neural net.
Next, Aziz Lookman talked about his analysis showing that county-by-county unemployment rates affect the default rates (and therefore the investment returns) on Lending Club loans.
Lastly, Hardik Patel @Qplum talked about the opportunities and challenges of creating a financial chatbot. The opportunity is that the investment goals and concerns are unique for each customer, so each will have different questions and need different types of information and advice.
The wide variety of questions and answers challenges the developer, so their approach has been to develop an LSTM model of the questions which points the bot to a template that generates the answer. Their initial input will use word vectors and bag-of-words methods to map questions to categories.
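A bag-of-words mapping from questions to answer templates can be sketched as below; the categories, training questions, and overlap scoring are invented for illustration (their system uses word vectors and an LSTM):

```python
from collections import Counter

TEMPLATES = {
    "fees": "Our fee schedule is ...",
    "risk": "Your portfolio risk level is ...",
}
# toy labeled questions per category (assumed data, not Qplum's)
TRAINING = {
    "fees": ["what are your fees", "how much do you charge"],
    "risk": ["how risky is my portfolio", "what is my risk level"],
}

def bag(text):
    # bag-of-words: word counts, ignoring order
    return Counter(text.lower().split())

def classify(question):
    q = bag(question)

    def score(cat):
        cat_bag = Counter()
        for s in TRAINING[cat]:
            cat_bag += bag(s)
        # score = word overlap with the category's bag of words
        return sum(min(q[w], cat_bag[w]) for w in q)

    return max(TEMPLATES, key=score)

cat = classify("what is my risk level please")
```

Each category key then selects the template that generates the actual answer, which is the routing role the LSTM would eventually play.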
What #VideoGames can do for #AI
Posted on May 28th, 2017
05/25/2017 @ Galvanize, 315 Hudson Street, NY, 2nd floor
Julian Togelius @NYU spoke about the state of competitions to create controllers to play video games. Much of what he talked about is contained in his paper on The #Mario AI Championship 2009-2012
The first winner in 2009 used an A* search of the action space. The A* algorithm is a complete search of the graph of possible actions, prioritizing the search by the distance from the origin to each node plus the estimated distance from that node to the goal.
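The A* priority rule (cost from the origin plus estimated cost to the goal) can be sketched on a toy grid; the open 5×5 grid below stands in for Mario's action graph:

```python
import heapq

def a_star(start, goal, neighbors, h):
    # Priority = cost from origin (g) + estimated cost to goal (h)
    frontier = [(h(start), 0, start, [start])]
    seen = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in neighbors(node):
            if nxt not in seen:
                heapq.heappush(frontier,
                               (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None

# 5x5 open grid with a Manhattan-distance heuristic (admissible here)
def neighbors(p):
    x, y = p
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 5 and 0 <= y + dy < 5]

path = a_star((0, 0), (4, 4), neighbors,
              lambda p: abs(4 - p[0]) + abs(4 - p[1]))
```

Because the heuristic never overestimates, the first path popped at the goal is optimal; in the Mario agent the "nodes" were simulated game states rather than grid cells.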
The contest in 2010 was won by Bojarski & Congdon – #Realm using a rule based agent
The competition has expanded to include trying to create Bayesian networks to play Mario Brothers like a human: Togelius & Yannakakis 2012. See https://pdfs.semanticscholar.org/2d0b/34e31f02455c2d370a84645b295af6d59702.pdf
Another part of the competition seeks to create programs that can play multiple games and carry their learning from one game to the next, as opposed to custom programs that can only play a single game.
Therefore they created a general video game playing competition, with games written in the Video Game Description Language (http://people.idsia.ch/~tom/publications/pyvgdl.pdf). Programs are written in Java and access a competition API.
The programs are split into two competitions
- Get the framework, but cannot train – solutions are variations on search
- Do not get the framework, but can train the network – solutions are closer to neural nets
Building #ImageClassification models that are accurate and efficient
Posted on April 28th, 2017
04/28/2017 @NYUCourantInstitute, 251 Mercer Street, NYC, room 109
Laurens van der Maaten @Facebook spoke about some of the new technologies used by Facebook to increase accuracy and lower the processing needed for image identification.
He first talked about residual networks which they are developing to replace standard convolutional neural networks. Residual networks can be thought of as a series of blocks each of which is a tiny #CNN:
- 1×1 layer, like a PCA
- 3×3 convolution layer
- 1×1 layer, inverse PCA
The raw input is added to the output of this mini-network followed by a RELU transformation.
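The block can be sketched with dense matrices standing in for the 1×1 and 3×3 convolutions (sizes and weight scales are illustrative, and a flattened feature vector replaces the spatial feature maps for brevity):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W_reduce, W_conv, W_expand):
    h = relu(x @ W_reduce)   # 1x1 layer, like a PCA (reduce channels)
    h = relu(h @ W_conv)     # stand-in for the 3x3 convolution
    h = h @ W_expand         # 1x1 layer, inverse PCA (restore channels)
    return relu(x + h)       # raw input added back, then ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(8,))
out = residual_block(x,
                     rng.normal(scale=0.1, size=(8, 4)),
                     rng.normal(scale=0.1, size=(4, 4)),
                     rng.normal(scale=0.1, size=(4, 8)))
```

The `x + h` addition is the key design choice: the block only has to learn a correction to its input, not a whole new mapping.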
These transformations extract features while preserving the information that is input into the block, so the map is changed but does not need to be re-learned from scratch. This eliminates some problems with vanishing gradients in back propagation, as well as the unidentifiability problem.
Blocks executed in sequence gradually add features, but removing a block after training hardly degrades performance (Huang et al. 2016). From this observation they concluded that the blocks were performing two functions: detecting new features and passing through some of the information in the raw input. Therefore, the structure could be made more efficient if the information were passed through separately, allowing each block to only extract features.
DenseNets give each block in each layer access to all features in the layer before it. The number of feature maps increases in each layer, so there is the possibility of a combinatorial explosion of units. Fortunately, this does not happen: each layer adds 32 new modules, but the computation is more efficient, so the aggregate amount of computation for a given level of accuracy decreases when using DenseNets compared to ResNets, while accuracy improves.
Next Laurens talked about making image recognition more efficient, so a larger number of images could be processed with the same level of accuracy in a shorter average time.
He started by noting that some images are easier to identify than others. So, the goal is to quickly identify the easy images and only spend further processing time on the harder, more complex images.
The key observation is that easy images can be classified using only a coarse grid, but then harder images would not be classifiable. On the other hand, using a fine grid makes it harder to classify the easy images.
Laurens described a hybrid 2-d network in which some layers analyze the image using the coarse grid and others use the fine grid, with the fine-grained blocks occasionally feeding into the coarse-grained blocks. At each layer, outputs are tested to see if the confidence level for any class exceeds a threshold. Once the threshold is exceeded, processing stops and the prediction is output. In this way, easy decisions are arrived at quickly, while hard images continue further down the layers and require more processing.
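The early-exit logic can be sketched with stub stages whose confidence outputs are invented; a real system would run successive network layers and read off their classifier heads:

```python
def classify_with_exits(image, stages, threshold=0.9):
    # Run cheap stages first; stop as soon as any class confidence
    # clears the threshold, so easy images exit early.
    for depth, stage in enumerate(stages, 1):
        probs = stage(image)
        top = max(probs, key=probs.get)
        if probs[top] >= threshold:
            return top, depth
    return top, depth    # fall through: use the last stage's answer

# assumed stage outputs: an "easy" image is confident at stage 1,
# a "hard" one only at stage 2
easy_stages = [lambda img: {"cat": 0.95, "dog": 0.05},
               lambda img: {"cat": 0.99, "dog": 0.01}]
hard_stages = [lambda img: {"cat": 0.55, "dog": 0.45},
               lambda img: {"cat": 0.92, "dog": 0.08}]

label, depth = classify_with_exits(None, easy_stages)
```

Raising or lowering `threshold` trades accuracy against average depth, which is exactly the knob tuned against the time budget described next.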
By estimating the percentage of images exiting the classifier at each threshold, they can tune the threshold levels so that more images can be processed within a given time budget.
During the Q&A, Laurens said
- To avoid overfitting the model, they train the network on both the original images and the same images after small transformations have been applied.
- They are still working to expand the #DenseNet to see its upper limits on accuracy
- He is not aware of any neurophysiological structures in the human brain that correspond to the structure of blocks in #ResNet / DenseNet.
From #pixels to objects: how the #brain builds rich representation of the natural world
Posted on April 15th, 2017
04/06/2017 @RutgersUniversity, Easton Hub Auditorium, Fiber Optics Building, Busch Campus
Jack Gallant @UCBerkeley presented a survey of current research on mapping the neurophysiology of the visual system in the brain. He first talked about the overall view of visual processing since the Felleman and Van Essen article in Cerebral Cortex in 1991. Their work on the macaque monkey showed that any brain area has a 50% chance of being connected to any other part of the brain. Visual processing can be split into 3 areas:
1. early visual areas
2. intermediate visual areas
3. high-level visual areas
Pooling nonlinear transformations sit between these areas (the inspiration for the non-linear mappings in convolutional neural nets (CNNs)). The visual areas were identified by retinotopic maps – about 60 areas in humans, with macaques having 10 to 15 areas in the V1 region.
Another important contribution was by David J. Field, who argued that the mammalian visual system can only be understood relative to the images it is exposed to. In addition, natural images have a very specific structure – 1/f noise in the power spectrum – due to the occlusion of objects, which can be viewed from any angle (see Olshausen & Field, American Scientist, 2000).
This led to research characterizing natural images by the correlation of pairs of points. Beyond pairs of points, that approach becomes too computationally intensive. In summary, natural images are only a small part of the universe of images (most of which humans would classify as white noise).
Until 2012, researchers needed to specify the characteristics used to identify items in images, but LeCun, Bengio & Hinton, Nature, 2015 showed that AlexNet could resolve many images using multiple-layer models, faster computation, and lots of data. These deep neural nets work well, but the reasons for their success have yet to be worked out (he estimates it will take 5 to 10 years for the math to catch up).
One interesting exercise is running a CNN and then looking for activation in a structure in the brain: mapping the convolutional layers and feature layers to corresponding layers in the visual cortex. This reveals that V1 has bi- or tri-phasic functions – Gabor functions in different orientations. This is highly efficient, as a sparse code needs to activate as few neurons as possible.
Next they used motion-energy models to see how mammals detect motion in V1 voxels (Shinji Nishimoto). They determined that monitoring takes 10 to 20 ms, using Utah arrays to monitor single neurons. They have animals watch movies and analyze the input images using a combination of complex- and simple-cell models (using Keras) to model neurons in V1 and V2 on a 16 ms time scale.
High level visual areas
Jack then talked about research identifying neurons in high-level visual areas that respond to specific stimuli. Starting with fMRI, his group (Huth, Nishimoto, Vu & Gallant, Neuron, 2012) has identified many categories: face areas vs. objects; place minus face. By presenting images and mapping which voxels in the brain are activated, one can see how the 2,000 categories are mapped in the brain, using a wordmap as the labels. Similar concepts are mapped to similar locations in the brain, but specific items in the semantic visual system interact with the semantic language areas – so a 'dog' can activate many areas, letting it be used in different ways and unified as needed. Each person will have a different mapping depending on their previous good and bad experiences with dogs.
He talked about other topics, including the challenges of determining how things are stored in places: Fourier power, object categories, subjective distance. To activate any of these areas in isolation, one needs enough stimulus to activate the earlier layers. They have made progress by building a decoder, from knowledge of the voxels, which runs from the brain area backwards to create a stimulus. A blood-flow model is used with a 2-second minimum sampling period. But there is lots of continuity, so they can reconstruct a series of images.
Intermediate visual area
Intermediate visual areas between the lower and higher levels of processing are hard to understand – consider V4. Its neurons respond to shapes of intermediate complexity, but beyond acting like a curvature detector, not much else is known. Using fMRI, they know which image features correlate with specific areas, but there is no strong indication differentiating one layer from another. With the Utah array, they need to do a log-polar transform to improve prediction in V4. Using a receptive-field model, they can create a predictor frame and match brain activity to the images that gave the largest response.
However, the images are messy, and predicting V4 is not the same as understanding V4.
Finally, he talked about attention and tuning effects on single neurons. In an experiment in which subjects watched a movie and were asked to search for either humans or vehicles, there were changes in the semantic map based on the search criterion. These tuning-shift effects are a function of distance to the visual periphery: attentional effects are small in V1 and get larger in the ensuing layers.
In the Q&A, he made the following points:
- The visual word form area in the brain becomes active as you learn to read. This change does not occur for people who are illiterate.
- One of the experimental assumptions is that the system is stationary, so there is not adaptation. If adaptation does occur, then they cannot compute a noise ceiling for the signals.
[Neural nets take inspiration from the neurobiology, especially the creation of convolutional neural nets, but there is now feedback with neurobiology using the tools created in machine learning to explore possible models of brain mapping. Does the pervasive existence of Gabor filters lead to an argument that their presence indicates that natural images are closely allied with fractal patterns?]
Applications of #DeepLearning in #Healthcare
Posted on March 28th, 2017
03/28/2017 @NYU Courant Institute (251 Mercer St, New York, NY)
Sumit Chopra, the head of A.I. Research @Imagen Technologies, introduced the topic by saying that the two areas of our lives that will be most affected by AI are healthcare and driverless cars.
Healthcare data can be divided into:
- Clinical data – incomplete, since hospitals don't share their datasets; in digital form but with privacy concerns
- Payer data – from insurance providers; more complete (unless the patient switches payers) but less detailed
- Other – cell phones, etc.
He focuses on medical imaging, mainly diagnostic radiology: 600 million studies per year in the U.S., but a shortage of skilled radiologists and a prevalence of errors. The images are very large, high resolution, and low contrast, with highly subtle cues => radiology is hard to do well.
Possible solution: pre-train a standard model (AlexNet/VGG/…) on a small number of images, but this might not work since the signal is subtle.
Also, radiology reports, which could be used for supervised training, are unstructured, and it's hard to tell what a report is telling you => weak labels at best.
Much work has been done on this problem, usually using deep convolutional neural nets.
First step: image registration (rotate & crop). Train a deep convolutional network (the registration network), then send its output to a detection network for binary segmentation.
Generative models for images could be used to train doctors.
Leverage different modalities of data
Sumit has found that a random search of hyperparameter space works better than either grid search or optimizer-based search.
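Random search can be sketched against a toy objective; the loss function, hyperparameter names, and ranges below are invented (a real run would train and validate the model at each sampled point):

```python
import random

def loss(lr, reg):
    # Toy stand-in for validation loss as a function of learning rate
    # and regularization strength; a real run would train a model here.
    return (lr - 0.01) ** 2 + 0.1 * (reg - 0.001) ** 2

rng = random.Random(0)
# Random search: sample both hyperparameters independently on a log
# scale for each trial, instead of walking a fixed grid.
trials = [(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-5, -1))
          for _ in range(50)]
best_lr, best_reg = min(trials, key=lambda t: loss(*t))
```

The usual argument for this is that when only a few hyperparameters matter, random sampling covers each important axis with far more distinct values than a grid of the same size.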
Intro to #DeepLearning using #PyTorch
Posted on February 21st, 2017
02/21/2017 @ NYU Courant Institute (251 Mercer St, New York, NY)
Soumith Chintala @Facebook first talked about trends in the cutting edge of machine learning. His main point was that the world is moving from fixed agents to dynamic neural nets in which agents restructure themselves over time. Currently, the ML world is dominated by static datasets + static model structures which learn offline and do not change their structure without human intervention.
He then talked about PyTorch, which is the next generation of ML tools after Lua #Torch. In creating PyTorch, they wanted to keep the best features of LuaTorch, such as performance and extensibility, while eliminating rigid containers and allowing execution on multi-GPU systems. PyTorch is also designed so programmers can create dynamic neural nets.
Other features include
- Kernel fusion – take several objects and fuse them into a single object
- Order of execution – reorder objects for faster execution
- Automatic work placement when you have multiple GPUs
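The dynamic-graph idea can be sketched without the library itself: in a define-by-run framework, ordinary Python control flow (loops, data-dependent branches) determines the network structure for each input. The weights and branching rule below are invented:

```python
import math

def dynamic_net(xs, w=0.7):
    # Dynamic structure: the number of recurrent steps depends on the
    # input's length at run time, not on a fixed, precompiled graph.
    h = 0.0
    for x in xs:              # plain Python loop defines the graph
        h = math.tanh(w * (x + h))
        if h > 0.9:           # data-dependent branch: add an extra layer
            h = math.tanh(w * h)
    return h

short = dynamic_net([0.5])
long = dynamic_net([0.5, 1.2, -0.3, 2.0, 0.1])
```

In PyTorch the same pattern applies with tensors and autograd: because the graph is rebuilt on every forward pass, each input can flow through a differently shaped network.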
PyTorch is available for download on http://pytorch.org and was released Jan 18, 2017.
Currently, PyTorch runs only on Linux and OSX.