AI ventures need more scientific due diligence
Posted on June 17th, 2017
06/14/2017 @Ebay, 625, 6th Ave, NY 3rd floor
Praveen Paritosh @Google gave a thought-provoking presentation arguing that the current popularity of machine learning may be short-lived unless additional rigor is introduced into the field. Such a fall in interest happened in the late 1980s, in what became known as the “#AI winter”. He argues that greater openness is needed in sharing the successful methods applied to data sets and that we need standardized benchmarks of success.
I believe that the main issue is a lack of theory explaining how the successful methods work and why they are more successful than other methods. The theory needs to use a model of our understanding of the structure of the world to show why a particular method succeeds and why other methods are less successful. This paradigm would also give us a better understanding of the limits of such methods and why the world is structured as it is. It will also give us a cumulative knowledge base upon which to grow new methods.
This point of view is founded on the work of Karl Popper, who argued that a theory in the empirical sciences can never be proven, but it can be falsified, meaning that it can and should be scrutinized by decisive experiments. Here, theory is essential for science, since without theory there is no ability to test the validity of an approach that claims to be science.
One path to generating theory starts with the nature of the physical world and the way humans perceive the world. We assume that the physical world is made up of basic building blocks that assemble themselves in a large, but restricted, number of ways such as that generated by a fractal organization. Organisms, including humans, that take advantage of these regularities have a competitive advantage and have developed effective structures and DNA.
Appeals to greater standardization of the methods of testing machine learning are based on an inductivist approach, which argues that science proceeds by incremental refinements in theory as theory and observations bootstrap themselves using enumerative induction toward universal laws. This approach is generally considered no longer tenable given the 20th century work of Popper, Thomas Kuhn, and other postpositivist philosophers of science including Paul Feyerabend, Imre Lakatos, and Larry Laudan.
Investing using #DeepLearning, #MacroTrading and #Chatbots
Posted on June 2nd, 2017
Qplum, 185 Hudson Street , Jersey City, suite 1620
Mansi Singhal and Gaurav Chakravorty @Qplum gave two presentations on how Qplum uses machine learning within a systematic macro investment strategy. Mansi talked about how a macroeconomic world view is used to focus the ML team on target markets. She walked the audience through an economic analysis of the factors driving the U.S. residential housing market and how an understanding of the drivers (interest rates, GDP, demographics, …) and anticipation of future economic trends (e.g. higher interest rates) would lead them to focus on (or not consider) that market for further analysis by the ML group.
Gaurav (http://slides.com/gchak/deep-learning-making-trading-a-science#/) talked about how they use an AutoEncoder to better understand the factors driving a statistical arbitrage strategy. Here, instead of using a method like principal components analysis, they use a deep learning algorithm to determine the factors driving the prices of a group of stocks. The model uses a relatively shallow neural net. To understand the underlying factors, they look at which factors are the largest drivers of current market moves and determine the historical time periods when those factors have been active. One distinction between their factor models and classic CAPM models is that non-linearities are introduced by the activation functions within each layer of the neural net.
Next, Aziz Lookman talked about his analysis showing that county-by-county unemployment rates affect the default rates (and therefore the investment returns) on loans within Lending Club.
Lastly, Hardik Patel @Qplum talked about the opportunities and challenges of creating a financial chatbot. The opportunity is that the investment goals and concerns are unique for each customer, so each will have different questions and need different types of information and advice.
The wide variety of questions and answers challenges the developer, so their approach has been to develop an LSTM model of the questions which will point the bot to a template that will generate the answer. Their initial input will use word vectors and bag-of-words methods to map questions to categories.
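To make the bag-of-words step concrete, here is a minimal sketch of mapping a question to a category by word overlap. It is not Qplum's code: the categories, the example questions, and the function names are all hypothetical, and cosine similarity against per-category word counts stands in for whatever classifier they actually use.

```python
from collections import Counter
import math

# Hypothetical categories and example questions (not from the talk).
TRAINING = {
    "fees": ["what fees do you charge", "how much does the account cost"],
    "risk": ["how risky is my portfolio", "what happens if the market falls"],
}

def bag_of_words(text):
    """Count word occurrences, ignoring case and word order."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# One summed word-count vector per category.
CENTROIDS = {
    cat: sum((bag_of_words(q) for q in questions), Counter())
    for cat, questions in TRAINING.items()
}

def categorize(question):
    """Return the category whose word counts are most similar to the question."""
    bow = bag_of_words(question)
    return max(CENTROIDS, key=lambda cat: cosine(bow, CENTROIDS[cat]))
```

Once a category is chosen, the bot would look up the answer template for that category; an LSTM would replace `categorize` when word order starts to matter.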
#Post-Selection #StatisticalInference in the era of #MachineLearning
Posted on May 6th, 2017
05/04/2017 @ ColumbiaUniversity, DavisAuditorium, CEPSR
Robert Tibshirani @StanfordUniversity talked about adjusting the cutoffs for statistical significance testing of multiple null hypotheses. The #Bonferroni Correction has been used to adjust for testing multiple hypotheses when the hypotheses are statistically independent. However, with the advent of #MachineLearning techniques, the number of possible tests and their interdependence has exploded.
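As a reminder of what the Bonferroni Correction does, the sketch below divides the significance level by the number of tests and compares each p-value to that stricter cutoff. The function name and the default alpha of 0.05 are illustrative choices, not something taken from the talk.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return, for each p-value, whether its null hypothesis is rejected
    at family-wise level alpha using the Bonferroni cutoff alpha / m."""
    m = len(p_values)
    cutoff = alpha / m
    return [p <= cutoff for p in p_values]
```

With four tests the cutoff drops from 0.05 to 0.0125, so a p-value of 0.02 that would pass on its own no longer counts as significant.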
This is especially true with the application of machine learning algorithms to large data sets with many possible independent variables, which often use forward stepwise or Lasso regression procedures. Machine learning methods often use #regularization, along with splitting the data into training, test and validation sets, to avoid #overfitting the data. For big data applications, these may be adequate since the emphasis is on prediction, not inference. Also, the large size of the data set offsets issues such as the lower power of statistical tests conducted on a subset of the data.
Robert proposed a model for incremental variable selection in which each sequential test slices off parts of the distribution for subsequent tests, creating a truncated normal upon which one can assess the probability of the null hypothesis. This method of polyhedral selection works for a stepwise regression as well as a lasso regression with a fixed lambda.
When the value of lambda is determined by cross-validation, one can use this method by adding 0.1 * sigma noise to the y values. This adjustment retains the power of the test and does not underestimate the probability of accepting the null hypothesis. This method can also be extended to other methods such as logistic regression, the Cox proportional hazards model, and the graphical lasso.
The method can also be extended to consider the number of factors to use in the regression. The goals of this methodology are similar to those described by Bradley #Efron in his 2013 JASA paper on bootstrapping (http://statweb.stanford.edu/~ckirby/brad/papers/2013ModelSelection.pdf) and to those of random matrix theory, which is used to determine the number of principal components in the data as described by the #Marchenko-Pastur distribution.
There is a package in R: selectiveInference
Further information can be found in a chapter on ‘Statistical Learning with Sparsity’ by Hastie, Tibshirani, Wainwright (online pdf) and ‘Statistical Learning and selective inference’ (2015) Jonathan Taylor and Robert J. Tibshirani (PNAS)
Building #ImageClassification models that are accurate and efficient
Posted on April 28th, 2017
04/28/2017 @NYUCourantInstitute, 251 Mercer Street, NYC, room 109
Laurens van der Maaten @Facebook spoke about some of the new technologies used by Facebook to increase accuracy and lower processing needed in image identification.
He first talked about residual networks which they are developing to replace standard convolutional neural networks. Residual networks can be thought of as a series of blocks each of which is a tiny #CNN:
- 1×1 layer, like a PCA
- 3×3 convolution layer
- 1×1 layer, inverse PCA
The raw input is added to the output of this mini-network followed by a RELU transformation.
These transformations extract features while keeping the information that is input into the block, so the map is changed, but does not need to be re-learned from scratch. This eliminates some problems with vanishing gradients in back propagation as well as the unidentifiability problem.
Blocks executed in sequence gradually add features, but removing a block after training hardly degrades performance (Huang et al 2016). From this observation they concluded that the blocks were performing two functions: detecting new features and passing through some of the information in the raw input. Therefore, this structure could be made more efficient if the information were passed through separately, allowing each block to only extract features.
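The block described above can be sketched in a few lines of NumPy. This is an illustration of the general idea only, not Facebook's implementation: the weight shapes, the function names, and the ReLU placed after each inner layer are assumptions, and batch normalization and biases are left out.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """1x1 convolution: mixes channels at each pixel. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3(x, w):
    """3x3 convolution with zero padding, so H and W are unchanged. w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], padded[:, i:i + h, j:j + wd])
    return out

def residual_block(x, w_reduce, w_conv, w_expand):
    """Reduce channels (1x1), convolve (3x3), expand channels (1x1),
    then add the raw input back and apply ReLU."""
    y = relu(conv1x1(x, w_reduce))
    y = relu(conv3x3(y, w_conv))
    y = conv1x1(y, w_expand)
    return relu(x + y)
```

If all three weight arrays are zero, the block simply returns `relu(x)`: the input passes through untouched, which is why a block only has to learn a correction to its input rather than the whole map.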
DenseNets give each block in each layer access to all features in the layers before it. The number of feature maps increases in each layer, so there is the possibility of a combinatorial explosion of units with each layer. Fortunately, this does not happen: each layer adds only 32 new modules and the computation is more efficient, so the aggregate amount of computation for a given level of accuracy decreases when using DenseNet instead of ResNet, while accuracy improves.
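The growth in feature maps is linear rather than explosive, which a small NumPy sketch can show. Here the "32 new modules" are read as 32 new feature maps per layer (an assumption), each layer is reduced to a random 1x1 convolution plus ReLU, and the function name and seed are purely illustrative.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate=32, seed=0):
    """Each layer sees ALL earlier feature maps and appends growth_rate new ones.
    x: (C, H, W). Weights are random stand-ins for trained 1x1 convolutions."""
    rng = np.random.default_rng(seed)
    features = x
    for _ in range(num_layers):
        w = rng.normal(scale=0.1, size=(growth_rate, features.shape[0]))
        new_maps = np.maximum(np.einsum("oc,chw->ohw", w, features), 0.0)
        # Concatenate instead of adding: earlier features are kept unchanged.
        features = np.concatenate([features, new_maps], axis=0)
    return features
```

Starting from 64 channels, six layers give 64 + 6 × 32 = 256 channels, and the first 64 are still the original input.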
Next Laurens talked about making image recognition more efficient, so a larger number of images could be processed with the same level of accuracy in a shorter average time.
He started by noting that some images are easier to identify than others. So, the goal is to quickly identify the easy images and only spend further processing time on the harder, more complex images.
The key is noting that easy images can be classified using only a coarse grid, but harder images cannot be classified that way. On the other hand, using a fine grid makes it harder to classify the easy images.
Laurens described a hybrid 2-d network in which there are layers analyzing the image using the coarse grid and layers analyzing the fine grid. The fine grain blocks occasionally feed into the coarse grain blocks. At each layer outputs are tested to see if the confidence level for any image exceeds a threshold. Once the threshold is exceeded, processing is stopped and the prediction is output. In this way, when the decision is easy, this conclusion is arrived at quickly. Hard images continue further down the layers and require more processing.
By estimating the percentage of images exiting the classifier at each threshold, they can tune the threshold levels so that more images can be processed within a given time budget.
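The early-exit logic can be sketched independently of the network itself. In the sketch below each stage is a hypothetical function that returns class probabilities, with cheaper stages listed first; the 0.9 threshold and all names are assumptions made for illustration.

```python
def classify_with_early_exit(image, stages, threshold=0.9):
    """Run the stages in order and stop as soon as one is confident enough.
    Returns (predicted_class, number_of_stages_used).
    Assumes `stages` is a non-empty list of functions: image -> list of probabilities."""
    for depth, stage in enumerate(stages, start=1):
        probs = stage(image)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            break  # confident: skip the remaining, more expensive stages
    # If no stage was confident, this is the answer of the last stage.
    return best, depth
```

Averaging the second return value over a sample of images gives the expected cost per image, which is the quantity one would trade off against accuracy when tuning the threshold.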
During the Q&A, Laurens said:
- To avoid overfitting the model, they train the network on both the original images and these same images after small transformations have been applied to each image.
- They are still working to expand the #DenseNet to see its upper limits on accuracy.
- He is not aware of any neurophysiological structures in the human brain that correspond to the structure of blocks in #ResNet / DenseNet.
Structured and Scalable Probabilistic Topic Models
Posted on March 24th, 2017
Data Science Institute Colloquium
03/23/2017 Schapiro Hall (CEPSR), Davis Auditorium @Columbia University
John Paisley, Assistant Professor of Electrical Engineering, spoke about models to extract topics and their structure from text. He first talked about topic models in which global variables (in this case words) were extracted from documents. In this bag-of-words approach, the topic proportions were the local variables specific to each document, while the words were common across documents.
Latent Dirichlet Allocation captures the frequency of each word. John also noted that #LDA can be used for things other than topic modeling:
- Capture assumptions with new distributions – is the new thing different?
- Embedded into more complex model structures
Next he talked about moving beyond the “flat” LDA model in which
- No structural dependency among the topics – e.g. not a tree model
- All combinations of topics are a priori equally probable
He moves to a hierarchical topic model in which words are placed as nodes in a tree structure, with more general topics in the root and inner branches. He uses #Bayesian inference to start the tree (assume an infinite number of branches coming out of each node) with each document a subtree within the overall tree. This approach can be further extended to a Markov chain which shows the transitions between each pair of words.
He next showed how the linkages can be computed using Bayesian inference to calculate posterior probabilities for both local and global variables: the joint likelihood of the global and local variables can be factored into a product which is conditional on the probabilities of the global variables.
He next compared the speed-accuracy trade-off for three methods:
- Batch inference – ingest all documents at once, so it’s very slow, but eventually optimal
  - optimize the probability estimates for the local variables across documents (could be very large)
  - optimize the probability estimates for the global variables
- Stochastic inference – ingest small subsets of the documents
  - optimize the probability estimates for the local variables across the documents in the subset
  - take a step toward improving the probability estimates for the global variables
  - repeat using the next subset of the documents
- MCMC – should be more accurate, but #MCMC is incredibly slow, so it can only be run on a subset
John showed that the stochastic inference method converges quickest to an accurate out-of-sample model.
Building an #AI #AutonomousAgent using #SupervisedLearning with @DennisMortensen
Posted on March 23rd, 2017
03/21/2017 @ Rise, 43 West 23rd Street, NY, 2nd floor
In mid 2013, Dennis Mortensen (@DennisMortensen) started x.ai to employ machine learning to set up meetings. After an introduction to the software, Dennis talked about the challenges of creating a conversational agent to act as your assistant setting up business meetings.
He talked about the 3 processes within the agent: NLU + reasoning + NLG
Natural Language Understanding needs to define the universe – what is it we can do and what is it that we cannot do and will not do?
Natural Language Understanding (NLU) Challenges
- Define intents then hire AI trainers. Need to get the intents right since it’s expensive to change to a different scheme
- What data set do we align to? What are the guidelines for labeling? Coders need to learn and remember the rules defining all the intents. Need to keep it compact, but not too much so
- They have 101 AI trainers full time. On what software do they label the words? Need a custom-built annotation platform. Spent 2 years building it.
- How do people want the agent to behave? Manually determine what is supposed to happen. This will create a new intent, but this often requires changes in the coding of the NLU
- Some of the things humans want to do are very complicated. Especially common sense
- don’t schedule meetings after 6:00, but if there is one at 6:15, there is a reason for this happening.
- a 6:30 PM call to Singapore might be a good idea.
- When to have a meeting and when to have a phone call
Natural Language Generation (NLG) challenges
- They have 2 interaction designers
- Need to inject empathy if it’s appropriate. For instance if there is a change in schedule, we need to respond appropriately: understanding initially and more assertive if the change needs to be unchanged. Also need to honor requests to speak in a given language.
They evaluate the performance of the software when being used by a client by
- customer-centric metrics, such as the number of schedule changes
- is the customer happy?
NYAI#7: #DataScience to Operationalize #ML (Matthew Russell) & Computational #Creativity (Dr. Cole)
Posted on November 22nd, 2016
11/22/2016 Rise, 43 West 23rd Street, NY 2nd floor
Speaker 1: Using Data Science to Operationalize Machine Learning – (Matthew Russell, CTO at Digital Reasoning)
Speaker 2: Top-down vs. Bottom-up Computational Creativity – (Dr. Cole D. Ingraham DMA, Lead Developer at Amper Music, Inc.)
Matthew Russell @DigitalReasoning spoke about understanding language using NLP, relationships among entities, and temporal relationships. For human language understanding, he views technologies such as knowledge graphs and document analysis as becoming commoditized. The only way to get an advantage is to improve the efficiency of using ML: the KPI for data analysis is the number of experiments (tests of a hypothesis) that can be run per unit time. The key is to use tools such as:
- Vagrant – allow an environmental setup.
- Jupyter Notebook – like a lab notebook
- Git – version control
- Automation –
He wants highly repeatable experiments. The goal is to speed up the number of experiments that can be conducted per unit time.
He then talked about using machines to read medical reports and determine the issues. Negatives can be extracted, but issues are harder to find. He uses an ontology to classify entities.
He talked about experiments on models using ontologies. The use of a fixed ontology depends on the content: the ontology of terms for anti-terrorism evolves over time and needs to be experimentally adjusted over time. Medical ontology is probably most static.
In the second presentation, Cole D. Ingraham @Ampermusic talked about top-down vs bottom-up creativity in the composition of music. Music differs from other audio forms since it has a great deal of very large structure as well as the smaller structure. ML does well at generating good audio on a small time frame, but Cole thinks it is better to apply theories from music to create the larger whole. This is a combination of
Top-down: novel&useful, rejects previous ideas – code driven, “hands on”, you define the structure
Bottom-up: data driven – data driven, “hands off”, you learn the structure
He then talked about music composition at the intersection of Generation vs. analysis (of already composed music) – can do one without the other or one before the other
To successfully generate new and interesting music, one needs to generate variance. Composing music using a purely probabilistic approach is problematic as there is a lack of structure. He likes the approach similar to replacing words with their synonyms which do not fundamentally change the meaning of the sentence, but still makes it different and interesting.
It’s better to work on deterministically defined variance than it is to weed out undesired results from nondeterministic code.
As an example, he talked about WaveNet (a Google DeepMind project), in which both the input and the output are raw audio. This approach works well for improving speech synthesis, but less well for music generation as there is no large-scale structural awareness.
Cole then talked about Amper, a web site that lets users create music with no experience required: fast, believable, collaborative
They like a mix of top-down and bottom-up approaches:
- Want speed, but neural nets are slow
- Music has a lot of theory behind it, so it’s best to let the programmers code these rules
- Can change different levels of the hierarchical structure within music: style, mood, can also adjust specific bars
Runtime written in Haskell – functional language, so it’s great for music
NYAI#5: Neural Nets (Jason Yosinski) & #ML For Production (Ken Sanford)
Posted on August 24th, 2016
08/24/2016 @Rise 43 West 23rd Street, NY, 2nd floor
Jason Yosinski @GeometricTechnology spoke about his work on #NeuralNets to generate pictures. He started by talking about machine learning with feedback to train a robot to move more quickly and using feedback to computer-generate pictures that are appealing to humans.
Jason next talked about AlexNet, based on work by Krizhevsky et al 2012, to classify images using a neural net with 5 convolutional layers (interleaved with max pooling and contrast layers) plus 3 fully connected layers at the end. The net, with 60 million parameters, was trained on ImageNet, which contains over 1 million images. His image classification code is available on http://Yosinski.com.
Jason talked about how the classifier thinks about categories when it is not being trained to identify that category. For instance, the network may learn about faces even though there is no human category since it helps the system detect things such as hats (above a face) to give it context. It also identifies text to give it context on other shapes it is trying to identify.
He next talked about generating images by inputting random noise and randomly changing pixels. Some changes will cause the confidence in the goal category (such as ‘lion’) to increase. Over many random moves, the confidence level for the goal increases. Jason showed many random images that elicited high levels of confidence, but the images often looked like purple-green slime. This is probably because the network, while learning, immediately discards the overall color of the image and is therefore insensitive to aberrations from normal colors. (See Erhan et al 2009)
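The random-pixel search can be sketched as a simple hill climb. There is no trained network here: `toy_confidence` is a made-up stand-in for the network's confidence in the target class, and the image size, step count, and function names are assumptions chosen to keep the example small.

```python
import numpy as np

def hill_climb_image(score, shape=(8, 8), steps=2000, seed=0):
    """Start from random noise, change one random pixel at a time,
    and keep the change only if the score (confidence) goes up."""
    rng = np.random.default_rng(seed)
    image = rng.random(shape)
    best = score(image)
    for _ in range(steps):
        candidate = image.copy()
        i, j = rng.integers(shape[0]), rng.integers(shape[1])
        candidate[i, j] = rng.random()
        s = score(candidate)
        if s > best:
            image, best = candidate, s
    return image, best

def toy_confidence(image):
    """Stand-in for a classifier's confidence: rewards a bright top half
    and a dark bottom half. A real run would query the network instead."""
    half = image.shape[0] // 2
    return float(image[:half].mean() - image[half:].mean())
```

Nothing in the loop asks whether the image looks natural; it only chases the score, which is consistent with the high-confidence "slime" images described above.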
[This also raises the question of how computer vision is different from human vision. If presented with a blue colored lion, the first reaction of a human might be to note how the color mismatches objects in the ‘lion’ category. One experiment would be to present the computer model with the picture of a blue lion and see how it is classified. Unlike computers, humans encode information beyond their list of items they have learned and this encoding includes extraneous information such as color or location. Maybe the difference is that humans incorporate a semantic layer that considers not only the category of the items, but other characteristics that define ‘lion-ness’. Color may be more central to human image processing as it has been conjectured that we have color vision so we can distinguish between ripe and rotten fruits. Our vision also taps into our expectation to see certain objects within the world and we are primed to see those objects in specific contexts, so we have contextual information beyond what is available to the computer when classifying images.]
To improve the generated pictures of ‘lions’, he next used a generator to create pictures and change them until it produces a picture which has high confidence of being a ‘lion’. The generator is designed to create identifiable images. The generator can even produce pictures of objects that it has not been trained to paint. (Need to apply regularization to get better pictures for the target.)
Slides at http://s.yosinski.com/nyai.pdf
In the second talk, Ken Sanford @Ekenomics and H2O.ai talked about the H2O open source project. H2O is a machine learning engine that can run in R, Python, Java, etc.
Ken emphasized how H2O (a multilayer feed forward neural network) provides a platform that uses the Java Score Code engine. This eases the transition between the model developed in training and the model used to score inputs in a production environment.
He also talked about the Deep Water project which aims to allow other open source tools, such as MXNET, Caffe, Tensorflow,… (CNN, RNN, … models) to run in the H2O environment.
Automatically scalable #Python & #Neuroscience as it relates to #MachineLearning
Posted on June 28th, 2016
06/28/2016 @Rise, 43 West 23rd Street, NY, 2nd floor
Braxton McKee (@braxtonmckee ) @Ufora first spoke about the challenges of creating a version of Python (#Pyfora) that naturally scales to take advantage of the hardware to handle parallelism as the problem grows.
Braxton presented an example in which we compute the minimum distance from target points to a larger universe of points based on their Cartesian coordinates. This is easily written for small problems, but the computation needs to be optimized when computing this value across many CPUs.
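For a small problem, the computation looks something like the sketch below. This is not Braxton's code: the function name is made up, and `math.dist` (Python 3.8+) is used for the Euclidean distance. The nested loops are the part a system like Pyfora would need to split across machines.

```python
import math

def min_distances(targets, universe):
    """For each target point, return the distance to its nearest point in the universe.
    Points are coordinate tuples; the cost is len(targets) * len(universe) distance evaluations."""
    return [min(math.dist(t, p) for p in universe) for t in targets]
```

Whether it is better to parallelize the outer loop over targets or the inner loop over the universe depends on their relative sizes, which is the allocation question discussed next.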
However, the allocation across CPUs depends on the number of targets relative to the size of the point universe. Instead of trying to solve this analytically, they use a #Dynamicrebalancing strategy that splits the task and adds resources to the subtasks creating bottlenecks.
This approach solves many resource allocation problems, but still faces challenges
- nested parallelism. They look for parallelism within the code and look for bottlenecks at the top level of parallelism and split the task into subtasks at that level, …
- the data do not fit in memory. They break tasks into smaller tasks. They also have each task know which other caches hold data, so they can be accessed directly without going to slower main memory
- different types of architectures (such as gpu’s) require different types of optimization
- the optimizer cannot look inside python packages, so cannot optimize a bottleneck within a package.
Pyfora:
- is a just-in-time compiler that moves stack frames from machine to machine and senses how to take advantage of parallelism
- tracks what data a thread is using
- dynamically schedules threads and data
- takes advantage of immutability, which allows the compiler to assume that functions do not change over time, so the compiler can look inside the function when optimizing execution
- is written on top of another language, which allows for the possibility of porting the method to other languages
In the second presentation, Jeremy Freeman @Janelia.org spoke about the relationship between neuroscience research and machine learning models. He first talked about the early work on understanding the function of the visual cortex.
Findings by Hubel & Wiesel in 1959 have set the foundation for visual processing models for the past 40 years. They found that individual neurons in the V1 area of the visual cortex responded to the orientation of lines in the visual field. These inputs fed neurons that detect more complex features, such as edges, moving lines, etc.
Others also considered systems which have higher level recognition and how to train a system. These include
Perceptrons by Rosenblatt, 1957
Neocognitrons by Fukushima, 1980
Hierarchical learning machines, Lecun, 1985
Back propagation by Rumelhart, 1986
His doctoral research looked at the activity of neurons in the V2 area. They found they could generate high-order patterns that some neurons discriminate among.
But in 2012, there was a jump in performance of neural nets – U. of Toronto
By 2014, some of the neural network algos performed better than humans and primates, especially in the area of image processing. This has led to many advances such as Google deepdream, which combines images and texture to create an artistic hybrid image.
Recent scientific research allows one to look at thousands of neurons simultaneously. He also talked about some of his current research which uses “tactile virtual reality” to examine the neural activity as a mouse explores a maze (the mouse walks on a ball that senses its steps as it learns the maze).
Jeremy also spoke about Model-free episodic control for complex sequential tasks requiring memory and learning. ML research has created models such as LSTM and Neural Turing Nets which retain state representations. Graham Taylor has looked at neural feedback modulation using gates.
He also notes that there are similar functionalities between the V1 area in the visual cortex, the A1 auditory area, and the S1, tactile area.
To find out more, he suggested visiting his github site, Freeman-lab, and looking at the web site neurofinder.codeneuro.org.
DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.
Posted on June 18th, 2016
06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)
The four speakers were
- Nitay Joffe, Founder and CTO of ActionIQ (next-generation data platform for marketing and consumer data)
- Adam Kanouse, CTO of Narrative Science (transforms data into meaningful and insightful narratives)
- Neha Narkhede, Founder and CTO of Confluent (real-time data platform built around Apache Kafka)
- Christopher Nguyen, Founder and CEO of Arimo (data intelligence platform)
Adam @NarrativeScience talked about how people with different personalities and jobs may require/prefer different takes on the same data. His firm ingests data and has systems to generate natural language reports customized to the subject area and the reader’s needs.
They currently develop stories with the guidance of experts, but eventually will move to machine learning to automate new subject areas.
Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform which collects data and allows access to these data in real time.