AI ventures need more scientific due diligence
Posted on June 17th, 2017
06/14/2017 @Ebay, 625, 6th Ave, NY 3rd floor
Praveen Paritosh @Google gave a thought-provoking presentation arguing that the current popularity of machine learning may be short-lived unless additional rigor is introduced into the field. A similar fall in interest happened in the late 1980s and became known as the “#AI winter”. He argues that greater openness is needed in sharing the successful methods applied to data sets, and that we need standardized benchmarks of success.
I believe that the main issue is a lack of theory explaining how the successful methods work and why they are more successful than other methods. The theory needs to use a model of our understanding of the structure of the world to show why a particular method succeeds and why other methods are less successful. This paradigm would also give us a better understanding of the limits of such methods and of why the world is structured as it is. It would also give us a cumulative knowledge base upon which to grow new methods.
This point of view is founded on the work of Karl Popper, who argued that a theory in the empirical sciences can never be proven, but it can be falsified, meaning that it can and should be scrutinized by decisive experiments. Here, theory is essential for science since without theory there is no way to test the validity of an approach that claims to be science.
One path to generating theory starts with the nature of the physical world and the way humans perceive the world. We assume that the physical world is made up of basic building blocks that assemble themselves in a large, but restricted, number of ways such as that generated by a fractal organization. Organisms, including humans, that take advantage of these regularities have a competitive advantage and have developed effective structures and DNA.
Appeals to greater standardization of the methods of testing machine learning are based on an inductivist approach, which argues that science proceeds by incremental refinements as theory and observations bootstrap themselves, using enumerative induction, toward universal laws. This approach is generally considered no longer tenable given the 20th-century work of Popper, Thomas Kuhn, and other post-positivist philosophers of science including Paul Feyerabend, Imre Lakatos, and Larry Laudan.
Investing using #DeepLearning, #MacroTrading and #Chatbots
Posted on June 2nd, 2017
Qplum, 185 Hudson Street, Jersey City, suite 1620
Mansi Singhal and Gaurav Chakravorty @Qplum gave two presentations on how Qplum uses machine learning within a systematic macro investment strategy. Mansi talked about how a macroeconomic world view is used to focus the ML team on target markets. She walked the audience through an economic analysis of the factors driving the U.S. residential housing market and how an understanding of the drivers (interest rates, GDP, demographics,…) and anticipation of future economic trends (e.g. higher interest rates) would lead them to focus on (or not consider) that market for further analysis by the ML group.
Gaurav (http://slides.com/gchak/deep-learning-making-trading-a-science#/) talked about how they use an AutoEncoder to better understand the factors driving a statistical arbitrage strategy. Here, instead of using a method like principal components analysis, they use a deep learning algorithm to determine the factors driving the prices of a group of stocks. The model uses a relatively shallow neural net. To understand the underlying factors, they look at which factors are the largest driver of current market moves and determine the historical time periods when this factor has been active. One distinction between their factor models and classic CAPM models is that non-linearities are introduced by the activation functions within each layer of the neural net.
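The autoencoder-as-factor-model idea can be sketched in a few lines of numpy. This is a toy illustration, not Qplum's model: the returns data is synthetic, the network is a single tanh hidden layer trained by plain gradient descent, and the three latent units stand in for the learned factors (the tanh is what introduces the non-linearity that plain PCA lacks).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns for 20 stocks driven by 3 hidden factors
# (the data here is invented; the real inputs would be market prices).
n_days, n_stocks, n_factors = 500, 20, 3
true_factors = rng.normal(size=(n_days, n_factors))
loadings = rng.normal(size=(n_factors, n_stocks))
returns = true_factors @ loadings + 0.1 * rng.normal(size=(n_days, n_stocks))

# Shallow autoencoder: encode to 3 latent units (the learned "factors"),
# then decode back to the 20 stock returns.
W_enc = rng.normal(scale=0.1, size=(n_stocks, n_factors))
W_dec = rng.normal(scale=0.1, size=(n_factors, n_stocks))

def reconstruct(r):
    return np.tanh(r @ W_enc) @ W_dec

mse_before = float(np.mean((reconstruct(returns) - returns) ** 2))
lr = 0.02
for _ in range(2000):
    h = np.tanh(returns @ W_enc)          # latent factor exposures
    err = (h @ W_dec) - returns           # reconstruction error
    grad_dec = h.T @ err / n_days         # backprop through decoder
    grad_h = (err @ W_dec.T) * (1 - h ** 2)
    grad_enc = returns.T @ grad_h / n_days
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
mse_after = float(np.mean((reconstruct(returns) - returns) ** 2))
```

After training, the columns of `W_enc` play the role that principal components would play in a linear factor model.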
Next, Aziz Lookman talked about his analysis showing that county-by-county unemployment rates affect the default rates (and therefore the investment returns) on loans within Lending Club.
Lastly, Hardik Patel @Qplum talked about the opportunities and challenges of creating a financial chatbot. The opportunity is that the investment goals and concerns are unique for each customer, so each will have different questions and need different types of information and advice.
The wide variety of questions and answers challenges the developer, so their approach has been to develop an LSTM model of the questions that points the bot to a template that generates the answer. Their initial input will use word vectors and bag-of-words methods to map questions to categories.
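A minimal, stdlib-only sketch of the bag-of-words routing idea (the categories, example questions, and scoring rule below are invented for illustration; the production system uses word vectors and an LSTM):

```python
from collections import Counter

# Toy routing table: example questions labeled with a template category.
training = [
    ("how much should i save for retirement", "retirement"),
    ("what is my retirement plan balance", "retirement"),
    ("how risky is my portfolio", "risk"),
    ("can i lower the risk of my investments", "risk"),
    ("what fees am i paying", "fees"),
]

def bow(text):
    """Bag-of-words representation: word counts, order ignored."""
    return Counter(text.lower().split())

def classify(question):
    """Route a question to the category whose examples share the most words."""
    q = bow(question)
    scores = Counter()
    for example, category in training:
        scores[category] += sum((q & bow(example)).values())
    return scores.most_common(1)[0][0]

print(classify("is my portfolio too risky"))  # risk
```

The matched category would then select the answer template.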
#Post-Selection #StatisticalInference in the era of #MachineLearning
Posted on May 6th, 2017
05/04/2017 @ ColumbiaUniversity, DavisAuditorium, CEPSR
Robert Tibshirani @StanfordUniversity talked about adjusting the cutoffs for statistical significance when testing multiple null hypotheses. The #Bonferroni Correction has long been used to adjust for testing multiple hypotheses when the hypotheses are statistically independent. However, with the advent of #MachineLearning techniques, the number of possible tests and their interdependence has exploded.
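The Bonferroni Correction itself is one line of arithmetic: to hold the family-wise error rate at alpha across m independent tests, each individual test is held to alpha / m. A tiny example with invented p-values:

```python
# Bonferroni: with 20 tests at a family-wise error rate of 0.05,
# each individual test must clear 0.05 / 20 = 0.0025.
alpha, m = 0.05, 20
cutoff = alpha / m
p_values = [0.001, 0.004, 0.03, 0.2]
rejected = [p for p in p_values if p < cutoff]
print(cutoff, rejected)  # only the smallest p-value survives
```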
This is especially true with the application of machine learning algorithms to large data sets with many possible independent variables, which often use forward stepwise or Lasso regression procedures. Machine learning methods often use #regularization methods to avoid #overfitting the data, such as splitting the data into training, test, and validation sets. For big data applications, these may be adequate since the emphasis is on prediction, not inference. Also, the large size of the data set offsets issues such as the loss of power in the statistical tests conducted on a subset of the data.
Robert proposed a model for incremental variable selection in which each sequential test slices off part of the distribution for subsequent tests, creating a truncated normal upon which one can assess the probability of the null hypothesis. This method of polyhedral selection works for a stepwise regression as well as for a lasso regression with a fixed lambda.
When the value of lambda is determined by cross-validation, one can still use this method by adding 0.1 * sigma noise to the y values. This adjustment retains the power of the test and does not underestimate the probability of accepting the null hypothesis. The method can also be extended to other settings such as logistic regression, the Cox proportional hazards model, and the graphical lasso.
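The core computation behind the truncated-normal test can be sketched directly: once selection restricts the null distribution of a z-statistic to an interval [a, b], the selective p-value is the tail probability within that interval. This is a simplified illustration of the idea, not the selectiveInference implementation:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def truncated_p_value(z, a, b):
    """P(Z >= z | a <= Z <= b) for standard normal Z: the selective
    p-value once selection has truncated the null to [a, b]."""
    return (phi(b) - phi(z)) / (phi(b) - phi(a))

# A z-score of 2.5 looks highly significant unconditionally...
naive = 1 - phi(2.5)
# ...but if the selection event already required z > 2, much less so.
selective = truncated_p_value(2.5, 2.0, float("inf"))
```

The selective p-value is much larger than the naive one, which is exactly the correction for having peeked at the data during selection.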
The method can also be extended to consider the number of factors to use in the regression. This goals of this methodology are similar to those described by Bradley #Efron in his 2013 JASA paper on bootstrapping (http://statweb.stanford.edu/~ckirby/brad/papers/2013ModelSelection.pdf) and random matrix theory used to determine the number of principal components in the data as described by the #Marchenko-Pastur distribution.
There is a package in R: selectiveInference
Further information can be found in a chapter of ‘Statistical Learning with Sparsity’ by Hastie, Tibshirani, and Wainwright (online pdf), and in ‘Statistical Learning and Selective Inference’ (2015) by Jonathan Taylor and Robert J. Tibshirani (PNAS).
Building #ImageClassification models that are accurate and efficient
Posted on April 28th, 2017
04/28/2017 @NYUCourantInstitute, 251 Mercer Street, NYC, room 109
Laurens van der Maaten @Facebook spoke about some of the new technologies used by Facebook to increase accuracy and lower the processing needed for image identification.
He first talked about residual networks which they are developing to replace standard convolutional neural networks. Residual networks can be thought of as a series of blocks each of which is a tiny #CNN:
- 1×1 layer, like a PCA
- 3×3 convolution layer
- 1×1 layer, inverse PCA
The raw input is added to the output of this mini-network followed by a RELU transformation.
These transformations extract features while keeping the information that was input into the block, so the map is changed but does not need to be re-learned from scratch. This mitigates problems with vanishing gradients in back propagation as well as the unidentifiability problem.
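The block structure above can be sketched with dense layers standing in for the 1×1 / 3×3 / 1×1 convolutions (a simplification: real residual blocks operate on spatial feature maps, and the weights here are random rather than trained):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0)

def residual_block(x, w_reduce, w_mid, w_expand):
    """Bottleneck residual block: f(x) extracts features, and the raw
    input is added back before the final ReLU (the skip connection)."""
    h = relu(x @ w_reduce)   # stands in for 1x1: reduce dims, PCA-like
    h = relu(h @ w_mid)      # stands in for 3x3: feature extraction
    h = h @ w_expand         # stands in for 1x1: inverse-PCA-like expand
    return relu(x + h)       # add raw input, then ReLU

d, bottleneck = 64, 16
x = rng.normal(size=(8, d))
w1 = rng.normal(scale=0.1, size=(d, bottleneck))
w2 = rng.normal(scale=0.1, size=(bottleneck, bottleneck))
w3 = rng.normal(scale=0.1, size=(bottleneck, d))
out = residual_block(x, w1, w2, w3)
```

Because `x` is added back unchanged, the block only has to learn a residual correction, which is what eases gradient flow.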
Blocks executed in sequence gradually add features, but removing a block after training hardly degrades performance (Huang et al. 2016). From this observation they concluded that the blocks were performing two functions: detecting new features and passing through some of the information in the raw input. Therefore, the structure could be made more efficient if the architecture itself passed the information through, allowing each block to focus solely on extracting features.
DenseNet gives each block in each layer access to all features in the layers before it. The number of feature maps increases in each layer, so there is the possibility of a combinatorial explosion of units. Fortunately, this does not happen: each layer adds only 32 new feature maps, and the computation is more efficient, so the aggregate amount of computation for a given level of accuracy decreases when using DenseNet instead of ResNet, while accuracy improves.
Next Laurens talked about making image recognition more efficient, so a larger number of images could be processed with the same level of accuracy in a shorter average time.
He started by noting that some images are easier to identify than others. So, the goal is to quickly identify the easy images and only spend further processing time on the harder, more complex images.
The key observation is that easy images can be classified using only a coarse grid, but with only a coarse grid the harder images would not be classifiable. On the other hand, using a fine grid makes it harder to classify the easy images.
Laurens described a hybrid 2-d network in which there are layers analyzing the image using the coarse grid and layers analyzing the fine grid. The fine grain blocks occasionally feed into the coarse grain blocks. At each layer outputs are tested to see if the confidence level for any image exceeds a threshold. Once the threshold is exceeded, processing is stopped and the prediction is output. In this way, when the decision is easy, this conclusion is arrived at quickly. Hard images continue further down the layers and require more processing.
By estimating the percentage of images exiting at each threshold, they can tune the threshold levels so that more images can be processed within a given time budget.
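The early-exit idea can be sketched with a toy cascade (the stages and confidence numbers below are invented; the real model thresholds the softmax outputs of intermediate classifiers):

```python
def classify_with_early_exit(stages, image, threshold=0.9):
    """Run classifier stages in order; return as soon as one stage's top
    confidence clears the threshold. Easy images exit early; hard images
    pay for more stages."""
    cost = 0
    for stage in stages:
        cost += 1
        probs = stage(image)
        best = max(probs, key=probs.get)
        if probs[best] >= threshold:
            return best, cost
    return best, cost  # fell through: return the last stage's answer

# Hypothetical stages: a cheap coarse model and an expensive fine model.
coarse = lambda img: ({"cat": 0.95, "dog": 0.05} if img == "easy"
                      else {"cat": 0.55, "dog": 0.45})
fine = lambda img: {"cat": 0.10, "dog": 0.90}

print(classify_with_early_exit([coarse, fine], "easy"))  # ('cat', 1)
print(classify_with_early_exit([coarse, fine], "hard"))  # ('dog', 2)
```

Lowering the threshold trades accuracy for throughput, which is how the time budget is met.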
During the Q&A, Laurens said
- To avoid overfitting the model, they train the network on both the original images and the same images after small transformations have been applied to each one.
- They are still working to expand the #DenseNet to see its upper limits on accuracy
- He is not aware of any neurophysiological structures in the human brain that correspond to the structure of blocks in #ResNet / DenseNet.
Structured and Scalable Probabilistic Topic Models
Posted on March 24th, 2017
Data Science Institute Colloquium
03/23/2017 Schapiro Hall (CEPSR), Davis Auditorium @Columbia University
John Paisley, Assistant Professor of Electrical Engineering, spoke about models that extract topics and their structure from text. He first talked about topic models in which the global variables (the topics, i.e. distributions over words) are shared across documents. In this bag-of-words approach, the topic proportions are the local variables specific to each document, while the topics are common across documents.
Latent Dirichlet Allocation captures the frequency of each word. John also noted that #LDA can be used for things other than topic modeling:
- Capture assumptions with new distributions – is the new thing different?
- Embedded into more complex model structures
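LDA's generative story, with global topics shared across documents and local topic proportions per document, can be sketched in numpy (the vocabulary and topic distributions below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["market", "stock", "gene", "dna", "game", "score"]
# Global variables: one word distribution per topic.
topics = np.array([
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],   # a "finance" topic
    [0.02, 0.02, 0.45, 0.45, 0.03, 0.03],   # a "biology" topic
])

def generate_document(n_words, alpha=0.5):
    # Local variable: this document's topic proportions.
    theta = rng.dirichlet([alpha] * len(topics))
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)          # pick a topic
        words.append(rng.choice(vocab, p=topics[z]))  # pick a word from it
    return words

doc = generate_document(20)
```

Inference runs this story in reverse: given only the words, recover the topics (global) and each document's proportions (local).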
Next he talked about moving beyond the “flat” LDA model in which
- No structural dependency among the topics – e.g. not a tree model
- All combinations of topics are a priori equally probable
to a hierarchical topic model in which topics are placed as nodes in a tree structure, with more general topics at the root and inner branches. He uses #Bayesian inference to grow the tree (assuming an infinite number of branches coming out of each node), with each document a subtree within the overall tree. This approach can be further extended to a Markov chain that models the transitions between each pair of words.
He next showed how the linkages can be computed using Bayesian inference to calculate posterior probabilities for both local and global variables: the joint likelihood of the global and local variables can be factored into a product which is conditional on the probabilities of the global variables.
He next compared the speed-accuracy trade-off for three methods:
- Batch inference – ingest all documents at once; very slow, but eventually optimal
  - optimize the probability estimates for the local variables across all documents (could be very large)
  - optimize the probability estimates for the global variables
- Stochastic inference – ingest small subsets of the documents
  - optimize the probability estimates for the local variables in the subset
  - take a step to improve the probability estimates for the global variables
  - repeat using the next subset of the documents
- MCMC – should be more accurate, but #MCMC is incredibly slow, so it can only be run on a subset
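The stochastic loop can be illustrated with a deliberately simple toy: estimating a single global parameter by repeatedly taking a decaying step toward the estimate implied by each minibatch (a stand-in for the variational updates on real global variables; the data is synthetic):

```python
import random

random.seed(0)

# 10,000 "documents", each reduced to a single number with true mean 5.0.
data = [random.gauss(5.0, 1.0) for _ in range(10_000)]

global_est = 0.0
for t in range(1, 501):
    batch = random.sample(data, 50)         # ingest a small subset
    local_est = sum(batch) / len(batch)     # optimize the local estimate
    step = 1.0 / t                          # decaying step size
    global_est += step * (local_est - global_est)  # step toward it
```

Each iteration touches only 50 points, yet the decaying step sizes let the global estimate converge, which is why stochastic inference scales where batch inference cannot.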
John showed that the stochastic inference method converges quickest to an accurate out-of-sample model.
Critical Approaches to #DataScience & #MachineLearning
Posted on March 18th, 2017
3/17/2017 @Hunter College, 68th & Lexington Ave, New York, Lang Theater
Geetu Ambwani @HuffingtonPost @geetuji spoke about how the Huffington Post is looking at data as a way around the filter bubble, which separates individuals from views that are contrary to their previously held beliefs. Filter bubbles are believed to be a major reason for the current levels of polarization in society.
She talked about ways that the media can respond to this confirmation bias:
- Show opposing point of view
- Show people their bias
- Show source credibility
For instance, Chrome and Buzzfeed have tools that will insert opposing points of view into your news feed. FlipFeed enables you to easily load another feed. AlephPost clusters articles and color-codes them to indicate the source’s vantage point. However, showing people opposing views can backfire.
Second, Read Across the Spectrum will show you your biases. Politico will show you how blue or red you are by indicating the color of your information sources.
Third, one can show source credibility and where it lies on the political spectrum
However, there is still a large gap between what is produced by the media and what consumers want. Also, this does not remove the problem that ad dollars are awarded for “engagement”, which means that portals are incentivized to continue delivering what the reader wants.
Next, Justin Hendrix @NYC Media Lab (consortium of universities started by the city of NY) talked about emerging media technologies. Examples were
- Vidrovr – teach computers how to watch video – produce searchable tags.
- Data Selfie project – from The New School. See the data which Facebook has on us. A Chrome extension; 100k downloads in the first week.
- Braiq – connect the mind with the on-board self-driving software on cars. Build software which is more reactive to the needs and wants of the passenger. Technology in the headrest and other inputs that will talk to the self-driving AI.
The follow up discussion covered a wide range of topics including
- The adtech fraud is known, but no one has the incentive to address. Fake audience – bots clicking sites
- Data sources are readily available lead by the Twitter or Facebook APIs. Get on github for open source code on downloading data
- Was the 20th century an aberration as to how information was disseminated? We might just be going back to a world with pools of information.
- What are the limits on what points of view any media company is willing to explore?
- What is the future of work and the social contract as jobs disappear?
Intro to #DeepLearning using #PyTorch
Posted on February 21st, 2017
02/21/2017 @ NYU Courant Institute (251 Mercer St, New York, NY)
Soumith Chintala @Facebook first talked about trends in the cutting edge of machine learning. His main point was that the world is moving from fixed agents to dynamic neural nets in which agents restructure themselves over time. Currently, the ML world is dominated by static datasets + static model structures which learn offline and do not change their structure without human intervention.
He then talked about PyTorch, the next generation of ML tools after Lua #Torch. In creating PyTorch they wanted to keep the best features of LuaTorch, such as performance and extensibility, while eliminating rigid containers and allowing execution on multi-GPU systems. PyTorch is also designed so programmers can create dynamic neural nets.
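The kind of data-dependent control flow that define-by-run frameworks like PyTorch make natural can be illustrated in plain numpy: here the network's depth is decided at runtime from the activations themselves (a conceptual sketch with random untrained weights, not PyTorch code):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0)

def dynamic_forward(x, weights):
    """A 'define-by-run' forward pass: keep applying layers while the
    activations are still changing appreciably, then stop. The graph's
    depth therefore depends on the input, which static-graph frameworks
    handle awkwardly but define-by-run frameworks express directly."""
    layers_used = 0
    for w in weights:
        new_x = relu(x @ w)
        layers_used += 1
        if np.abs(new_x - x).mean() < 0.05:   # converged: stop early
            return new_x, layers_used
        x = new_x
    return x, layers_used

d = 8
weights = [rng.normal(scale=1 / np.sqrt(d), size=(d, d)) for _ in range(10)]
out, depth = dynamic_forward(rng.normal(size=(1, d)), weights)
```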
Other features include
- Kernel fusion – take several objects and fuse them into a single object
- Order of execution – reorder objects for faster execution
- Automatic work placement when you have multiple GPUs
PyTorch is available for download on http://pytorch.org and was released Jan 18, 2017.
Currently, PyTorch runs only on Linux and OSX.
#ComputerScience and #DigitalHumanities
Posted on December 8th, 2016
PRINCETON #ACM / #IEEE-CS CHAPTERS DECEMBER 2016 JOINT MEETING
12/08/2016 @Princeton University Computer Science Building, Small Auditorium, Room CS 105, Olden and William Streets, Princeton NJ
Brian Kernighan @Princeton University spoke about how computers can assist in understanding research topics in the humanities.
He started by presenting examples of web sites with interactive tools for exploring historical material
- Explore a northern and a southern town during the Civil War: http://valley.lib.virginia.edu/
- Expedia for a traveler across ancient Rome: http://orbis.stanford.edu/
- The court records in London from 1674-1913: https://www.oldbaileyonline.org/
- Hemingway and other literary stars in Paris from the records of Sylvia Beach
Brian then talked about the challenges of converting the archival data: digitize, meta tag, store, query, present results, make available to the public
In preparation for teaching a class this fall on digital humanities, he talked about his experience extracting information from a genealogy of the descendants of Nicholas Cady (https://archive.org/details/descendantsofnic01alle) in the U.S. from 1645 to 1910. He talked about the challenges of standard OCR transcription of page images to text: dropped characters and misplaced entries. There were then the challenges of understanding the abbreviations in the birth and death dates for individuals and the limitations of off-the-shelf software to highlight important relations in the data.
Brian highlighted some facts derived from the data:
- Mortality in the first five years of life was very high
- Names of children within a family were often recycled if an earlier child had died very young
NYAI#7: #DataScience to Operationalize #ML (Matthew Russell) & Computational #Creativity (Dr. Cole)
Posted on November 22nd, 2016
11/22/2016 Risk, 43 West 23rd Street, NY 2nd floor
Speaker 1: Using Data Science to Operationalize Machine Learning – (Matthew Russell, CTO at Digital Reasoning)
Speaker 2: Top-down vs. Bottom-up Computational Creativity – (Dr. Cole D. Ingraham DMA, Lead Developer at Amper Music, Inc.)
Matthew Russell @DigitalReasoning spoke about understanding language using NLP, relationships among entities, and temporal relationships. For human language understanding, he views technologies such as knowledge graphs and document analysis as becoming commoditized. The only way to gain an advantage is to improve the efficiency of using ML: the KPI for data analysis is the number of experiments (each testing a hypothesis) that can be run per unit time. The key is to use tools such as:
- Vagrant – allows reproducible environment setup
- Jupyter Notebook – like a lab notebook
- Git – version control
- Automation
He wants highly repeatable experiments; the goal is to increase the number of experiments that can be conducted per unit time.
He then talked about using machines to read medical reports and determine the issues. Negatives can be extracted, but issues are harder to find. They use an ontology to classify entities.
He talked about experiments on models using ontologies. The usefulness of a fixed ontology depends on the content: the ontology of terms for anti-terrorism evolves over time and needs to be experimentally adjusted, while a medical ontology is probably the most static.
In the second presentation, Cole D. Ingraham @Ampermusic talked about top-down vs bottom-up creativity in the composition of music. Music differs from other audio forms since it has a great deal of very large structure as well as the smaller structure. ML does well at generating good audio on a small time frame, but Cole thinks it is better to apply theories from music to create the larger whole. This is a combination of
Top-down: novel & useful, rejects previous ideas – code driven, “hands on”, you define the structure
Bottom-up: data driven, “hands off” – you learn the structure
He then talked about music composition at the intersection of generation and analysis (of already-composed music): one can do one without the other, or one before the other.
To successfully generate new and interesting music, one needs to generate variance. Composing music using a purely probabilistic approach is problematic, as there is a lack of structure. He likes an approach similar to replacing words in a sentence with synonyms, which does not fundamentally change the meaning but still makes it different and interesting.
It’s better to work on deterministically defined variance than it is to weed out undesired results from nondeterministic code.
As an example he talked about WaveNet (a Google DeepMind project) whose input and output are raw audio. This approach works well for improving speech synthesis, but less well for music generation since there is no large-scale structural awareness.
Cole then talked about Amper, a web site that lets users create music with no experience required: fast, believable, collaborative.
They like a mix of top-down and bottom-up approaches:
- Want speed, but neural nets are slow
- Music has a lot of theory behind it, so it’s best to let the programmers code these rules
- Can change different levels of the hierarchical structure within music: style, mood, can also adjust specific bars
The runtime is written in Haskell – a functional language, so it’s great for music.
Listening to Customers as you develop, assembling a #genome, delivering food boxes
Posted on September 21st, 2016
09/21/2016 @FirstMark, 100 Fifth Ave, NY, 3rd floor
JJ Fliegelman @WayUp (formerly CampusJob) spoke about the development process used by their application, the largest marketplace for college students to find jobs. JJ talked about their development steps.
He emphasized the importance of speccing out ideas on what they should be building and talking to their users.
They use tools to stay in touch with their customers:
- HelpScout – see all support tickets. Get the vibe
- FullStory – DVR software – plays back video recordings of how users are using the software
They also put ideas in a repository using Trello.
To illustrate their process, he examined how they work to improve job-search relevance.
They look at impact per unit effort to measure value, and they do this across new features over time. They can prioritize and get multiple estimates; it’s a probabilistic measure.
Assessing impact – are people dropping off? Do people click on it? What are the complaints? They talk to experts using cold emails. They also cultivate a culture of educated guesses
Assess effort – get it wrong often and get better over time
They prioritize impact/effort with the least technical debt
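The impact-per-unit-effort ranking is simple to make concrete (the feature names and scores below are invented for illustration):

```python
# Prioritize features by estimated impact per unit of effort.
features = [
    {"name": "better search ranking", "impact": 8, "effort": 5},
    {"name": "email digests",         "impact": 4, "effort": 1},
    {"name": "redesign onboarding",   "impact": 9, "effort": 9},
]
ranked = sorted(features, key=lambda f: f["impact"] / f["effort"],
                reverse=True)
print([f["name"] for f in ranked])
# ['email digests', 'better search ranking', 'redesign onboarding']
```

Note how the highest-impact feature is not the top priority once effort is factored in.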
They Spec & Build – (product, architecture, kickoff) to get organized
They use Clubhouse as their project tracker: readable by humans.
The architecture spec should solve today’s problem, but look ahead. E.g., their initial architecture used WordNet and Elasticsearch, but they found Elasticsearch too slow, so they moved to a graph database.
Build – build as little as possible; prototype; adjust your plan
Deploy – they will deploy things that are not worse (e.g. a button that doesn’t work yet)
They do code reviews to avoid deploying bad code
Paul Fisher @Phosphorus (from Recombine – formerly focused on the fertility space and carrier screening, now emphasizing diagnostic DNA sequencing) talked about the processes they use to analyze DNA sequences. With the rapid development of laboratory techniques, it is now a computer science problem. They use Scala, Ruby, and Java.
Sequencers produce hundreds of short reads of 50 to 150 base pairs. They use a reference genome to align the reads, and they want multiple overlapping reads (read depth) to create a consensus sequence.
To lower cost and speed their analysis, they focus on particular areas to maximize their read depth.
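Consensus calling from overlapping reads can be sketched as a majority vote per position (a toy pileup on an invented sequence; real pipelines also handle quality scores, indels, and paired reads):

```python
from collections import Counter

def consensus(aligned_reads):
    """Call a consensus base at each position by majority vote over the
    reads covering it ('.' marks positions a read does not cover)."""
    length = max(len(r) for r in aligned_reads)
    result = []
    for i in range(length):
        counts = Counter(r[i] for r in aligned_reads
                         if i < len(r) and r[i] != ".")
        result.append(counts.most_common(1)[0][0])
    return "".join(result)

# Three short reads over the same region; the read depth lets the one
# sequencing error (G vs A at position 2) be outvoted.
reads = ["ACATG", ".CGTG", "ACAT."]
print(consensus(reads))  # ACATG
```

This is why maximizing read depth in the targeted regions improves call quality at lower cost.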
They use a variant viewer to understand variants between the person’s and the reference genome:
- SNPs – one base is changed – degree of pathogenicity varies
- Indels – insertions & deletions
- CNVs – copy variations
They use several different file formats: FASTQ, Bam/Sam, VCF
Current methods have evolved to use Spark, Parquet (a columnar storage format), and ADAM (which uses the Avro framework for nested collections).
They use Zeppelin to share documentation: documentation that you can run.
Finally, Andrew Hogue @BlueApron spoke about the challenges he faces as the CTO. These include
Demand forecasting – they use machine learning (random forests) to predict, per user, what they will order. Holidays are hard to predict. People order less lamb and avoid catfish; there was also a dip in orders, and in orders with meat, during Lent.
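A per-user demand forecast with a random forest can be sketched with scikit-learn on synthetic data (the features and effect sizes are invented for illustration; the holiday dip is built into the toy target, and the model is left to rediscover it):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical per-user features: weeks subscribed, past orders,
# and whether the delivery week contains a holiday.
n = 400
X = np.column_stack([
    rng.integers(1, 104, n),     # weeks subscribed
    rng.integers(0, 50, n),      # orders so far
    rng.integers(0, 2, n),       # holiday week? (0/1)
])
# Synthetic target: loyal users order more; holidays suppress orders.
y = 0.005 * X[:, 0] + 0.02 * X[:, 1] - 0.5 * X[:, 2] \
    + rng.normal(0, 0.1, n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Same user, non-holiday week vs holiday week:
pred = model.predict([[52, 30, 0], [52, 30, 1]])
```

The forest learns the holiday suppression from the data, which is the kind of per-user, per-week signal the forecast needs.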
Fulfillment – more than just inventory management since recipes change, food safety, weather, …
Subscription mechanics – weekly engagement with users creates opportunities to deepen engagement. Frequent communication can drive engagement or churn. A/B experiments need more time to run given the weekly cadence.
BlueApron runs 3 fulfillment centers for their weekly food deliveries (NJ, Texas, CA), shipping 8 million boxes per month.