Structured and Scalable Probabilistic Topic Models
Posted on March 24th, 2017
Data Science Institute Colloquium
03/23/2017 Schapiro Hall (CEPSR), Davis Auditorium @Columbia University
John Paisley, Assistant Professor of Electrical Engineering, spoke about models that extract topics and their structure from text. He first talked about topic models, in which the topics (distributions over words) are the global variables shared across documents. In this bag-of-words approach, the topic proportions are the local variables specific to each document.
Latent Dirichlet Allocation captures the frequency of each word. John also noted that #LDA can be used for things other than topic modeling:
- Capture assumptions with new distributions – is the new thing different?
- Embedded into more complex model structures
Next he talked about moving beyond the “flat” LDA model in which
- No structural dependency among the topics – e.g. not a tree model
- All combinations of topics are a priori equally probable
to a hierarchical topic model in which topics are placed as nodes in a tree structure, with more general topics at the root and inner branches. He uses #Bayesian inference to grow the tree (assuming an infinite number of branches coming out of each node), with each document a subtree within the overall tree. This approach can be further extended to a Markov chain which models the transitions between each pair of topics.
He next showed how the linkages can be computed using Bayesian inference to calculate posterior probabilities for both local and global variables: the joint likelihood of the global and local variables can be factored into a product which is conditional on the probabilities of the global variables.
He next compared the speed-accuracy trade-off for three methods:
- Batch inference – ingest all documents at once, so it's very slow, but eventually optimal
  - optimize the probability estimates for the local variables across documents (could be very large)
  - optimize the probability estimates for the global variables
- Stochastic inference – ingest small subsets of the documents
  - optimize the probability estimates for the local variables for the documents in the current subset
  - take a gradient step to improve the probability estimates for the global variables
  - repeat using the next subset of the documents
- MCMC – should be more accurate, but #MCMC is incredibly slow, so it can only be run on a subset of the documents
John showed that the stochastic inference method converges quickest to an accurate out-of-sample model.
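The batch vs. stochastic pattern can be sketched on a toy problem. This is not LDA itself – fitting a single global mean stands in for the global topic variables, and the minibatch size and step size are arbitrary choices:

```python
import random

# Toy stand-in for stochastic inference: estimate one global
# parameter (a mean) instead of LDA's global topic variables.
random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(10_000)]

# Batch inference: use every "document" at once – slow on huge corpora,
# but the answer is exact.
batch_estimate = sum(data) / len(data)

# Stochastic inference: repeatedly take a small noisy step computed from
# a minibatch, then move on to the next subset.
theta, step = 0.0, 0.05
for _ in range(2_000):
    minibatch = random.sample(data, 50)
    gradient = sum(x - theta for x in minibatch) / len(minibatch)
    theta += step * gradient

print(batch_estimate, theta)  # both approach the true mean of 3.0
```

The stochastic loop touches only 50 points per step, which is why it scales to corpora the batch pass cannot.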
Critical Approaches to #DataScience & #MachineLearning
Posted on March 18th, 2017
3/17/2017 @Hunter College, 68th & Lexington Ave, New York, Lang Theater
Geetu Ambwani @HuffingtonPost @geetuji spoke about how the Huffington Post is looking at data as a way around the filter bubble, which separates individuals from views that are contrary to their previously held beliefs. Filter bubbles are believed to be a major reason for the current levels of polarization in society.
She talked about ways that the media can respond to this confirmation bias:
- Show opposing point of view
- Show people their bias
- Show source credibility
For instance, Chrome and Buzzfeed have tools that will insert opposing points of view into your news feed. Flipfeed enables you to easily load another feed. AlephPost clusters articles and color-codes them to indicate the source's vantage point. However, showing people opposing views can backfire.
Second, Read Across the Spectrum will show you your biases. Politico will show you how blue or red you are by indicating the color of your information sources.
Third, one can show source credibility and where it lies on the political spectrum
However, there is still a large gap between what is produced by the media and what consumers want. Also, this does not remove the problem that ad dollars are awarded for “engagement,” which means that portals are incentivized to continue delivering what the reader wants.
Next, Justin Hendrix @NYC Media Lab (consortium of universities started by the city of NY) talked about emerging media technologies. Examples were
- Vidrovr – teaches computers how to watch video and produces searchable tags.
- Data Selfie project – from The New School; see the data which Facebook has on us. A Chrome extension with 100k downloads in the first week.
- Braiq – connects the mind with the on-board self-driving software in cars, building software that is more reactive to the needs and wants of the passenger. Technology in the headrest and other inputs talk to the self-driving AI.
The follow up discussion covered a wide range of topics including
- Adtech fraud is well known, but no one has the incentive to address it. Fake audiences – bots clicking on sites.
- Data sources are readily available, led by the Twitter and Facebook APIs. Look on GitHub for open-source code for downloading data.
- Was the 20th century an aberration as to how information was disseminated? We might just be going back to a world with pools of information.
- What are the limits on what points of view any media company is willing to explore?
- What is the future of work and the social contract as jobs disappear?
Intro to #DeepLearning using #PyTorch
Posted on February 21st, 2017
02/21/2017 @ NYU Courant Institute (251 Mercer St, New York, NY)
Soumith Chintala @Facebook first talked about trends in the cutting edge of machine learning. His main point was that the world is moving from fixed agents to dynamic neural nets in which agents restructure themselves over time. Currently, the ML world is dominated by static datasets + static model structures which learn offline and do not change their structure without human intervention.
He then talked about PyTorch which is the next generation of ML tools after Lua #Torch. In creating PyTorch they wanted to keep the best features of LuaTorch, such as performance and extensibility while eliminating rigid containers and allowing for execution on multiple-GPU systems. PyTorch is also designed so programmers can create dynamic neural nets.
Other features include
- Kernel fusion – take several operations and fuse them into a single kernel
- Order of execution – reorder operations for faster execution
- Automatic work placement when you have multiple GPUs
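The define-by-run idea behind dynamic neural nets can be illustrated with a minimal, hand-rolled reverse-mode autodiff sketch (this is not PyTorch's actual API): the graph of operations is rebuilt on every forward pass, so ordinary control flow can change the network's structure per input.

```python
class Var:
    """A toy reverse-mode autodiff value. The graph of parents is built
    on the fly during the forward pass, as in define-by-run frameworks."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x = Var(3.0)
# Control flow decides the graph per input – a "dynamic" net in miniature.
if x.value > 0:
    y = x * x + x          # y = x^2 + x, so dy/dx = 2x + 1 = 7 at x = 3
else:
    y = x * x
y.backward()
print(y.value, x.grad)  # 12.0 7.0
```

PyTorch does the same bookkeeping at tensor granularity, which is what lets a net restructure itself between examples.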
PyTorch is available for download on http://pytorch.org and was released Jan 18, 2017.
Currently, PyTorch runs only on Linux and OSX.
#ComputerScience and #DigitalHumanities
Posted on December 8th, 2016
PRINCETON #ACM / #IEEE-CS CHAPTERS DECEMBER 2016 JOINT MEETING
12/08/2016 @Princeton University Computer Science Building, Small Auditorium, Room CS 105, Olden and William Streets, Princeton NJ
Brian Kernighan @Princeton University spoke about how computers can assist in understanding research topics in the humanities.
He started by presenting examples of web sites with interactive tools for exploring historical material
- Explore a northern and a southern town during the Civil War: http://valley.lib.virginia.edu/
- Expedia for a traveler across ancient Rome: http://orbis.stanford.edu/
- The court records in London from 1674-1913: https://www.oldbaileyonline.org/
- Hemingway and other literary stars in Paris from the records of Sylvia Beach
Brian then talked about the challenges of converting archival data: digitizing, meta-tagging, storing, querying, presenting results, and making it available to the public.
In preparation for teaching a class this fall on digital humanities, he talked about his experience extracting information from a genealogy of the descendants of Nicholas Cady (https://archive.org/details/descendantsofnic01alle) in the U.S. from 1645 to 1910. He talked about the challenges of standard OCR transcription of page images to text: dropped characters and misplaced entries. There were then the challenges of understanding the abbreviations in the birth and death dates for individuals and the limitations of off-the-shelf software for highlighting important relations in the data.
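As a flavor of the parsing involved, here is a sketch of expanding date abbreviations; the abbreviations and entry format are hypothetical, not the actual conventions of the Cady volume:

```python
import re

# Hypothetical abbreviations of the kind found in old genealogies;
# the real volume's conventions may differ.
ABBREV = {"b.": "born", "d.": "died", "m.": "married"}

def expand_entry(entry: str) -> str:
    """Expand b./d./m. abbreviations so dates can be indexed consistently."""
    pattern = re.compile(r"\b(b|d|m)\.")
    return pattern.sub(lambda match: ABBREV[match.group(0)], entry)

print(expand_entry("b. 1645, m. 1668, d. 1710"))
# born 1645, married 1668, died 1710
```

Real OCR output would also need fuzzy matching to recover dropped characters before a rule like this could fire reliably.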
Brian highlighted some facts derived from the data:
- Mortality in the first five years of life was very high
- Names of children within a family were often recycled if an earlier child had died very young
NYAI#7: #DataScience to Operationalize #ML (Matthew Russell) & Computational #Creativity (Dr. Cole)
Posted on November 22nd, 2016
11/22/2016 Risk, 43 West 23rd Street, NY 2nd floor
Speaker 1: Using Data Science to Operationalize Machine Learning – (Matthew Russell, CTO at Digital Reasoning)
Speaker 2: Top-down vs. Bottom-up Computational Creativity – (Dr. Cole D. Ingraham DMA, Lead Developer at Amper Music, Inc.)
Matthew Russell @DigitalReasoning spoke about understanding language using NLP, relationships among entities, and temporal relationships. For human language understanding, he views technologies such as knowledge graphs and document analysis as becoming commoditized. The only way to get an advantage is to improve the efficiency of using ML: his KPI for data analysis is the number of experiments (each testing a hypothesis) that can be run per unit time. The key is to use tools such as:
- Vagrant – allows repeatable environment setup
- Jupyter Notebook – like a lab notebook
- Git – version control
- Automation
He wants highly repeatable experiments, with the goal of increasing the number of experiments that can be conducted per unit time.
He then talked about using machines to read medical reports and determine the issues. Negatives can be extracted, but issues are harder to find. They use an ontology to classify entities.
He talked about experiments on models using ontologies. Whether a fixed ontology works depends on the content: the ontology of terms for anti-terrorism evolves and needs to be experimentally adjusted over time, while medical ontologies are probably the most static.
In the second presentation, Cole D. Ingraham @Ampermusic talked about top-down vs bottom-up creativity in the composition of music. Music differs from other audio forms since it has a great deal of very large structure as well as the smaller structure. ML does well at generating good audio on a small time frame, but Cole thinks it is better to apply theories from music to create the larger whole. This is a combination of
Top-down: novel & useful, rejects previous ideas – code driven, “hands on”, you define the structure
Bottom-up: data driven – “hands off”, you learn the structure
He then talked about music composition at the intersection of generation vs. analysis (of already composed music) – one can be done without the other, or one before the other.
To successfully generate new and interesting music, one needs to generate variance. Composing music using a purely probabilistic approach is problematic, as there is a lack of structure. He likes an approach similar to replacing words with synonyms that do not fundamentally change the meaning of a sentence but still make it different and interesting.
It’s better to work on deterministically defined variance than it is to weed out undesired results from nondeterministic code.
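A sketch of that idea, with a made-up melody and a made-up "synonym" table of interchangeable notes; the seed makes the variance deterministic and reproducible rather than something to be weeded out afterwards:

```python
import random

# Hypothetical melody as scale degrees, with a hypothetical table of
# "synonym" notes assumed to be functionally interchangeable.
melody = [1, 3, 5, 3, 1]
synonyms = {1: [1, 8], 3: [3, 10], 5: [5, 12]}

def vary(melody, seed):
    """Deterministically defined variance: a seeded choice among allowed
    substitutes, so the structure survives and runs are reproducible."""
    rng = random.Random(seed)
    return [rng.choice(synonyms.get(note, [note])) for note in melody]

print(vary(melody, seed=7))
```

Because every substitute comes from the allowed table, no post-hoc filtering of bad outputs is needed.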
As an example he talked about WaveNet (a Google DeepMind project), which takes raw audio as input and outputs raw audio. This approach works well for improving speech synthesis, but less well for music generation, as there is no large-scale structural awareness.
Cole then talked about Amper, a web site that lets users create music with no experience required: fast, believable, collaborative.
They like a mix of top-down and bottom-up approaches:
- Want speed, but neural nets are slow
- Music has a lot of theory behind it, so it’s best to let the programmers code these rules
- Can change different levels of the hierarchical structure within music: style, mood, can also adjust specific bars
The runtime is written in Haskell – a functional language, so it's great for music.
Listening to Customers as you develop, assembling a #genome, delivering food boxes
Posted on September 21st, 2016
09/21/2016 @FirstMark, 100 Fifth Ave, NY, 3rd floor
JJ Fliegelman @WayUp (formerly CampusJob) spoke about the development process used for their application, which is the largest marketplace for college students to find jobs. JJ walked through their development steps.
He emphasized the importance of specing out ideas on what they should be building and talking to your users.
They use tools to stay in touch with their customers:
- HelpScout – see all support tickets. Get the vibe
- FullStory – DVR software – plays back video recordings of how users are using the software
They also put ideas in a repository using Trello.
To illustrate their process, he examined how they work to improve job search relevance.
They look at impact per unit of effort to measure the value of new features over time. They can then prioritize across multiple estimates. It's a probabilistic measure.
Assessing impact – are people dropping off? Do people click on it? What are the complaints? They talk to experts via cold emails and cultivate a culture of educated guesses.
Assessing effort – they get it wrong often and get better over time.
They prioritize by impact/effort with the least technical debt.
They spec & build (product, architecture, kickoff) to get organized.
They use Clubhouse as their project tracker: readable by humans.
Architecture specs solve today's problem but look ahead. E.g., their initial architecture used WordNet and Elasticsearch, but they found Elasticsearch too slow, so they moved to a graph database.
Build – build as little as possible; prototype; adjust your plan
Deploy – they will deploy things that are not worse (e.g. a button that doesn’t work yet)
They do code reviews to avoid deploying bad code
Paul Fisher @Phosphorus (from Recombine – formerly focused on the fertility space with carrier screening, now emphasizing diagnostic DNA sequencing) talked about the processes they use to analyze DNA sequences. With the rapid development of laboratory techniques, it is now a computer science problem. They use Scala, Ruby, and Java.
Sequencers produce hundreds of short reads of 50 to 150 base pairs. They use a reference genome to align the reads, and want multiple reads (depth of reads) to create a consensus sequence.
To lower cost and speed their analysis, they focus on particular areas to maximize their read depth.
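The consensus idea can be sketched with a toy pileup, idealized so that every read aligns perfectly to the same short region:

```python
from collections import Counter

# Toy pileup: four reads covering the same region (an idealized alignment).
reads = ["ACGT", "ACGT", "ACTT", "ACGT"]

def consensus(reads):
    """Majority vote at each position; higher read depth makes
    each call more reliable against sequencing errors."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*reads))

print(consensus(reads))  # ACGT
```

Here the lone T at position 3 of the third read is outvoted 3-to-1, which is exactly why maximizing read depth in the regions of interest pays off.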
They use a variant viewer to understand the variants between a person's genome and the reference genome:
- SNPs – one base is changed – degree of pathogenicity varies
- Indels – insertions & deletions
- CNVs – copy number variations
They use several different file formats: FASTQ, BAM/SAM, VCF.
Current methods have evolved to use Spark, Parquet (a columnar storage format), and ADAM (which uses the Avro framework for nested collections).
They use Zeppelin to share documentation: documentation that you can run.
Finally, Andrew Hogue @BlueApron spoke about the challenges he faces as the CTO. These include
Demand forecasting – they use machine learning (random forests) to predict, per user, what they will order. Holidays are hard to predict. People order less lamb and avoid catfish. There was also a dip in orders, and in orders with meat, during Lent.
Fulfillment – more than just inventory management, since it also involves changing recipes, food safety, weather, …
Subscription mechanics – weekly engagement with users, and thus opportunities to deepen engagement. Frequent communications can drive engagement or churn. A/B experiments need more time to run.
Blue Apron runs 3 fulfillment centers (NJ, Texas, CA) for their weekly food deliveries, shipping 8 million boxes per month.
#Unsupervised Learning (Soumith Chintala) & #Music Through #ML (Brian McFee)
Posted on July 26th, 2016
07/25/2016 @Rise, 28 West 24th Street, NY, 2nd floor
Two speakers spoke about machine learning.
In the first presentation, Brian McFee @NYU spoke about using ML to understand the patterns of beats in music. He graphs beats identified by Mel-frequency cepstral coefficients (#MFCCs).
Random walk theory combines two representations of points in the graph.
- Local: in the graph, each point is a beat; edges connect adjacent beats and are weighted by MFCC similarity.
- Repetition: link k-nearest neighbors by repetition (same sound). Weight by similarity (k is set to the square root of the number of beats).
- Combination: A = mu * local + (1 - mu) * repetition; optimize mu for a balanced random walk, so that the probability of a local move matches the probability of a repetition move over all vertices. Use a least-squares optimization to find mu so the two parts of the equation make equal contributions, across all points, to the value of A.
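A simplified reading of that balancing step, with made-up aggregate edge weights (the talk's version balances contributions point-by-point via least squares; collapsing it to totals gives a closed form):

```python
# Made-up aggregate edge weights for the two graphs over the same beats.
local_weight_sum = 12.0       # total MFCC-similarity edge weight
repetition_weight_sum = 4.0   # total k-NN repetition edge weight

# Choose mu so the two terms of A = mu * local + (1 - mu) * repetition
# contribute equally in aggregate: mu * S_local = (1 - mu) * S_rep.
mu = repetition_weight_sum / (local_weight_sum + repetition_weight_sum)
print(mu)  # 0.25, since 0.25 * 12 == 0.75 * 4
```

The denser local graph gets the smaller weight, so a random walk is equally likely to take a local step as a repetition step.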
The points are then partitioned by spectral clustering: compute the normalized Laplacian L, take the bottom eigenvectors (which encode component membership for each beat), and cluster the eigenvectors Y of L to reveal the structure. This gives a hierarchical decomposition of the time series: with m = 1 you get the entire song; m = 2 gets the two components of the song; and as you add more eigenvectors, the number of segments within the song increases.
Brian then showed how this segmentation can create compelling visualizations of the structure of a song.
The Python code used for this analysis is available in the msaf library.
He has worked with convolutional neural nets, but finds them better at handling individual notes within a song (by contrast, rhythm plays out over a longer time period).
In the second presentation, Soumith Chintala talked about #GenerativeAdversarialNetworks (GAN).
Generative networks consist of a #NeuralNet “generator” that produces an image, taking as input a high-dimensional (e.g. 100-dimensional) vector of random noise. In a generative adversarial network, the generator creates an image which is optimized over a loss function that evaluates “does it look real?”. Whether the image looks real is decided by a second neural net, the “discriminator”, which tries to pick the fake image out of a set of real images plus the output of the generator.
Both the generator and discriminator are trained by gradient descent to optimize their individual performance: the generator plays the min side and the discriminator the max side of a minimax game. The process minimizes the Jensen-Shannon divergence.
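The Jensen-Shannon divergence the GAN objective implicitly minimizes can be computed directly for small discrete distributions (numbers here are made up):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and zero only when p == q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = [0.5, 0.5, 0.0]   # made-up "data" distribution
fake = [0.1, 0.4, 0.5]   # made-up "generator" distribution
print(js_divergence(real, fake))  # positive while the generator is off
print(js_divergence(real, real))  # 0.0 when the generator matches the data
```

Training drives the second case: as the generator's distribution approaches the data distribution, the divergence falls to zero and the discriminator can no longer tell them apart.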
Soumith then talked about extensions to GAN. These include
Class-conditional GANs – take noise + the class of the sample as input to the generator.
Video-prediction GANs – predict what happens next given the previous 2 or 3 frames. They add an MSE loss (in addition to the discriminator's classification loss) which compares what happened to what was predicted.
Deep convolutional GANs – try to make the learning more stable by using a CNN.
Text-conditional GANs – input = noise + text. They use an LSTM model on the text input and generate images.
Disentangled representations – InfoGAN – input = random noise + categorical variables.
GANs are still unstable, especially for larger images, so work to improve them includes
- Feature matching – take groups of features instead of just the whole image.
- Minibatch learning
No one has successfully used GANs for text-in to text-out.
The meeting concluded with a teaser for Watchroom – a crowd-funded movie on AI and VR.
Automatically scalable #Python & #Neuroscience as it relates to #MachineLearning
Posted on June 28th, 2016
06/28/2016 @Rise, 43 West 23rd Street, NY, 2nd floor
Braxton McKee (@braxtonmckee) @Ufora first spoke about the challenges of creating a version of Python (#Pyfora) that naturally scales to take advantage of the hardware, handling parallelism as the problem grows.
Braxton presented an example in which we compute the minimum distance from a set of target points to a larger universe of points, based on their Cartesian coordinates. This is easily written for small problems, but the computation needs to be optimized when computing this value across many CPUs.
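The serial version of that example is a few lines; the talk's point is what happens when the targets and universe get large and this loop must be split across CPUs (uses `math.dist`, available since Python 3.8):

```python
import math

def min_distances(targets, universe):
    """Brute force: for each target, the distance to its nearest point in
    the universe. This is the easy serial version; the hard part is
    splitting this work across CPUs when both sets get large."""
    return [min(math.dist(t, p) for p in universe) for t in targets]

targets = [(0.0, 0.0), (5.0, 5.0)]
universe = [(1.0, 0.0), (4.0, 4.0), (10.0, 10.0)]
distances = min_distances(targets, universe)
print(distances)  # nearest neighbors are (1, 0) and (4, 4)
```

The right way to parallelize it depends on whether `targets` or `universe` dominates, which is exactly the allocation problem dynamic rebalancing sidesteps.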
However, the allocation across CPUs depends on the number of targets relative to the size of the point universe. Instead of trying to solve this analytically, they use a #DynamicRebalancing strategy that splits the task and adds resources to the subtasks that create bottlenecks.
This approach solves many resource allocation problems, but still faces challenges:
- Nested parallelism – they look for parallelism within the code, find bottlenecks at the top level of parallelism, split the task into subtasks at that level, and so on.
- Data that do not fit in memory – they break tasks into smaller tasks; each task also knows which other caches hold data, so the data can be accessed directly without going to slower main memory.
- Different types of architectures (such as GPUs) require different types of optimization.
- The optimizer cannot look inside Python packages, so it cannot optimize a bottleneck within a package.
Pyfora itself:
- is a just-in-time compiler that moves stack frames from machine to machine and senses how to take advantage of parallelism
- tracks what data a thread is using
- dynamically schedules threads and data
- takes advantage of immutability, which allows the compiler to assume that functions do not change over time, so it can look inside a function when optimizing execution
- is written on top of another language, which allows for the possibility of porting the method to other languages
In the second presentation, Jeremy Freeman @Janelia.org spoke about the relationship between neuroscience research and machine learning models. He first talked about the early work on understanding the function of the visual cortex.
Findings by Hubel & Wiesel in 1959 set the foundation for visual processing models for the past 40 years. They found that individual neurons in the V1 area of the visual cortex respond to the orientation of lines in the visual field. These inputs feed neurons that detect more complex features, such as edges, moving lines, etc.
Others also considered systems with higher-level recognition and how to train such systems. These include
- Perceptrons – Rosenblatt, 1957
- Neocognitrons – Fukushima, 1980
- Hierarchical learning machines – LeCun, 1985
- Back propagation – Rumelhart, 1986
His doctoral research looked at the activity of neurons in the V2 area. They found they could generate higher-order patterns that some neurons discriminate among.
But in 2012 there was a jump in the performance of neural nets, led by the University of Toronto.
By 2014, some neural network algorithms performed better than humans and primates, especially in the area of image processing. This has led to many advances, such as Google's DeepDream, which combines images and texture to create an artistic hybrid image.
Recent scientific research allows one to look at thousands of neurons simultaneously. He also talked about some of his current research, which uses “tactile virtual reality” to examine neural activity as a mouse explores a maze (the mouse walks on a ball that senses its steps as it learns the maze).
Jeremy also spoke about model-free episodic control for complex sequential tasks requiring memory and learning. ML research has created models such as LSTMs and Neural Turing Machines which retain state representations. Graham Taylor has looked at neural feedback modulation using gates.
He also notes that there are similar functionalities between the V1 area in the visual cortex, the A1 auditory area, and the S1, tactile area.
To find out more, he suggested visiting his GitHub site, Freeman-lab, and looking at the web site neurofinder.codeneuro.org.
DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.
Posted on June 18th, 2016
06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)
The four speakers were
- Nitay Joffe, Founder and CTO of ActionIQ (next-generation data platform for marketing and consumer data)
- Adam Kanouse, CTO of Narrative Science (transforms data into meaningful and insightful narratives)
- Neha Narkhede, Founder and CTO of Confluent (real-time data platform built around Apache Kafka)
- Christopher Nguyen, Founder and CEO of Arimo (data intelligence platform)
Adam @NarrativeScience talked about how people with different personalities and jobs may require/prefer different takes on the same data. His firm ingests data and has systems to generate natural language reports customized to the subject area and the reader’s needs.
They currently develop stories with the guidance of experts, but eventually will move to machine learning to automate new subject areas.
Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform which collects data and allows access to these data in real time.
Advanced #DeepLearning #NeuralNets: #TimeSeries
Posted on June 16th, 2016
06/15/2016 @Qplum, 185 Hudson Street, Jersey City, NJ, suite 1620
Sumit then broke the learning process into two steps: feature extraction and classification. Starting with raw data, the feature extractor is the deep learning model that prepares the data for the classifier, which may be a simple linear model or a random forest. In supervised training, errors in the predictions output by the classifier are fed back into the system using back propagation to tune the parameters of the feature extractor and the classifier.
In the remainder of the talk Sumit concentrated on how to improve the performance of the feature extractor.
In general text classification (unlike image or speech recognition), the input can be very long and variable in length. In addition, analysis of text by general deep learning models
- does not capture order of words or predictions in time series
- can handle only small sized windows or the number of parameters explodes
- cannot capture long term dependencies
So the feature extractor is cast as a time-delay neural network (#TDNN). In a TDNN, the text is viewed as a string of words. Kernel matrices (usually from 3 to 5 words long) are defined which compute dot products of the kernel weights with the words in a contiguous block of text. The kernel matrix is shifted one word and the process is repeated until all words are processed. A second kernel matrix creates another set of features, and so forth for a third kernel, etc.
These features are then pooled using the mean or max of the features. This process is repeated to get additional features. Finally a point-wise non-linear transformation is applied to get the final set of features.
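The kernel-shifting and pooling steps above can be sketched with made-up word vectors and kernel weights (one width-3 kernel producing one feature per position, followed by max-pooling):

```python
# Made-up word vectors (4 words, dimension 2) and one width-3 kernel.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
kernel = [[0.5, -0.5], [0.25, 0.25], [-0.5, 0.5]]

def tdnn_features(words, kernel):
    """Slide the kernel one word at a time, taking the dot product of the
    kernel weights with each contiguous block of word vectors."""
    k = len(kernel)
    feats = []
    for i in range(len(words) - k + 1):
        window = words[i:i + k]
        feats.append(sum(w * x
                         for row, vec in zip(kernel, window)
                         for w, x in zip(row, vec)))
    return feats

feats = tdnn_features(words, kernel)
pooled = max(feats)  # max-pooling over positions
print(feats, pooled)
```

A real model would use many kernels (one per feature map) and learned weights; pooling is what lets variable-length input end up as a fixed-size feature vector.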
Unlike traditional neural network structures, these methods are new, so no one has studied what is revealed in the first layer, second layer, etc. Theoretical work is also lacking on the optimal number of layers for a text sample of a given size.
Historically, #TDNNs have struggled with a series of problems, including convergence issues, so recurrent neural networks (#RNNs) were developed, in which the encoder looks at the latest data point along with its own previous output. One example is the Elman network, in which each feature is the weighted sum of the kernel function output (one encoder is used for all points on the time series) and the previously computed feature value. Training is conducted as in a standard #NN using back propagation through time, with the gradient accumulated over time before the encoder is re-parameterized. But RNNs have a lot of issues:
1. exploding or vanishing gradients – depending on the largest eigenvalue of the recurrent weights
2. cannot capture long-term dependencies
3. training is somewhat brittle
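A scalar sketch of the Elman recurrence makes issue 1 concrete: with a recurrent weight above 1 (the "largest eigenvalue" in the one-dimensional case), an initial signal blows up step by step; below 1, it would vanish just as fast:

```python
def elman_features(inputs, w_in, w_rec):
    """Elman-style recurrence (scalar toy): each feature mixes the current
    input with the previously computed feature, reusing one encoder."""
    h, feats = 0.0, []
    for x in inputs:
        h = w_in * x + w_rec * h
        feats.append(h)
    return feats

# An impulse input; with the recurrent weight above 1 the signal explodes.
print(elman_features([1.0, 0.0, 0.0, 0.0], w_in=1.0, w_rec=2.0))
# [1.0, 2.0, 4.0, 8.0]
```

Gradients flowing backwards through time pick up the same repeated factor, which is exactly the exploding/vanishing problem LSTM gates were designed to tame.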
The fix is called long short-term memory (#LSTM; see Hochreiter & Schmidhuber, 1997), which has additional memory “cells” to store short-term activations, as well as additional gates to alleviate the vanishing gradient problem.
Now each encoder is made up of several parts, as shown in his slides. It can also have a forget gate that turns off all the inputs, and can peep back at previous values of the memory cell. At Facebook, NLP, speech recognition, and vision are all users of LSTM models.
LSTM models, however, still don't have a long-term memory. Sumit talked about creating memory networks, which take a story and store its key features in a memory cell. A query runs against the memory cell and then concatenates the output vector with the text; a second query will retrieve the memory.
He also talked about using the dropout method to fight overfitting: cells randomly determine whether a signal is transmitted to the next layer.
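A sketch of that idea with made-up activations, in the common "inverted dropout" formulation; rescaling survivors by 1/(1-p) keeps the expected activation unchanged, though this may not be the exact variant Sumit described:

```python
import random

def dropout(activations, p, rng):
    """Inverted dropout: zero each activation with probability p and
    rescale the survivors by 1/(1-p) so the expected value is unchanged."""
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

layer = [0.8, 0.1, 0.5, 0.9]
dropped = dropout(layer, p=0.5, rng=random.Random(0))
print(dropped)  # each entry is either 0.0 or doubled
```

At test time no units are dropped and, thanks to the rescaling during training, no correction factor is needed.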
Autoencoders can be used to pretrain the weights within the NN, to avoid creating solutions that are only locally optimal instead of globally optimal.
[Many of these methods are similar in spirit to existing methods. For instance, kernel functions in RNNs are very similar to moving-average models in technical trading: the different features correspond to averages over different time periods, and higher-level features correspond to crossovers of the moving averages.
The dropout method is similar to the techniques used in random forests to avoid overfitting.]