Beyond Big: Merging Streaming & #Database Ops into a Next-Gen #BigData Platform
Posted on April 13th, 2017
04/13/2017 @Thoughtworks, 99 Madison Ave, New York, 15th floor
Amir Halfon, VP of Strategic Solutions, @iguazio talked about methods for speeding up analytics linked to a large database. He started by saying that a traditional software stack accessing a db was designed to minimize the time taken to access slow disk storage. This resulted in layers of software. Amir said that with modern data access and db architecture, processing is accelerated by a unified data engine that eliminates many of the layers. This also allows for generic access to data stored in many different formats and for a record-by-record security protocol.
To simplify development they only use AWS and only interface with Kafka, Hadoop, Spark. Their product is not a virtualization layer (which eventually reaches a speed limit); it does the actual storage.
Another important method is “predicate pushdown”, as in ‘select … where … <predicate>’. Usually all data are retrieved and then culled; if the predicate is instead pushed down to the storage layer, only the relevant data are retrieved. This is also known as an “offload engine”.
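The difference between culling after retrieval and pushing the predicate down can be sketched with Python's built-in sqlite3 module. This is only an illustration of the idea, not iguazio's engine; the `trades` table, its columns, and the helper names are made up for the example.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table with 1000 rows, 100 of which have symbol 'ABC'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER)")
    rows = [(i, "ABC" if i % 10 == 0 else "XYZ", i) for i in range(1000)]
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)
    return conn

def fetch_then_filter(conn, symbol):
    """No pushdown: retrieve every row, then cull in the application."""
    all_rows = conn.execute("SELECT id, symbol, qty FROM trades ORDER BY id").fetchall()
    matching = [row for row in all_rows if row[1] == symbol]
    return matching, len(all_rows)  # second value = rows moved out of the store

def pushed_down(conn, symbol):
    """Pushdown: the store evaluates the predicate and returns only matches."""
    matching = conn.execute(
        "SELECT id, symbol, qty FROM trades WHERE symbol = ? ORDER BY id", (symbol,)
    ).fetchall()
    return matching, len(matching)

conn = build_demo_db()
late, late_moved = fetch_then_filter(conn, "ABC")
early, early_moved = pushed_down(conn, "ABC")
print(late == early, late_moved, early_moved)
```

Both functions return the same rows, but the first moves all 1000 rows out of the store while the second moves only the 100 that match.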
MapR is a competitor using the HDFS file system, as opposed to rebuilding the system from scratch.
#Self-learned relevancy with Apache Solr
Posted on March 31st, 2017
03/30/2017 @ Architizer, 1 Whitehall Street, New York, NY, 10th Floor
Trey Grainger @ Lucidworks covered a wide range of topics involving search.
He first reviewed the concept of an inverted index in which terms are extracted from documents and placed in an index which points back to the documents. This allows for fast searches of single terms or combinations of terms.
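A minimal sketch of an inverted index in Python, assuming whitespace tokenization and lowercasing (Solr's real analysis chain is far richer). The function names and the sample documents are illustrative only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "solr search engine",
    2: "search relevancy tuning",
    3: "graph database engine",
}
index = build_inverted_index(docs)
print(search_all(index, "search"))         # documents 1 and 2
print(search_all(index, "search engine"))  # document 1 only
```

A single-term lookup is one dictionary access, and a multi-term query is an intersection of the per-term document sets.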
Next Trey covered classic relevancy scores emphasizing
tf-idf = how well a term describes the document * how important the term is overall
He noted, however, that tf-idf’s value may be limited since it does not make use of domain-specific knowledge.
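As a rough sketch of the tf-idf product, the snippet below uses one common textbook variant: term frequency as a share of the document's tokens and inverse document frequency as log(N / df). This exact formula is an assumption; it is not necessarily the one Solr or the talk used.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """tf (share of the document's tokens) times idf (log of N over document frequency)."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # the term appears nowhere in the corpus
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["solr", "search", "engine"],
    ["search", "relevancy"],
    ["graph", "database", "engine"],
]
rare = tf_idf("solr", corpus[0], corpus)      # appears in 1 of 3 documents
common = tf_idf("search", corpus[0], corpus)  # appears in 2 of 3 documents
print(rare > common)
```

Both terms occur once in the first document, so the rarer term "solr" scores higher purely because of its idf.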
Trey then talked about reflected intelligence = self-learning search which uses
- Collaboration – how have others interacted with the system
- Context – information about the user
He said this method increases relevance by boosting items that are highly requested by others. Since the items boosted are those currently relevant to others, this allows the method to adapt quickly without need for manual curation of items.
Next he talked about semantic search, which uses an understanding of terms in the domain.
(Solr can connect to an RDF database to leverage an ontology). For instance, one can run word2vec to extract terms and phrases for a query and then determine a set of keywords/phrases to best match the query to the contents of the db.
Also, querying a semantic knowledge graph can expand the search by traversing to other relevant terms in the db.
Structured and Scalable Probabilistic Topic Models
Posted on March 24th, 2017
Data Science Institute Colloquium
03/23/2017 Schapiro Hall (CEPSR), Davis Auditorium @Columbia University
John Paisley, Assistant Professor of Electrical Engineering, spoke about models to extract topics and their structure from text. He first talked about topic models in which global variables (in this case words) were extracted from documents. In this bag-of-words approach, the topic proportions were the local variables specific to each document, while the words were common across documents.
Latent Dirichlet Allocation captures the frequency of each word. John also noted that #LDA can be used for things other than topic modeling.
- Capture assumptions with new distributions – is the new thing different?
- Embedded into more complex model structures
Next he talked about moving beyond the “flat” LDA model in which
- No structural dependency among the topics – e.g. not a tree model
- All combinations of topics are a priori equally probable
He moves instead to a hierarchical topic model in which words are placed as nodes in a tree structure, with more general topics at the root and inner branches. He uses #Bayesian inference to start the tree (assume an infinite number of branches coming out of each node) with each document a subtree within the overall tree. This approach can be further extended to a Markov chain which shows the transitions between each pair of words.
He next showed how the linkages can be computed using Bayesian inference to calculate posterior probabilities for both local and global variables: The joint likelihood of the global and local variables can be factored into a product which is conditional on the probabilities of the global variables.
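One standard way to write such a factorization is given below. The notation is an assumption and is not taken from the talk: \beta stands for the global variables, z_d for the local variables of document d, and w_d for that document's observed words, with D documents in total.

```latex
p(\beta, z_{1:D}, w_{1:D}) \;=\; p(\beta) \prod_{d=1}^{D} p(z_d, w_d \mid \beta)
```

Given \beta, each document contributes its own independent factor, which is what makes it possible to process documents in subsets.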
He next compared the speed-accuracy trade-off for three methods:
- Batch inference – ingest all documents at once, so it’s very slow, but eventually optimal
- optimize the probability estimates for the local variables across documents (could be very large)
- optimize the probability estimates for the global variables.
- Stochastic inference – ingest small subsets of the documents
- optimize the probability estimates for the local variables across documents (could be very large)
- take a step toward improving the probability estimates for the global variables.
- Repeat using the next subset of the documents
- MCMC, should be more accurate, but #MCMC is incredibly slow, so it can only be run on a subset
John showed that the stochastic inference method converges quickest to an accurate out-of-sample model.
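The stochastic pattern (look at a small subset, then take a partial step on the global estimate) can be illustrated with a toy problem. This is not LDA: the "global variable" here is just the mean of a data set, and the decaying step size (t + 1)^(-kappa) with kappa = 0.7 is an assumed schedule chosen for the example.

```python
import random

def stochastic_estimate(data, batch_size=100, steps=100, kappa=0.7, seed=0):
    """Refine a global estimate from small random subsets of the data."""
    rng = random.Random(seed)
    estimate = 0.0
    for t in range(steps):
        batch = rng.sample(data, batch_size)          # ingest a small subset
        batch_estimate = sum(batch) / batch_size      # estimate from the subset alone
        rho = (t + 1) ** -kappa                       # step size shrinks over time
        estimate = (1 - rho) * estimate + rho * batch_estimate  # partial step
    return estimate

data_rng = random.Random(42)
data = [data_rng.gauss(5.0, 1.0) for _ in range(10_000)]

batch_answer = sum(data) / len(data)        # "batch": uses every data point at once
stochastic_answer = stochastic_estimate(data)
print(round(batch_answer, 2), round(stochastic_answer, 2))
```

Each step touches only 100 points, yet the running estimate ends up close to the answer computed from the full data set.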
NYAI#7: #DataScience to Operationalize #ML (Matthew Russell) & Computational #Creativity (Dr. Cole)
Posted on November 22nd, 2016
11/22/2016 Rise, 43 West 23rd Street, NY 2nd floor
Speaker 1: Using Data Science to Operationalize Machine Learning – (Matthew Russell, CTO at Digital Reasoning)
Speaker 2: Top-down vs. Bottom-up Computational Creativity – (Dr. Cole D. Ingraham DMA, Lead Developer at Amper Music, Inc.)
Matthew Russell @DigitalReasoning spoke about understanding language using NLP, relationships among entities, and temporal relationships. For human language understanding, he views technologies such as knowledge graphs and document analysis as becoming commoditized. The only way to get an advantage is to improve the efficiency of using ML: the KPI for data analysis is the number of experiments (tests of a hypothesis) that can be run per unit time. The key is to use tools such as:
- Vagrant – allows a reproducible environment setup.
- Jupyter Notebook – like a lab notebook
- Git – version control
- Automation –
He wants highly repeatable experiments. The goal is to speed up the number of experiments that can be conducted per unit time.
He then talked about using machines to read medical reports and determine the issues. Negatives can be extracted, but issues are harder to find. He uses an ontology to classify entities.
He talked about experiments on models using ontologies. The use of a fixed ontology depends on the content: the ontology of terms for anti-terrorism evolves over time and needs to be experimentally adjusted over time. Medical ontology is probably most static.
In the second presentation, Cole D. Ingraham @Ampermusic talked about top-down vs bottom-up creativity in the composition of music. Music differs from other audio forms since it has a great deal of very large structure as well as the smaller structure. ML does well at generating good audio on a small time frame, but Cole thinks it is better to apply theories from music to create the larger whole. This is a combination of
Top-down: novel&useful, rejects previous ideas – code driven, “hands on”, you define the structure
Bottom-up: data driven, “hands off”, you learn the structure
He then talked about music composition at the intersection of Generation vs. analysis (of already composed music) – can do one without the other or one before the other
To successfully generate new and interesting music, one needs to generate variance. Composing music using a purely probabilistic approach is problematic as there is a lack of structure. He likes an approach similar to replacing words with their synonyms, which does not fundamentally change the meaning of the sentence, but still makes it different and interesting.
It’s better to work on deterministically defined variance than it is to weed out undesired results from nondeterministic code.
As an example he talked about WaveNet (a Google DeepMind project), whose input and output are both raw audio. This approach works well for improving speech synthesis, but less well for music generation as there is no large scale structural awareness.
Cole then talked about Amper, a web site that lets users create music with no experience required: fast, believable, collaborative
They like a mix of top-down and bottom-up approaches:
- Want speed, but neural nets are slow
- Music has a lot of theory behind it, so it’s best to let the programmers code these rules
- Can change different levels of the hierarchical structure within music: style, mood, can also adjust specific bars
Runtime written in Haskell – functional language so it’s great for music
#Genomic analysis and #BigData using #FPGA’s
Posted on November 17th, 2016
11/17/2016 @ Phosphorus, 1140 Broadway, NY, 11th floor
Rami Mehio @Edico Genome spoke about the fast analysis of a human genome. They initially did secondary analysis, which is similar to telecommunications (errors in the channel), as errors come from the process due to repeats and mistakes in the sequencer.
Genomic data doubles every 7 months historically, but the computational speed to do the analysis lags, as Moore’s law has a doubling every 18 months. With standard CPUs, mapping takes 10 to 30 hours on a 24 core server. Quality control adds several hours.
In addition, a human genome file is an 80GB FASTQ file. (This is only for a rough look at the genome at 30x = # times DNA is multiplied = # times the analysis is redone.)
Using FPGAs reduced the analysis time to 20 minutes. Also the files in CRAM compression are reduced to 50GB.
The server code is in C/C++. The FPGAs are not programmed, but their connectors are specified using the VITAL or VHDL languages.
HMM and Smith-Waterman algorithms require the bulk of the processing time, so both are implemented in the FPGAs. Other challenges are to get sufficient data to feed the FPGA, which means the software needs to run in parallel. Also, the FPGAs are configured so they can change the algorithm selectively to take advantage of what needs to be done at the time.
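For readers unfamiliar with Smith-Waterman, here is a plain software sketch of the local-alignment score it computes. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for the example, and this bears no relation to how Edico lays the algorithm out on an FPGA.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))         # four matches
print(smith_waterman("TTACGTTT", "GGACGGG"))  # shared core "ACG"
```

Every cell depends only on its three neighbors, so all cells on the same anti-diagonal can be computed at once, which is one reason the algorithm suits parallel hardware.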
Listening to Customers as you develop, assembling a #genome, delivering food boxes
Posted on September 21st, 2016
09/21/2016 @FirstMark, 100 Fifth Ave, NY, 3rd floor
JJ Fliegelman @WayUp (formerly CampusJob) spoke about the development process used by their application which is the largest market for college students to find jobs. JJ talked about their development steps.
He emphasized the importance of specing out ideas on what they should be building and talking to your users.
They use tools to stay in touch with their customers:
- HelpScout – see all support tickets. Get the vibe
- FullStory – DVR software – plays back video recordings of how users are using the software
They also put ideas in a repository using Trello.
To illustrate their process, he examined how they work to improve job search relevance.
They look at Impact per unit Effort to measure the value. They do this across new features over time. Can prioritize and get multiple estimates. It’s a probabilistic measure.
Assessing impact – are people dropping off? Do people click on it? What are the complaints? They talk to experts using cold emails. They also cultivate a culture of educated guesses
Assess effort – get it wrong often and get better over time
They prioritize impact/effort with the least technical debt
They Spec & Build – (product, architecture, kickoff) to get organized
Clubhouse is their project tracker: readable by humans
Architecture spec to solve today’s problem, but look ahead. E.g. the initial architecture used WordNet and Elasticsearch, but they found that Elasticsearch was too slow so they moved to a graph database.
Build – build as little as possible; prototype; adjust your plan
Deploy – they will deploy things that are not worse (e.g. a button that doesn’t work yet)
They do code reviews to avoid deploying bad code
Paul Fisher @Phosphorus (from Recombine – formerly focused on the fertility space: carrier-screening. Now emphasize diagnostic DNA sequencing) talked about the processes they use to analyze DNA sequences. With the rapid development of laboratory technique, it’s a computer science question now. Use Scala, Ruby, Java.
Sequencers produce hundreds of short reads of 50 to 150 base pairs. They use a reference genome to align the reads. Want multiple reads (depth of reads) to create a consensus sequence
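The idea of read depth and a consensus sequence can be sketched as a simple per-position majority vote. The snippet assumes the reads have already been aligned (each comes with a hypothetical start position on the reference) and ignores indels and base quality, so it is far simpler than a production pipeline.

```python
from collections import Counter

def consensus(reference_length, aligned_reads):
    """Majority-vote base call and read depth at each reference position.

    aligned_reads is a list of (start_position, read_string) pairs.
    """
    pileup = [Counter() for _ in range(reference_length)]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < reference_length:
                pileup[pos][base] += 1
    calls, depth = [], []
    for counts in pileup:
        depth.append(sum(counts.values()))
        # "N" marks positions that no read covers.
        calls.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(calls), depth

reads = [(0, "ACGT"), (2, "GTTA"), (1, "CGTT"), (2, "GATA")]
sequence, depth = consensus(6, reads)
print(sequence, depth)
```

At position 3 one read disagrees with the other three, and the higher depth there is what lets the majority call win.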
To lower cost and speed their analysis, they focus on particular areas to maximize their read depth.
They use a variant viewer to understand variants between the person’s and the reference genome:
- SNPs – one base is changed – degree of pathogenicity varies
- Indels – insertions & deletions
- CNVs – copy number variations
They use several different file formats: FASTQ, Bam/Sam, VCF
Current methods have evolved to use Spark, Parquet (columnar storage db), and Adam (use Avro framework for nested collections)
Use Zeppelin to share documentation: documentation that you can run.
Finally, Andrew Hogue @BlueApron spoke about the challenges he faces as the CTO. These include
Demand forecasting – use machine learning (random forest) to predict per user what they will order. Holidays are hard to predict. People order less lamb and avoid catfish. There was also a dip in orders and orders with meat during Lent.
Fulfillment – more than just inventory management since recipes change, food safety, weather, …
Subscription mechanics – weekly engagement with users. So opportunities to deepen engagement. Frequent communications can drive engagement or churn. A/B experiments need more time to run
BlueApron runs 3 fulfillment centers for their weekly food deliveries (NJ, Texas, CA), shipping 8 million boxes per month.
NYAI#5: Neural Nets (Jason Yosinski) & #ML For Production (Ken Sanford)
Posted on August 24th, 2016
08/24/2016 @Rise 43 West 23rd Street, NY, 2nd floor
Jason Yosinski @GeometricTechnology spoke about his work on #NeuralNets to generate pictures. He started by talking about machine learning with feedback to train a robot to move more quickly and using feedback to computer-generate pictures that are appealing to humans.
Jason next talked about AlexNet, based on work by Krizhevsky et al 2012, to classify images using a neural net with 5 convolutional layers (interleaved with max pooling and contrast layers) plus 3 fully connected layers at the end. The net with 60 million parameters was trained on ImageNet, which contains over 1 million images. His image classification code is available on http://Yosinski.com.
Jason talked about how the classifier thinks about categories when it is not being trained to identify that category. For instance, the network may learn about faces even though there is no human category since it helps the system detect things such as hats (above a face) to give it context. It also identifies text to give it context on other shapes it is trying to identify.
He next talked about generating images by inputting random noise and randomly changing pixels. Some changes will cause the confidence in the goal (such as ‘lion’) to increase. Over many random moves, the confidence level for the goal increases. Jason showed many random images that elicited high levels of confidence, but the images often looked like purple-green slime. This is probably because the network, while learning, immediately discards the overall color of the image and is therefore insensitive to aberrations from normal colors. (See Erhan et al 2009)
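The random-change loop described in these notes can be sketched as a simple hill climb. The `toy_confidence` function is a made-up stand-in for a real classifier's output (it just measures closeness to a fixed pattern), so this shows only the search procedure, not the behavior of an actual network.

```python
import random

def toy_confidence(image, target):
    """Hypothetical stand-in for a classifier score: 1.0 when image equals target."""
    error = sum((p - t) ** 2 for p, t in zip(image, target)) / len(image)
    return 1.0 / (1.0 + error)

def random_search(target, steps=2000, seed=0):
    """Start from noise; keep a random pixel change only if confidence rises."""
    rng = random.Random(seed)
    image = [rng.random() for _ in target]
    score = toy_confidence(image, target)
    start_score = score
    for _ in range(steps):
        i = rng.randrange(len(image))
        old = image[i]
        image[i] = rng.random()              # random change to one pixel
        new_score = toy_confidence(image, target)
        if new_score > score:
            score = new_score                # keep the improvement
        else:
            image[i] = old                   # otherwise undo the change
    return image, start_score, score

target = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
image, start_score, final_score = random_search(target)
print(round(start_score, 3), round(final_score, 3))
```

With a real network, many very different images can score highly, which is how a search like this can end up at unnatural-looking results such as those Jason showed.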
[This also raises the question of how computer vision is different from human vision. If presented with a blue colored lion, the first reaction of a human might be to note how the color mismatches objects in the ‘lion’ category. One experiment would be to present the computer model with the picture of a blue lion and see how it is classified. Unlike computers, humans encode information beyond their list of items they have learned and this encoding includes extraneous information such as color or location. Maybe the difference is that humans incorporate a semantic layer that considers not only the category of the items, but other characteristics that define ‘lion-ness’. Color may be more central to human image processing as it has been conjectured that we have color vision so we can distinguish between ripe and rotten fruits. Our vision also taps into our expectation to see certain objects within the world and we are primed to see those objects in specific contexts, so we have contextual information beyond what is available to the computer when classifying images.]
To improve the generated pictures of ‘lions’, he next used a generator to create pictures and change them until they get a picture which has high confidence of being a ‘lion’. The generator is designed to create identifiable images. The generator can even produce pictures of objects that it has not been trained to paint. (Need to apply regularization to get better pictures for the target.)
Slides at http://s.yosinski.com/nyai.pdf
In the second talk, Ken Sanford @Ekenomics and H2O.ai talked about the H2O open source project. H2O is a machine learning engine that can run in R, Python, Java, etc.
Ken emphasized how H2O (a multilayer feed forward neural network) provides a platform that uses the Java Score Code engine. This eases the transition between the model developed in training and the model used to score inputs in a production environment.
He also talked about the Deep Water project which aims to allow other open source tools, such as MXNET, Caffe, Tensorflow,… (CNN, RNN, … models) to run in the H2O environment.
#Visualization Metaphors: unraveling the big picture
Posted on May 19th, 2016
05/18/2016 @TheGraduateCenter CUNY, 365 5th Ave, NY
Manuel Lima ( @mslima ) @Parsons gave examples of #data representations. He first looked back 800 years and talked about Ars Memorativa, the art of memory, a set of mnemonic principles to organize information: e.g. spatial orientation, order of things on paper, chunking, association (to reinforce relations), affect, repetition. (These are also foundation principles of #Gestalt psychology).
Of the many metaphors, trees are most used: e.g. tree of life and the tree of good and evil, genealogy, evolution, laws, …
Manuel then talked about how #trees work well for hierarchical systems, but we are looking more frequently at more complex systems. In science, for instance:
17-19th century – single variable relationships
20th century – systems of relationships (trees)
21st century – organized complexity (networks)
Even the tree of life can be seen as a network once bacteria’s interaction with organisms is overlaid on the tree.
He then showed 15 distinct typologies for mapping networks and showed works of art inspired by networks (the new networkism): 2-d: Emma McNally, 3-d: Tomas Saraceno and Chiharu Shiota.
The following authors were suggested as references on network visualization: Edward Tufte, Jacques Bertin (French philosopher), and Pat Hanrahan (a computer science prof at Stanford extended his work, also one of the founders of Tableau)
#DataDrivenNYC: #FaultTolerant #Web sites, #Finance, Predicting #B2B buying behavior, training #DeepLearning
Posted on May 18th, 2016
05/18/2016 @AXA auditorium, 787 7th Avenue, NY
Four speakers presented:
- Peter Brodsky, Founder and CEO of HyperScience (AI for the enterprise)
- Louis DiModugno, Chief Data and Analytics Officer at AXA US(global leader in insurance)
- Amanda Kahlow, Founder and CEO of 6Sense (B2B predictive intelligence)
- Nicolas Dessaigne, Founder and CEO of Algolia (hosted search API that delivers instant results)
First, Nicolas Dessaigne @Algolia (Subscription service to access a search API) talked about the challenges building a highly fault-tolerant world-wide service. The steps resulted from their understanding of points of failure within their systems and the infrastructure their systems depend on.
Initially, they concentrated on their software development process including failed updates. To overcome these problems, they update one server at a time (within a rack of servers), do partial updates, and use Chef to automate deployment.
Then they migrated from the .io to the .net TLD to avoid slow DNS response times they had seen intermittently in Asia. This was followed by these upgrades:
Feb 2015. Set up clusters of servers world-wide, so users have a server in their region: lower latency
March 2015. Physically separate server clusters within a region to different providers
May 2015. Create fallback DNS servers
July 2015. Put a third data center online to make indexing robust
April 2016. Implement a 1 second granularity for their system monitoring
Next, Matt Turck interviewed Louis DiModugno @AXA. In the US, AXA’s main focus is on predictive underwriting in the insurance process. They also have projects to incorporate sensors into products and correctly route queries to call centers based on the demographics of the customer. World-wide they have three analysis hubs: France, US, Singapore (coming online).
Louis oversees both data and analytics in the U.S. and both he and the CTO report to the CIO. They are interested in expanding their capabilities in areas such as creating unstructured databases from life insurance data that are currently on microfiche.
In the third presentation, Amanda Kahlow @6Sense talked about their business model to provide information to customers in B2B commerce. They analyze business searches, customer web sites, and visits to publishers’ (e.g. Forbes) web sites. Their goal is to determine the timing of customer purchases.
B2B purchases are different from B2C purchases since
- Businesses research their purchases online before they buy
- The research takes time (long sales cycle)
- The decision to buy involves multiple people within the company
So, there are few impulse buys and buyer behavior signals that a purchase is imminent.
The main CMO question is when (not who).
6sense ties data across searches (anonymous data). The goal is to identify when companies are in a specific part of the buying cycle, so sales can approach them now. (Example: show click-to-chat when the analytics says that the customer is ready to buy)
Lastly, Peter Brodsky @HyperScience spoke about tools they are developing to speed machine learning. These include
- Tools to make it easier to add new data sets
- need to match fields, such as date which may be in different formats
- what to do with missing data
- need labeled data – lots of examples
- Speed up training time
The speed up is done by identifying subnets within the larger neural network. The subnets perform distinct functions. To determine if two subnets (in different networks) are equivalent, move one subnet from one network to replace another subnet in another network and see if the function is unchanged: Freeze the weights within the subnet and outside the subnet. Retrain the interface between the net and the subnet.
This creates building blocks which can be combined into larger blocks. These blocks can be applied to jump start the training process.
#DeepLearning and #NeuralNets
Posted on May 16th, 2016
05/16/2016 @Qplum, 185 Hudson Street, Suite 1620 Plaza 5, Jersey City, NJ
#Raghavendra Boomaraju @ Columbia described the math behind neural nets and how back propagation is used to fit models.
Observations on deep learning include:
- The universal approximation theorem says you can fit any model with one hidden layer, provided the layer has a sufficient number of units. But multiple hidden layers work better. The more layers, the fewer units you need in each layer to fit the data.
- To optimize the weights, back-propagate the loss function. But one does not need to optimize the g() function since g()’s are designed to have a very general shape (such as the logistic)
- Traditionally, fitting has been done by changing all inputs simultaneously (deterministic) or changing one input at a time during optimization (stochastic inputs). More recently, researchers are changing subsets of the inputs (minibatches).
- Convolution operators are used to standardize inputs by size and orientation by rotating and scaling.
- There is a way to visualize what happens within a neural network – see http://colah.github.io/
- The gradient method used to optimize weights needs to be tuned so it is neither too aggressive nor too slow. Adaptive learning (Adam algorithm) is used to determine the optimal step size.
- Deep learning libraries include Theano, Caffe, Google TensorFlow, Torch.
- To do unsupervised deep learning – take inputs through a series of layers that at some point have fewer units than the number of inputs. The ensuing layers expand so the number of units on the output layer matches that of the input layer. Optimize the net so that the outputs match the inputs. The layer with the smallest number of units describes the features in the data set.
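The point in the list above about tuning the gradient step can be seen on a toy problem. The sketch below minimizes f(x) = x^2 with plain gradient descent and three fixed step sizes. The function and the step values are illustrative assumptions; this is not the Adam algorithm, which adapts the step as it goes.

```python
def gradient_descent(step, iterations=50, start=10.0):
    """Minimize f(x) = x**2 from a fixed start using a constant step size."""
    x = start
    for _ in range(iterations):
        grad = 2 * x        # derivative of x**2
        x -= step * grad
    return x

too_slow = gradient_descent(0.001)       # barely moves toward the minimum at 0
well_tuned = gradient_descent(0.4)       # converges quickly
too_aggressive = gradient_descent(1.1)   # overshoots further each step and diverges
print(too_slow, well_tuned, too_aggressive)
```

The same 50 iterations leave the small step far from the minimum and send the large step off to huge values, which is the tuning problem adaptive methods try to remove.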