#Genomic analysis and #BigData using #FPGAs
Posted on November 17th, 2016
11/17/2016 @ Phosphous, 1140 Broadway, NY, 11th floor
Rami Mehio @Edico Genome spoke about the fast analysis of a human genome. Edico initially did secondary analysis, which resembles a telecommunications problem (errors in the channel): here the errors come from repeats in the genome and mistakes made by the sequencer.
Genomic data has historically doubled every 7 months, but the computational speed for the analysis lags: Moore’s law doubles only every 18 months. With standard CPUs, mapping takes 10 to 30 hours on a 24-core server, and quality control adds several hours.
In addition, a human genome is an 80GB FASTQ file. (This is only for a rough look at the genome at 30x coverage, i.e. the average number of times each position is read and re-analyzed.)
Using FPGAs reduced the analysis time to 20 minutes. Also the files in CRAM compression are reduced to 50GB.
The server code is in C/C++. The FPGAs are not programmed procedurally; their connections are specified using the VITAL or VHDL hardware-description languages.
The HMM and Smith-Waterman algorithms consume the bulk of the processing time, so both are implemented in the FPGAs. Another challenge is feeding the FPGAs with enough data, which means the software must run in parallel. The FPGAs are also configured so they can selectively switch algorithms to take advantage of what needs to be done at the time.
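As a sketch of why Smith-Waterman is expensive (the illustrative textbook dynamic program, not Edico's FPGA implementation, and with invented scoring values), the algorithm fills an O(len(a) × len(b)) score matrix:

```python
# Textbook Smith-Waterman local alignment (illustration only; Edico's
# version runs in FPGA hardware). Scores are assumptions: match +2,
# mismatch -1, gap -2.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0,                       # local alignment can restart anywhere
                              diag,                     # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
            best = max(best, score[i][j])
    return best

# A short read aligned against a reference fragment:
print(smith_waterman("ACACACTA", "AGCACACA"))
```

The quadratic matrix fill over millions of reads is what makes this the hot spot worth moving into hardware.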
Listening to Customers as you develop, assembling a #genome, delivering food boxes
Posted on September 21st, 2016
09/21/2016 @FirstMark, 100 Fifth Ave, NY, 3rd floor
JJ Fliegelman @WayUp (formerly CampusJob) spoke about the development process behind their application, the largest marketplace for college students to find jobs. JJ walked through their development steps.
He emphasized the importance of speccing out ideas for what they should be building and talking to your users.
They use tools to stay in touch with their customers:
- HelpScout – see all support tickets. Get the vibe
- FullStory – DVR software – plays back video recordings of how users are using the software
They also put ideas in a repository using Trello.
To illustrate their process, he examined how they worked to improve job-search relevance.
They measure value as impact per unit effort, assessed across new features over time. This lets them prioritize and gather multiple estimates; it’s a probabilistic measure.
Assessing impact – are people dropping off? Do people click on it? What are the complaints? They talk to experts using cold emails. They also cultivate a culture of educated guesses
Assess effort – get it wrong often and get better over time
They prioritize impact/effort with the least technical debt
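The impact-per-unit-effort prioritization described above can be sketched as follows (the feature names and estimates are invented for illustration; WayUp's actual scoring is not public):

```python
# Hypothetical impact/effort prioritization with multiple estimates per
# feature. Each number is one person's guess; since every input is an
# estimate, the resulting score is probabilistic and gets re-estimated
# as new information arrives.
features = {
    "better search relevance": {"impact": [8, 6, 9], "effort": [5, 7]},
    "email digests":           {"impact": [3, 4],    "effort": [2, 2]},
    "profile redesign":        {"impact": [5, 5, 6], "effort": [8, 9]},
}

def mean(xs):
    return sum(xs) / len(xs)

# Rank features by expected impact divided by expected effort.
ranked = sorted(features,
                key=lambda f: mean(features[f]["impact"]) / mean(features[f]["effort"]),
                reverse=True)
print(ranked)
```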
They Spec & Build – (product, architecture, kickoff) to get organized
Clubhouse is their project tracker: readable by humans
An architecture spec should solve today’s problem but look ahead. E.g., their initial architecture used WordNet and Elasticsearch, but Elasticsearch proved too slow, so they moved to a graph database.
Build – build as little as possible; prototype; adjust your plan
Deploy – they will deploy things that are not worse (e.g. a button that doesn’t work yet)
They do code reviews to avoid deploying bad code
Paul Fisher @Phosphorus (from Recombine, which formerly focused on carrier screening in the fertility space and now emphasizes diagnostic DNA sequencing) talked about the processes they use to analyze DNA sequences. With the rapid development of laboratory techniques, it’s now a computer-science question. They use Scala, Ruby, and Java.
Sequencers produce hundreds of short reads of 50 to 150 base pairs. They use a reference genome to align the reads, and want multiple overlapping reads (read depth) to create a consensus sequence.
To lower cost and speed their analysis, they focus on particular areas to maximize their read depth.
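A much-simplified sketch of consensus calling from aligned reads (majority vote per position; real pipelines also weigh base quality and mapping quality, and the read data here is invented):

```python
from collections import Counter

# Reads are already aligned: each is (start position on the reference,
# bases). The consensus takes the most common base at each position.
reads = [
    (0, "ACGTAC"),
    (2, "GTACGT"),
    (4, "ACGTTA"),
]

def consensus(reads, length):
    piles = [Counter() for _ in range(length)]
    for start, bases in reads:
        for offset, base in enumerate(bases):
            piles[start + offset][base] += 1
    # "N" marks positions with no read coverage.
    return "".join(p.most_common(1)[0][0] if p else "N" for p in piles)

print(consensus(reads, 10))
```

This also shows why read depth matters: positions covered by a single read have no redundancy against sequencer errors.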
They use a variant viewer to understand variants between the person’s and the reference genome:
- SNPs – one base is changed – degree of pathogenicity varies
- Indels – insertions & deletions
- CNVs – copy-number variations
They use several different file formats: FASTQ, BAM/SAM, VCF
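A minimal sketch of telling SNPs and indels apart from VCF-style REF/ALT fields (the records are hypothetical; real VCF files carry headers plus QUAL, FILTER, INFO, and sample columns, and CNVs are usually inferred from read depth rather than from REF/ALT):

```python
# Classify variants from minimal VCF-style records (CHROM, POS, ID, REF, ALT).
def classify(ref, alt):
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"            # one base changed
    if len(ref) < len(alt):
        return "insertion"      # bases added relative to the reference
    if len(ref) > len(alt):
        return "deletion"       # bases removed relative to the reference
    return "MNP"                # multi-base substitution of equal length

lines = [
    "chr1\t12345\t.\tA\tG",
    "chr1\t22222\t.\tA\tACGT",
    "chr2\t33333\t.\tACG\tA",
]
for line in lines:
    chrom, pos, _, ref, alt = line.split("\t")
    print(chrom, pos, classify(ref, alt))
```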
Current methods have evolved to use Spark, Parquet (a columnar storage format), and ADAM (which uses the Avro framework for nested collections)
They use Zeppelin to share documentation: documentation that you can run.
Finally, Andrew Hogue @BlueApron spoke about the challenges he faces as CTO. These include:
Demand forecasting – they use machine learning (random forests) to predict what each user will order. Holidays are hard to predict. People order less lamb and avoid catfish; there was also a dip in orders, and in orders with meat, during Lent.
Fulfillment – more than just inventory management, since recipes change and food safety, weather, etc. all matter
Subscription mechanics – weekly engagement with users creates opportunities to deepen engagement. Frequent communications can drive engagement or churn, and A/B experiments need more time to run.
BlueApron runs 3 fulfillment centers (NJ, Texas, CA) for their weekly food deliveries, shipping 8 million boxes per month.
DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.
Posted on June 18th, 2016
06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)
The four speakers were
- Nitay Joffe, Founder and CTO of ActionIQ (next-generation data platform for marketing and consumer data)
- Adam Kanouse, CTO of Narrative Science (transforms data into meaningful and insightful narratives)
- Neha Narkhede, Founder and CTO of Confluent (real-time data platform built around Apache Kafka)
- Christopher Nguyen, Founder and CEO of Arimo (data intelligence platform)
Adam @NarrativeScience talked about how people with different personalities and jobs may require or prefer different takes on the same data. His firm ingests data and has systems that generate natural-language reports customized to the subject area and the reader’s needs.
They currently develop stories with the guidance of experts, but will eventually move to machine learning to automate new subject areas.
Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform which collects data and allows access to these data in real time.
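The core idea behind a streaming platform — an append-only log that producers write to and consumers read from at their own offsets — can be sketched in a few lines (a toy model for illustration, not Kafka's actual client API; real clients talk to a broker):

```python
# Toy append-only log with per-consumer offsets, illustrating how a
# streaming platform lets many consumers read the same data in real time.
class Log:
    def __init__(self):
        self.records = []
        self.offsets = {}          # consumer name -> next offset to read

    def produce(self, record):
        self.records.append(record)

    def consume(self, consumer):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:]
        self.offsets[consumer] = len(self.records)  # commit the offset
        return batch

log = Log()
log.produce({"user": 1, "event": "click"})
log.produce({"user": 2, "event": "view"})
print(log.consume("analytics"))   # everything so far
log.produce({"user": 1, "event": "purchase"})
print(log.consume("analytics"))   # only the record produced since
```

Because the log is retained rather than consumed destructively, a new consumer can start from offset 0 and replay history.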
#Visualization Metaphors: unraveling the big picture
Posted on May 19th, 2016
05/18/2016 @TheGraduateCenter CUNY, 365 5th Ave, NY
Manuel Lima ( @mslima ) @Parsons gave examples of #data representations. He first looked back 800 years and talked about Ars Memorativa, the art of memory, a set of mnemonic principles for organizing information: e.g. spatial orientation, order of things on paper, chunking, association (to reinforce relations), affect, repetition. (These are also foundational principles of #Gestalt psychology.)
Of the many metaphors, trees are the most used: e.g. the tree of life, the tree of good and evil, genealogy, evolution, laws, …
Manuel then talked about how #trees work well for hierarchical systems, but we are looking more frequently at more complex systems. In science, for instance:
17-19th century – single variable relationships
20th century – systems of relationships (trees)
21st century – organized complexity (networks)
Even the tree of life can be seen as a network once bacteria’s interaction with organisms is overlaid on the tree.
He then showed 15 distinct typologies for mapping networks, along with works of art inspired by networks (the new networkism): 2-D: Emma McNally; 3-D: Tomás Saraceno and Chiharu Shiota.
The following authors were suggested as references on network visualization: Edward Tufte, Jacques Bertin (the French cartographer and semiologist), and Pat Hanrahan (a computer-science professor at Stanford who extended this work and is also one of the founders of Tableau).
Industrial #Wearables & #IoT
Posted on May 17th, 2016
05/17/2016 @ Manhattan Ballroom, 29 W 36th Street, NY, 2nd floor
Anupam Sengupta @GuardHat (an industrial safety helmet fitted with sensors monitoring 42 conditions) spoke about the helmet and the data back end. The hat is fitted with a camera and microphone, along with sensors for biometrics, geolocation, toxic gases, etc. The helmet is not sold outright, but will be available as a B2B service by year end.
Over an 8-hour shift it transmits 20 MB of data. A typical work site has 100 to 300 workers and up to 3 shifts per day.
There is local processing on the device, and data are sent in real time for aggregation; time to detect an event is 2 seconds. At the aggregation point, external data are added: weather data, and location data as the building site changes.
HPCC is the back end with Lambda architecture so the data can be processed both in real time and for historical analysis.
Design considerations include:
- Lack of reliability in the data channel; “event stop” – data volume exceeds plan (here several people are involved with the event); devices don’t send the data stream every time; the schema varies over time with conditions; temporal sequence is not guaranteed
- AES-256 encryption for data transmission and storage
- Radiation shielding within the helmet
- Tracking limitations, agreed upon by unions and companies, to preserve privacy
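The point about temporal sequence not being guaranteed is commonly handled by buffering. A sketch of one such approach (an assumption for illustration, not GuardHat's actual pipeline): hold events in a min-heap and release each one only after the stream has advanced past it by a delay bound.

```python
import heapq

# Restore temporal order from a stream whose delivery order is not
# guaranteed: buffer events in a min-heap keyed by timestamp, and emit
# an event only once a newer arrival shows the stream is at least
# max_delay seconds past it.
def reorder(events, max_delay):
    heap, out = [], []
    for ts, payload in events:
        heapq.heappush(heap, (ts, payload))
        while heap and heap[0][0] <= ts - max_delay:
            out.append(heapq.heappop(heap))
    while heap:                       # flush whatever remains
        out.append(heapq.heappop(heap))
    return out

# Sensor readings arriving out of order (invented example data):
stream = [(3, "gas"), (1, "geo"), (2, "bio"), (6, "geo"), (4, "gas")]
print(reorder(stream, max_delay=2))
```

The trade-off is latency: a larger delay bound tolerates more reordering but pushes against the 2-second event-detection target.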
#DeepLearning and #NeuralNets
Posted on May 16th, 2016
05/16/2016 @Qplum, 185 Hudson Street, Suite 1620 Plaza 5, Jersey City, NJ
Raghavendra Boomaraju @Columbia described the math behind neural nets and how back-propagation is used to fit models.
Observations on deep learning include:
- The universal approximation theorem says you can fit any model with one hidden layer, provided the layer has a sufficient number of units. But multiple hidden layers work better: the more layers, the fewer units you need in each layer to fit the data.
- To optimize the weights, back-propagate the loss function. But one does not need to optimize the activation functions g(), since the g()’s are designed to have a very general shape (such as the logistic).
- Traditionally, fitting has been done by using the whole training set for each update (deterministic, or batch) or one example at a time (stochastic). More recently, researchers update on subsets of the examples (minibatches).
- Convolution operators are used to standardize inputs by size and orientation by rotating and scaling.
- There are ways to visualize what happens inside a neural network – see http://colah.github.io/
- The gradient method used to optimize the weights needs to be tuned so it is neither too aggressive nor too slow. Adaptive learning (the Adam algorithm) is used to determine the optimal step size.
- Deep-learning libraries include Theano, Caffe, Google TensorFlow, and Torch.
- To do unsupervised deep learning, take the inputs through a series of layers that at some point have fewer units than the number of inputs; the ensuing layers expand so that the number of units in the output layer matches that of the input layer (an autoencoder). Optimize the net so that the outputs match the inputs. The layer with the smallest number of units then describes the features in the data set.
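That last bullet describes an autoencoder. A minimal toy version (pure Python, linear activations, invented data; a sketch, not the presenter's code) trains a 3 → 1 → 3 net so the outputs reconstruct the inputs:

```python
import random

# Toy linear autoencoder: 3 inputs -> 1 hidden unit -> 3 outputs.
# The single hidden unit is the bottleneck that summarizes the data.
random.seed(0)

# Invented data: points lying exactly along the direction (1, 2, 3),
# so one hidden unit is enough to describe them.
data = [[t * 1.0, t * 2.0, t * 3.0] for t in [0.1, 0.2, 0.5, 0.8, 1.0]]

w = [random.uniform(-0.1, 0.1) for _ in range(3)]   # encoder weights
v = [random.uniform(-0.1, 0.1) for _ in range(3)]   # decoder weights

def loss():
    total = 0.0
    for x in data:
        h = sum(wi * xi for wi, xi in zip(w, x))      # encode
        r = [vi * h for vi in v]                       # decode
        total += sum((ri - xi) ** 2 for ri, xi in zip(r, x))
    return total / len(data)

lr = 0.01
before = loss()
for _ in range(500):
    gw, gv = [0.0] * 3, [0.0] * 3
    for x in data:
        h = sum(wi * xi for wi, xi in zip(w, x))
        r = [vi * h for vi in v]
        e = [2 * (ri - xi) for ri, xi in zip(r, x)]    # dLoss/dr_i
        for i in range(3):
            gv[i] += e[i] * h                                     # dLoss/dv_i
            gw[i] += sum(e[j] * v[j] for j in range(3)) * x[i]    # dLoss/dw_i
    w = [wi - lr * gwi / len(data) for wi, gwi in zip(w, gw)]
    v = [vi - lr * gvi / len(data) for vi, gvi in zip(v, gv)]
after = loss()
print(before, after)   # reconstruction error drops as training proceeds
```

After training, the hidden activation h is the learned one-dimensional feature describing each data point.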
Evolving from #RDBMS to #NoSQL + #SQL
Posted on May 3rd, 2016
05/03/2016 @Thoughtworks, 99 Madison Ave, 15th floor, NY
Jim Scott @MAPR spoke about #ApacheDrill, which has a query language that extends ANSI SQL. Drill provides an interface that uses this SQL extension to access data in underlying stores that are SQL, NoSQL, CSV, etc.
The OJAI API has the following advantages:
- Gson (in #Java) needs two lines of code to serialize #JSON into the data store, and one line to deserialize
- Idempotent – so there is no need to worry about replaying actions twice if there is an issue
- Drill requires Java but not Hadoop, so it can run on a desktop
- Schema on the fly – will take different data formats and join them together: e.g. csv + JSON
- Data are accessed directly from the underlying databases without first transforming them into a metastore
- Security – plugs into the authentication mechanisms of the underlying DBs. Mechanisms can go through multiple chains of ownership. Security can be applied at the row and column levels.
- Commands extend SQL to allow access lists in a JSON structure
- Can create views to output Parquet, CSV, or JSON formats
- FLATTEN – explodes an array in a JSON structure into multiple rows, with all other fields duplicated
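What FLATTEN does can be emulated in a few lines of Python (illustration only — in Drill it is a SQL function; the record here is invented):

```python
import json

# Emulate Drill's FLATTEN: explode an array field in a JSON record into
# one row per element, duplicating all the other fields.
record = json.loads('{"name": "alice", "skills": ["sql", "json", "csv"]}')

def flatten(record, field):
    return [{**record, field: value} for value in record[field]]

for row in flatten(record, "skills"):
    print(row)
```

Each output row pairs `"name": "alice"` with one element of the `skills` array, which is exactly the row-multiplication behavior the bullet describes.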
Once upon a #Graph
Posted on April 21st, 2016
04/20/2016 @ Davis Auditorium, Columbia U., NY
Jennifer Chayes @Microsoft talked about research characterizing large #networks. This is especially relevant given the apparently non-random friendship patterns on #Facebook, with stars having exceptionally large numbers of connections and clusters based on geography, socioeconomic status, etc. It is unlikely that a totally randomly generated graph would have a similar structure. If it’s not random, then how do we characterize the network? Since the networks are extremely large, many of the classical measures of network structure, such as maximum distances, need to be viewed from the point of view of their limit properties as the network grows very large over time.
One key concept is the stochastic #block model used to describe social structure in groups (Holland & Leinhardt). Here the points in the network are divided into k species with different propensities to link to members of their own group and to members of each other group.
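A stochastic block model is straightforward to sample. This sketch (with illustrative, invented parameters) draws a two-block graph in which within-block edges are far more likely than between-block edges:

```python
import random

# Sample a stochastic block model: nodes in the same block connect with
# probability p_in, nodes in different blocks with probability p_out.
# With p_in >> p_out the graph shows visible community structure.
def sample_sbm(block_sizes, p_in, p_out, seed=0):
    rng = random.Random(seed)
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.append((i, j))
    return labels, edges

labels, edges = sample_sbm([20, 20], p_in=0.5, p_out=0.05)
within = sum(labels[i] == labels[j] for i, j in edges)
print(len(edges), within)   # most edges fall within blocks
```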
She reviewed work done on dense, finite graphs, highlighting the cut norm (Frieze–Kannan), which characterizes network structure by the number of edges that must be cut to divide the network into two parts, considering all possible partitions of the points. By estimating a limit object W (a graphon), one can characterize the network as its size increases to a limit.
Social networks, however, are sparse. This affects the estimation of W, since W converges to zero as a sparse network grows to its limit. To get around this problem, Jennifer proposed two methods that weight the edges by the network density. The measures produce different estimates, but both converge to non-zero values for sparse networks when combined with extended stochastic block models.
The last question she explored was what statistical information can be released without violating privacy. The key concept is having statistics for which the deletion of any point (individual) changes the network statistics by no more than some epsilon.
Accelerating Data Science with Spark and Zeppelin
Posted on February 17th, 2016
02/17/2016 @ADP, 135 West 18th Street, NY
Two presenters talked about tools designed to work with Spark and other data analysis tools.
Oleg @Hortonworks spoke about #Apache #NiFi, an interactive GUI to orchestrate the flow of data. The drag and drop interface specifies how cases are streamed across applications with limited capabilities to filter or edit the data streams. The tool handles asynchronous processes and can be used to document how each case passes from process to process.
Next, Vinay Shukla talked about #Zeppelin, a notebook similar to #Jupyter (IPython) for displaying simple graphics (such as bar, line, and pie charts) in a browser window. It supports multiple languages on the back end (such as #Python) and is integrated with #Spark and #Hadoop.
Support for #R and visualizations at the level of #ggplot are scheduled for introduction in the second half of 2016.
Several audience members asked questions about who would use NiFi and Zeppelin since the tools do not have the analytic power of R or other analysis tools, yet their use requires more data sophistication than when using Excel or other business presentation tools.
Posted on January 26th, 2016
01/26/2016 @UrbanFuturesLab, 19th floor, 15 MetroTech, Jay Street, Brooklyn
David Moore @PPF spoke about his organization’s projects to make it easier to access and track legislation at the federal, state, and local levels. The NY City Council has a web site, but it’s hard to find information there. #Councilmatic provides a user-friendly interface that can be searched by committee meetings, your council member, bills being heard, etc. The site makes it easier to find all proposed legislation on an issue, track the hearings on a bill, submit comments to council members, etc.
NYC.councilmatic.org covers the New York City Council and was launched on Sep 30, 2015. They have similar sites for Chicago and Philadelphia, and PPF is acquiring the resources to add other U.S. cities.
Earlier, David created the OpenCongress site, a user-friendly interface to federal bills that included bill summaries, annotated bill text, etc. The data were acquired by screen-scraping public sites.
Next, the Open States project created similar sites for each state.
The Sunlight Foundation provided some funding for these ventures, but now concentrates only on the federal and state levels; PPF concentrates on cities.
To standardize their data presentation, PPF also created a data standard: Open Civic Data.