DataDrivenNYC: making #BigData and #SmartData easier
Posted on March 18th, 2015
03/17/2015 @Bloomberg, 731 Lexington Ave. 7th Floor, New York, NY
The four presenters were:
- Paul Dix, CEO of InfluxDB (open-source, distributed, time series database)
- Ben Medlock, Co-Founder and CTO of SwiftKey (smart prediction technology for easier mobile typing)
- Ion Stoica, CEO of Databricks (hosted Big Data service, founded by the creators of Apache Spark)
- Ryan Smith, Founder and CEO of Qualtrics (leading provider of online survey software for data-driven decisions)
Paul Dix @influxDB first talked about why time series data, such as sensor data, DevOps metrics and financial time series, are best held in a database designed specifically for such data. The main argument is that general-purpose databases can handle time series data, but they don't scale as well as a specialized database. As the data grows in a generic database, the user either faces slower queries or is limited to analyzing summary data. The alternative is to greatly increase the complexity of the supporting code that runs the infrastructure.
InfluxDB provides the needed specialized infrastructure. The interface is a SQL-like query language which returns results in JSON format. The database handles both regular and irregular time series. It also allows you to write queries to determine the structure of the data.
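To make the idea of summarizing an irregular time series concrete, here is a minimal sketch in plain Python. It is not InfluxDB's query language or API; the function name `downsample_mean` and the sample readings are made up for illustration. It only shows the kind of fixed-window averaging that a time series query typically performs.

```python
from collections import defaultdict


def downsample_mean(points, window_seconds):
    """Group (timestamp, value) points into fixed windows and average each window.

    Hypothetical helper for illustration only; not part of any InfluxDB client.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of the window it falls in.
        window_start = ts - (ts % window_seconds)
        buckets[window_start].append(value)
    # Windows with no readings are simply absent from the result.
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Irregularly spaced sensor readings: (unix seconds, value). Invented sample data.
readings = [(0, 10.0), (7, 14.0), (61, 20.0), (65, 22.0), (118, 24.0), (190, 5.0)]
summary = downsample_mean(readings, 60)
print(summary)
```

A general-purpose database can do the same grouping, but as Paul argued, the cost of running it over a growing volume of raw points is where a specialized store is meant to help.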
InfluxDB is written in Go and, unlike KDB, is open source (as opposed to just being free).
Next, Ben Medlock @SwiftKey talked about how SwiftKey built the world’s smartest keyboard, which is used on many smartphones and tablets. The goal is to improve typing accuracy by using AI. Ben split the processes into three categories:
- context – next word prediction, language detection
- input – error correction using the proximity of keys on the keyboard
- prior knowledge
Language detection and next word prediction were originally based on a one-trillion-word, multi-language database. SwiftKey created several independent models of word distribution, which are amalgamated to create a set of best guesses for the next word.
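One simple way to amalgamate several word-distribution models is a weighted sum of their probabilities. The sketch below assumes that approach; the talk did not say how SwiftKey actually combines its models, and the model contents, weights and the name `combine_models` are hypothetical.

```python
def combine_models(models, weights, top_n=3):
    """Blend several next-word distributions by weighted sum and return the top guesses.

    Assumed technique (linear interpolation), not SwiftKey's confirmed method.
    """
    combined = {}
    for model, weight in zip(models, weights):
        for word, prob in model.items():
            combined[word] = combined.get(word, 0.0) + weight * prob
    # Rank candidate words by their blended probability, highest first.
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]


# Two invented models of P(next word | context).
general_model = {"you": 0.5, "the": 0.3, "them": 0.2}
personal_model = {"you": 0.2, "tomorrow": 0.6, "the": 0.2}

suggestions = combine_models([general_model, personal_model], [0.5, 0.5])
print(suggestions)
```

Changing the weights shifts how much each model influences the final list of suggestions.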
The keyboard model is based on the probability that a neighboring key is struck in error. Over time, the parameters of the Gaussian distribution of errors around each key become unique to each user.
Ben also talked about the need to place a profanity/obscenity filter on the vocabulary and how SwiftKey had recently built the new interface for Stephen Hawking’s speech generation tool. He also noted that his original work was on spam filtering software.
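As a rough sketch of such a keyboard model, the code below scores a tap against a Gaussian centered on each key and normalizes the scores into probabilities. The key positions, the shared spread `sigma` and the function names are assumptions for illustration; a per-user model would adjust each key's center and spread from that user's typing history, which is not shown here.

```python
import math


def key_likelihood(tap, center, sigma):
    """Density of an isotropic 2D Gaussian centered on a key, evaluated at the tap."""
    dx, dy = tap[0] - center[0], tap[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def key_posteriors(tap, keys):
    """Normalize per-key likelihoods into probabilities (assumes a uniform prior over keys)."""
    scores = {name: key_likelihood(tap, center, sigma) for name, (center, sigma) in keys.items()}
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Hypothetical layout: key name -> ((x, y) center, sigma), in key-width units.
keys = {"q": ((0.0, 0.0), 0.5), "w": ((1.0, 0.0), 0.5), "e": ((2.0, 0.0), 0.5)}

posteriors = key_posteriors((0.8, 0.1), keys)
best_key = max(posteriors, key=posteriors.get)
print(best_key, posteriors)
```

In a fuller system these probabilities would be combined with the next-word predictions rather than used on their own.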
Next, Ion Stoica @databricks was interviewed by Matt Turck. Ion talked about how his initial work on distributing video over the internet at @Conviva led to his search for real-time analysis tools for #BigData. In 2007 Hadoop had just come out, but it was a batch process and therefore not suitable for real-time work. He felt that Hadoop set a standard for handling storage and resources, but that the computation layer could be improved. From this, Spark was developed. Spark brings the following advantages to the analysis of very large databases:
- Faster – optimized in-memory processing
- Easy to use API – easier to write applications – has 100 operators beyond just map and reduce
- Unified platform – supports streaming, interactive, query processing, graph based computation
Furthermore, analysis of streaming data can be done using a sequence of micro-batch jobs.
Spark also provides out-of-the-box fault tolerance from the microbatch framework with minimum latencies in the hundredths of seconds. (But its architecture makes it hard to get the single-digit millisecond latency needed by certain applications such as high frequency trading).
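The micro-batch idea can be sketched without Spark itself: slice a time-ordered stream into short fixed intervals and run an ordinary batch job on each slice. The code below is plain Python, not the Spark API; `micro_batches`, `count_words`, the one-second interval and the sample events are all hypothetical.

```python
from collections import Counter


def micro_batches(events, batch_interval_ms):
    """Yield (batch_start_ms, payloads) for a time-ordered stream of (timestamp_ms, payload)."""
    batch, batch_start = [], None
    for ts, payload in events:
        start = ts - (ts % batch_interval_ms)
        if batch_start is not None and start != batch_start:
            # The interval has rolled over, so emit the finished batch.
            yield batch_start, batch
            batch = []
        batch_start = start
        batch.append(payload)
    if batch:
        yield batch_start, batch


def count_words(batch):
    """The 'batch job': an ordinary computation run on one small batch."""
    return Counter(batch)


# Invented stream of (milliseconds, word) events.
stream = [(100, "spark"), (400, "hadoop"), (900, "spark"), (1200, "spark"), (2700, "hadoop")]

running_total = Counter()
per_batch = []
for start, batch in micro_batches(stream, 1000):
    counts = count_words(batch)
    per_batch.append((start, dict(counts)))
    running_total.update(counts)

print(per_batch)
print(running_total)
```

Because results only appear once an interval closes, the batch interval puts a floor on latency, which is consistent with the limitation Ion mentioned for applications that need single-digit millisecond responses.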
Databricks is a cloud-based service built on top of Spark which adds cluster management tools, organization of work spaces, and dashboards.
Lastly, Ryan Smith @Qualtrics was interviewed by Matt Turck. Qualtrics was recently valued in excess of $1B and was started in Provo, Utah by Ryan and his father. The goal was to make experimental design as easy as possible.
Initially the product was marketed to academics. As the students moved into the work force, they introduced their companies to the tool. The low-cost model was originally used to entice academic usage, but it has proven important because mid-level corporate users in research could easily budget for it and spread its use.
Ryan talked about how their product fits well with current business models, which stress customer feedback, empowerment of mid-level corporate users to use new technology, and the search for efficient ways to analyze data with fewer analysts.
He also talked about how Qualtrics is building an ecosystem consisting of new ways to use their questionnaires and better ways to analyze and present results.