New York Tech Journal
Tech news from the Big Apple

DataDrivenNYC: making #BigData and #SmartData easier

Posted on March 18th, 2015

#DataDriven NYC

03/17/2015 @Bloomberg, 731 Lexington Ave. 7th Floor, New York, NY

The four presenters were Paul Dix (@influxDB), Ben Medlock (@SwiftKey), Ion Stoica (@databricks), and Ryan Smith (@Qualtrics).

Paul Dix @influxDB first talked about why time series data, such as sensor data, DevOps metrics, and financial time series, are best held in a database designed specifically for such data. The main argument is that general-purpose databases can handle time series data but don’t scale as well as a specialized database. As the data grows in a generic database, the user either faces slower queries or is limited to analyzing summary data. The alternative is to greatly increase the complexity of the supporting code that runs the infrastructure.

InfluxDB provides the needed specialized infrastructure. The interface is a SQL-like query language that returns results in JSON format. The database handles both regular and irregular time series. It also allows you to write queries that reveal the structure of the data.
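To make the idea of summary data over an irregular time series concrete, here is a minimal sketch in plain Python. It is not InfluxDB's query language or implementation; the function name `downsample_mean` and the sample readings are hypothetical, and the sketch simply assumes that timestamps are integer seconds.

```python
from collections import defaultdict


def downsample_mean(points, window_seconds):
    """Group (timestamp, value) pairs into fixed windows and average each one."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of the window that contains it.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    # Return the mean of every non-empty window, ordered by window start.
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical, irregularly spaced sensor readings: (seconds, value).
readings = [(0, 10.0), (7, 12.0), (61, 20.0), (95, 22.0), (130, 30.0)]
summary = downsample_mean(readings, 60)
print(summary)
```

A specialized time series database does this kind of windowed aggregation inside the query engine, so the user does not have to maintain code like this as the data grows.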

InfluxDB is written in Go and, unlike KDB, is open source (as opposed to just being free).

Next, Ben Medlock @SwiftKey talked about how SwiftKey built the world’s smartest keyboard, which is used on many smartphones and tablets. The goal is to improve typing accuracy by using AI. Ben split the processes into three categories:

  1. context – next word prediction, language detection
  2. input – error correction using the proximity of keys on the keyboard
  3. prior knowledge

Language detection and next word prediction were originally based on a one-trillion-word, multi-language database. SwiftKey created several independent models of word distribution, which are amalgamated to create a set of best guesses for the next word.
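The talk did not say how the independent models are amalgamated. As one possible illustration, the sketch below assumes a simple weighted average (linear interpolation) of next-word distributions; the model names, words, probabilities, and weights are all hypothetical.

```python
def combine_models(distributions, weights):
    """Linearly interpolate several next-word probability distributions."""
    combined = {}
    for dist, weight in zip(distributions, weights):
        for word, prob in dist.items():
            combined[word] = combined.get(word, 0.0) + weight * prob
    return combined


def best_guesses(combined, k=3):
    """Return the k most probable words, breaking ties alphabetically."""
    return sorted(combined, key=lambda w: (-combined[w], w))[:k]


# Hypothetical distributions for the word following "see".
general_model = {"you": 0.5, "the": 0.3, "it": 0.2}
personal_model = {"you": 0.2, "lunch": 0.6, "it": 0.2}

combined = combine_models([general_model, personal_model], [0.5, 0.5])
top = best_guesses(combined, k=3)
print(top)
```

Because the weights sum to one and each input is a probability distribution, the combined scores also sum to one, so they can be ranked directly.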

The keyboard model is based on the probability that a neighboring key is struck in error. Over time, the parameters of the Gaussian distribution of errors around each key become unique to each user.
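A minimal sketch of such a keyboard model follows. It assumes an isotropic two-dimensional Gaussian around each key center and a uniform prior over keys; the key layout, coordinates, and spread values are hypothetical, and the per-user parameter updates described in the talk are not shown.

```python
import math

# Hypothetical key centers on a unit grid (three keys of the top row).
KEY_CENTERS = {"q": (0.0, 0.0), "w": (1.0, 0.0), "e": (2.0, 0.0)}


def gaussian_likelihood(touch, center, sigma):
    """Density of an isotropic 2D Gaussian centered on a key."""
    dx = touch[0] - center[0]
    dy = touch[1] - center[1]
    norm = 2.0 * math.pi * sigma * sigma
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) / norm


def key_probabilities(touch, sigmas):
    """Probability of each intended key given a touch point (uniform prior)."""
    likelihoods = {
        key: gaussian_likelihood(touch, center, sigmas[key])
        for key, center in KEY_CENTERS.items()
    }
    total = sum(likelihoods.values())
    return {key: value / total for key, value in likelihoods.items()}


# Assumed per-key spreads; a per-user model would adjust these over time.
sigmas = {"q": 0.5, "w": 0.5, "e": 0.5}
probs = key_probabilities((1.3, 0.1), sigmas)
likely = max(probs, key=probs.get)
print(likely, probs)
```

A touch that lands between "w" and "e" still gives "e" a non-trivial probability, which is what lets the next-word model override the nearest key when the context strongly favors a neighbor.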

Ben also talked about the need to place a profanity/obscenity filter on the vocabulary and how SwiftKey had recently built the new interface for Stephen Hawking’s speech generation tool. He also noted that his original work was on spam filtering software.

Next, Ion Stoica @databricks was interviewed by Matt Turck. Ion talked about how his initial work on distributing video over the internet at @Conviva led to his search for real-time analysis tools for #BigData. In 2007 Hadoop had just come out, but it was a batch process and therefore not suitable for real-time work. He felt that Hadoop set a standard for handling storage and resources, but that the computation layer could be improved. From this, Spark was developed. Spark brings the following advantages to the analysis of very large databases:

  1. Faster – optimized in-memory processing
  2. Easy to use API – easier to write applications – has 100 operators beyond just map and reduce
  3. Unified platform – supports streaming, interactive, query processing, graph based computation

Furthermore, analysis of streaming data can be done using a sequence of micro-batch jobs.

Spark also provides out-of-the-box fault tolerance from the micro-batch framework, with minimum latencies in the hundredths of seconds. (But its architecture makes it hard to get the single-digit millisecond latency needed by certain applications such as high frequency trading.)
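The micro-batch idea can be sketched in plain Python. This is a stand-in for the concept, not Spark's API: the function `micro_batches`, the event stream, and the one-second interval are hypothetical, and the sketch assumes events arrive sorted by an integer millisecond timestamp.

```python
def micro_batches(events, batch_ms):
    """Yield (batch_start_ms, payloads) for consecutive fixed-length intervals."""
    current_start = None
    current = []
    for timestamp_ms, payload in events:  # assumed sorted by timestamp
        start = (timestamp_ms // batch_ms) * batch_ms
        if current_start is not None and start != current_start:
            # The interval has closed: hand the finished batch to a small job.
            yield current_start, current
            current = []
        current_start = start
        current.append(payload)
    if current:
        yield current_start, current


# Hypothetical stream of (timestamp in ms, value) events.
events = [(100, 3), (400, 5), (1200, 7), (2700, 1), (2900, 2)]

# Each micro-batch is processed by an ordinary batch computation (here, a sum).
totals = [(start, sum(batch)) for start, batch in micro_batches(events, 1000)]
print(totals)
```

The sketch also shows where the latency floor comes from: a result for an interval cannot be produced until that interval has closed, so the batch length bounds how quickly any single event can be reflected in the output.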

Databricks is a cloud-based service built on top of Spark which adds cluster management tools, organization of work spaces, and dashboards.

Lastly, Ryan Smith @Qualtrics was interviewed by Matt Turck. Qualtrics was recently valued in excess of $1B and was started in Provo, Utah, by Ryan and his father. The goal was to make experimental design as easy as possible.

Initially the product was marketed to academics. As the students moved into the workforce, they introduced their companies to the tool. The low-cost model was originally used to entice academic usage, but it has proven important because mid-level corporate users in research could easily budget for it and spread its use.

Ryan talked about how their product fits well with current business models, which stress customer feedback, empowerment of mid-level corporate users to use new technology, and the search for efficient ways to analyze data with fewer analysts.

He also talked about how Qualtrics is building an ecosystem consisting of new ways to use their questionnaires and better ways to analyze and present results.

posted in:  AI, Big data, Data Driven NYC