New York Tech Journal
Tech news from the Big Apple

DataDrivenNYC: #Hadoop, #BigData, #B2B

Posted on December 15th, 2015


12/14/2015 @ Bloomberg, 731 Lexington Ave, NY


Four speakers talked about businesses that provide support for companies exploring large data sets.

M.C. Srivas, Co-Founder and CTO of MapR (Apache Hadoop distribution)

Nick Mehta, CEO of Gainsight (enterprise software for customer success)

Shant Hovsepian, Founder and CTO of Arcadia Data (visual analytics and business intelligence platform for Big Data)

Stefan Groschupf, Founder and CEO of Datameer (Big Data analytics application for Hadoop)


In the first presentation, Shant Hovsepian (@ArcadiaData), CTO of Arcadia Data, spoke about how to use big data most effectively to answer business questions. He summarized his points as 10 commandments for BI on big data:

  1. Thou shalt not move big data – push computation close to the data. YARN and Mesos have made it possible to run a BI server next to the data. Use Mongo to create documents directly. Use native analysis engines.
  2. Thou shalt not steal or violate corporate security policies – use role-based access control (RBAC). Look for unified security models. Make sure there is an audit trail.
  3. Thou shalt not pay for every user or gigabyte – you need to plan for scalability, so be wary of pricing models that penalize you for increased adoption. Your data will grow more quickly than you anticipate.
  4. Thou shalt covet thy neighbor’s visualization – publish PDF and PNG files. Collaboration is needed since no single person understands the entire data set.
  5. Thou shalt analyze thine data in its natural form – read free-form data: FIX format, JSON, tables,…
  6. Thou shalt not wait endlessly for thine results – understand and analyze your data quickly. Methods include building an OLAP cube, creating temp tables, and taking samples of the data.
  7. Thou shalt not build reports; instead, build apps – users should interact with visual elements, not text boxes or dropdown templates. Components should be reusable. Decouple data from the app (as D3 does on web pages).
  8. Thou shalt use intelligent tools – look for tools with search built in. Automate what you can.
  9. Thou shalt go beyond the basics – make functionality available to users.
  10. Thou shalt use Arcadia Data – a final pitch for his company.
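
As an illustration of the sixth commandment's "take samples of the data", here is a minimal Python sketch of reservoir sampling, which draws a fixed-size uniform sample in one pass over a stream without loading it all into memory. The talk did not name a sampling method; the function name `reservoir_sample` and the example records are assumptions made for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream.

    Hypothetical helper: one pass, O(k) memory, stream length unknown.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


if __name__ == "__main__":
    # Made-up records standing in for a large table.
    orders = ({"order_id": n, "amount": n % 50} for n in range(100_000))
    subset = reservoir_sample(orders, k=1000)
    estimate = sum(r["amount"] for r in subset) / len(subset)
    print(len(subset), round(estimate, 1))
```

The average computed on the sample is only an estimate of the full-table average, which is the trade-off the commandment accepts in exchange for a quick answer.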

Next, Stefan Groschupf (@datameer) advocated data-driven decision making and talked about how business decisions should drive the decisions made at the bottom of the hardware/software stack.

Stefan emphasized that businesses often do not know which analytics they will need to run, so designing a database schema is not a first priority (he views a database schema as a competitive disadvantage). Instead, he advocated starting at the hardware (which is the hardest to change later) and designing up the stack.

He contrasted the processing models of YARN and Mesosphere as alternative ways of looking at processing: either grouping processes onto hardware or grouping hardware to handle a single process.

Despite the advances in computing power and software, he views data as getting exponentially more complex, which makes the analyst’s job even harder. Interactive tools are the short-term answer, with deep learning eventually becoming more prominent.

The internet of things will offer even more challenges with more data and data sources and a wider variety of inputs.

Stefan recommended the movie Personal Gold as a demonstration of the power of analytics – Olympic cyclists used better analytics to overcome a lack of funding relative to their better-funded rivals.

In the third presentation, M.C. Srivas (@MapR) talked about the challenges and opportunities of handling large data sets from a hardware perspective.

MapR was born from a realization that companies such as Facebook, LinkedIn, etc. were all using Hadoop and building extremely large databases. These databases created hardware challenges, and the hardware needed to be designed with the flexibility to grow as the industry changed.

M.C. gave three examples to highlight the enormous size of big data:

  1. Email. One of the largest email providers holds 1.5B accounts, with 200k new accounts created daily. Reliability is targeted at six nines (99.9999%) uptime. Recent emails, those from the past 3 weeks, are put on flash storage. After that time they are moved to hard disk. 95% of files are small, but the other 5% often include megabyte-sized images and documents.
  2. Personal Records. 70% of Indians don’t have birth certificates, so these people are off the grid and as a result are open to abuse and financial exploitation. With the encouragement of the Indian government, MapR has built a system in India based on biometrics. They add 1.5 million people per day, so in 2.5 years they expect everyone in India will be on the system.

The system is used for filing tax returns, airport security, access to ATMs, etc. The system replaces visas and reduces corruption.

  3. Streams. Self-driving cars generate 1 terabyte of data per hour per car. Keeping this data requires massive onboard storage and edge processing, alongside the data that is streamed. Long-term retention for purposes such as accident reporting will require large amounts of storage.
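
To put the three examples in perspective, the following Python sketch works through the arithmetic behind the quoted figures. The uptime target, enrollment rate, and per-car data rate come from the talk; the fleet size and daily driving hours in the last call are made-up inputs, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability target of `nines` nines."""
    unavailability = 10 ** (-nines)  # six nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


def total_enrolled(per_day: float, years: float) -> float:
    """People enrolled after `years` at a constant daily rate."""
    return per_day * years * 365


def fleet_terabytes_per_day(cars: int, hours_driven: float,
                            tb_per_hour: float = 1.0) -> float:
    """Raw data produced per day by a fleet of self-driving cars."""
    return cars * hours_driven * tb_per_hour


if __name__ == "__main__":
    print(allowed_downtime_seconds(6))   # seconds of downtime per year
    print(total_enrolled(1.5e6, 2.5))    # people enrolled after 2.5 years
    # Assumed fleet: 1,000 cars driven 2 hours a day (illustrative only).
    print(fleet_terabytes_per_day(1000, 2.0))
```

Six nines leaves a budget of roughly 31.5 seconds of downtime per year, and 1.5 million enrollments a day for 2.5 years comes to about 1.37 billion people, which is consistent with the expectation of covering the whole country.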

He concluded by talking about MapR’s views on open source software. They use open source code and will create open source APIs to fill gaps that they see. They see good open source products, but advise considering whether an open source project is primarily controlled by a single company, to avoid dependency on that company.

The presentations were concluded by Nick Mehta (@Gainsight), who talked about providing value to B2B customers, each of whom has slightly different needs.

As a service provider to customers, there are 5 challenges:

  1. Garbage in, garbage out – “what is going on before the customer leaves” – schemas change and keys are different for different departments, which part of the company are you selling to, new products and other changes over time.
  2. All customers are different – buy different products, different use cases, different amounts spent, customization
  3. Small sample size – too few data points for what you actually want to study
  4. How do you explain what the model does? vs “what percent of the time is this right?”
  5. Ease of driving change? – How do you get them to change their approach? There is a tendency for clients to look for the examples where the machine is wrong rather than looking at the bigger picture.

In response to these challenges, he made 5 suggestions:

  1. Data quality – focus on recent data
  2. Data variance – analyze by segment
  3. Data points (small data) – use leading metrics. Use intuition on possible drivers: consider when user engagement drops off as a warning sign that the user might exit
  4. Model your understanding – compare proposals against random actions
  5. Drive change – track and automate actions
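
Suggestions 2 and 3 can be sketched together in a few lines of Python: treat a drop in engagement as a leading metric and report the flagged accounts by segment. This is only an illustrative sketch, not Gainsight's method; the field names (`name`, `segment`, `weekly_logins`), the 4-week window, and the 50% threshold are all assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def engagement_drop(weekly_logins: List[int], recent_weeks: int = 4) -> float:
    """Fractional drop of the recent average relative to the earlier average."""
    earlier = weekly_logins[:-recent_weeks]
    recent = weekly_logins[-recent_weeks:]
    if not earlier or mean(earlier) == 0:
        return 0.0  # not enough history to compare against
    return 1.0 - mean(recent) / mean(earlier)


def at_risk_by_segment(accounts: List[dict],
                       threshold: float = 0.5) -> Dict[str, List[str]]:
    """Group accounts whose engagement fell by at least `threshold`."""
    flagged: Dict[str, List[str]] = defaultdict(list)
    for acct in accounts:
        if engagement_drop(acct["weekly_logins"]) >= threshold:
            flagged[acct["segment"]].append(acct["name"])
    return dict(flagged)


if __name__ == "__main__":
    # Made-up accounts for illustration.
    accounts = [
        {"name": "A", "segment": "enterprise",
         "weekly_logins": [10, 10, 10, 10, 2, 2, 2, 2]},
        {"name": "B", "segment": "smb",
         "weekly_logins": [5, 5, 5, 5, 5, 5, 5, 5]},
    ]
    print(at_risk_by_segment(accounts))
```

Grouping by segment keeps accounts with very different usage patterns from being judged against a single company-wide baseline.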

Nick concluded by saying that they use a wide variety of tools depending on the customer's needs. Tools include Postgres, Mongo, Salesforce, Amazon Redshift, etc.

