New York Tech Journal
Tech news from the Big Apple

DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.

Posted on June 18th, 2016


06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)


The four speakers were

Adam @NarrativeScience talked about how people with different personalities and jobs may require/prefer different takes on the same data. His firm ingests data and has systems to generate natural language reports customized to the subject area and the reader’s needs.

They currently develop stories with the guidance of experts, but eventually will move to machine learning to automate new subject areas.

Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform which collects data and allows access to these data in real time.


posted in: data, data analysis, Data Driven NYC, Data science, databases, Open source

#DataDrivenNYC: #FaultTolerant #Web sites, #Finance, Predicting #B2B buying behavior, training #DeepLearning

Posted on May 18th, 2016


05/18/2016 @AXA auditorium, 787 7th  Avenue, NY


Four speakers presented:

First, Nicolas Dessaigne @Algolia (subscription service to access a search API) talked about the challenges of building a highly fault-tolerant world-wide service. The steps they took resulted from their understanding of the points of failure within their systems and within the infrastructure their systems depend on.

Initially, they concentrated on their software development process, including failed updates. To overcome these problems, they update one server at a time (with a rack of servers), do partial updates, and use Chef to automate deployment.
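
As a rough illustration of the one-server-at-a-time rollout described above, here is a minimal Python sketch. The `deploy` and `is_healthy` callables are hypothetical stand-ins (the talk mentioned Chef for deployment but gave no details), and halting at the first unhealthy server is an assumption about how a failed update might be contained.

```python
def rolling_update(servers, deploy, is_healthy):
    """Update servers one at a time and stop at the first unhealthy one.

    Returns (updated_servers, failed_server); failed_server is None
    when every server was updated successfully.
    """
    updated = []
    for server in servers:
        deploy(server)  # hypothetical deployment hook
        if not is_healthy(server):  # hypothetical health check
            # Halt here so the remaining servers keep running the old version.
            return updated, server
        updated.append(server)
    return updated, None
```

With this shape, a bad release touches at most one server before the rollout stops.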

Then they migrated their DNS from the .io TLD to the .net TLD to avoid slow response times they had seen intermittently in Asia. This was followed by these upgrades:

Feb 2015. Set up clusters of servers world-wide, so users have a server in their region (lower latency)

March 2015. Physically separate the server clusters within a region across different providers

May 2015. Create fallback DNS servers

July 2015. Put a third data center online to make indexing robust

April 2016. Implement 1-second granularity for their system monitoring

Next, Matt Turck interviewed Louis DiModugno @AXA. In the US, AXA’s main focus is on predictive underwriting in the insurance process. They also have projects to incorporate sensors into products and to route queries to call centers correctly based on the demographics of the customer. World-wide they have three analysis hubs: France, US, Singapore (coming online).

Louis oversees both data and analytics in the U.S. and both he and the CTO report to the CIO.  They are interested in expanding their capabilities in areas such as creating unstructured databases from life insurance data that are currently on microfiche.

In the third presentation, Amanda Kahlow @6Sense talked about their business model, which is to provide information to customers in B2B commerce. They analyze business searches, customer web sites, and visits to publishers’ (e.g. Forbes) web sites. Their goal is to determine the timing of customer purchases.

B2B purchases are different from B2C purchases since

  1. Businesses research their purchases online before they buy
  2. The research takes time (long sales cycle)
  3. The decision to buy involves multiple people within the company

So there are few impulse buys, and buyer behavior signals when a purchase is imminent.

The main CMO question is when (not who).

6sense ties data together across searches (anonymous data). The goal is to identify when companies are in a specific part of the buying cycle, so sales can approach them now. (Example: show click-to-chat when the analytics say that the customer is ready to buy.)

Lastly, Peter Brodsky @HyperScience  spoke about tools they are developing to speed machine learning. These include

  1. Tools to make it easier to add new data sets
    1. Need to match fields, such as dates, which may be in different formats
    2. What to do with missing data
    3. Need labeled data – lots of examples
  2. Speed up training time
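
The field-matching problem mentioned above (dates arriving in different formats) can be sketched in a few lines of Python. This is only an illustration, not HyperScience's tooling: the list of candidate formats is a hypothetical example, and treating `05/18/2016` as month/day/year is an assumption that would need checking per data source.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical formats seen across incoming data sets; extend as needed.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")


def normalize_date(raw: str) -> Optional[date]:
    """Return the date parsed with the first matching format, else None."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    # No format matched: report it as missing data and let the caller decide.
    return None
```

Returning `None` rather than raising keeps the "what to do with missing data" decision in one place downstream.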

The speed-up comes from identifying subnets within the larger neural network, each of which performs a distinct function. To determine whether two subnets (in different networks) are equivalent, move one subnet into the other network in place of its counterpart and see if the function is unchanged: freeze the weights both within the subnet and outside it, then retrain only the interface between the network and the subnet.

This creates building blocks which can be combined into larger blocks. These blocks can be applied to jump start the training process.


posted in: AI, applications, Big data, data analysis, Data Driven NYC, startup

DataDrivenNYC: #OpenSource business models, #databases, man-machine collaboration, text markup

Posted on March 16th, 2016


03/16/2016 @AXA Equitable Center, 787 7th Ave, NY


The two speakers and two interviewees were

Eric Colson @StitchFix (provides fashion to you, but you don’t pick anything) spoke about their hybrid system of selecting fashion for customers. In their business

  1. You create a profile
  2. You get five hand-picked items
  3. You keep what you like, send back the rest

This means that they need to get it right the first time. Computers do preference modeling better than humans, and humans can better understand the idiosyncrasies of other humans, so StitchFix uses both.

They first send the customer’s request for clothes to a recommender system and give the output of the recommender system to a human to provide a check and customize the offering.

Next, Matt Turck interviewed Eliot Horowitz @MongoDB. Mongo was started in 2007 as the founders struggled with the database needs for their new venture. Eventually they decided that the db was the most interesting part of the platform, so they made it open source and shared it with friends, friends of friends, etc.

It was initially implemented using a simple storage engine. But as the user base grew, they needed a more efficient storage engine. So in Summer 2014 they acquired WiredTiger (developers formerly from Berkeley DB) to create the new storage engine. They released it in March 2015 as version 3.0.

In Dec 2015, they released version 3.2 which added encryption, an improved management deployment tool and a join operator.

In Dec 2015 they also started to sell packages such as a BI connector (e.g. connect to Tableau) and Compass (graphical viewer of the data structure).

In 2016 they will add a graph algorithm in version 3.4.

Eliot then talked about ways to monetize open source:

  1. Consulting and support
  2. Tools – e.g. BI connection or Compass – developers are not interested, but businesses are
  3. Cloud services – manage DB, backups, upgrades, etc.

He also mentioned that the open source license matters (they release under the AGPL license), so Amazon can’t resell it as a supported service.

He said that their priorities are

  1. Make current users successful
  2. Make it easy for people to migrate to Mongo
  3. Develop products for the cloud, which is the future of databases

Next, Kieran Snyder @textio talked about how their product helps users create better recruiting notices. She noted that the effectiveness of staff searches often depends on the effectiveness of emails and broadcast job descriptions. Their tool highlights words that either help or hurt the description.

They tagged 15 million job listings according to how many people applied, how long they took to fill, what kinds of people applied, demographic mix, etc. They looked for patterns in the text and developed a scoring system within a word processor that evaluates the text and marks it up:

  1. Score 0-100. In this market with a similar job title, how fast will this role be filled
  2. Green = phrases that drive success up
  3. Red = phrases that drive success down – e.g. an overused term
  4. Look at structure – e.g. formatting (best if 1/3 of content is bulleted) or the percent of “you” vs “we” language
  5. Model gender tone – blue vs purple highlighting
  6. Look for boilerplate
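
Two of the structural signals in the list above (the share of bulleted content and the balance of “you” vs “we” language) are easy to compute. The Python sketch below is an illustration only, not Textio's model: measuring bulleted content by line count, the bullet markers, and the pronoun word lists are all assumptions.

```python
import re

# Assumed word lists and bullet markers, chosen for illustration only.
YOU_WORDS = {"you", "your", "yours"}
WE_WORDS = {"we", "our", "ours", "us"}
BULLET_MARKERS = ("-", "*", "•")


def structural_features(text: str) -> dict:
    """Compute simple structural features of a job description."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bulleted = [ln for ln in lines if ln.startswith(BULLET_MARKERS)]
    words = re.findall(r"[a-z']+", text.lower())
    you = sum(w in YOU_WORDS for w in words)
    we = sum(w in WE_WORDS for w in words)
    return {
        # Fraction of non-empty lines that are bullets.
        "bulleted_fraction": len(bulleted) / len(lines) if lines else 0.0,
        # Share of "you" language among all "you" + "we" words.
        "you_share": you / (you + we) if (you + we) else 0.0,
    }
```

A real scorer would weigh these features against outcome data such as time to fill, which this sketch does not attempt.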

Textio has also been used to detect gender bias in other texts.

The best feedback is from their current clients.

They retrain the model every week since language evolves over time: phrases like ‘big data’ may be positive initially and eventually become negative as everyone uses the term.

Lastly, Matt Turck interviewed Peter Fenton @Benchmark, a west coast venture firm that has backed many winners.

In a wide ranging interview, Peter made many points including

  1. He looks for the entrepreneur to have a “dream about the possible world”
    1. This requires the investor to naively suppress the many reasons why it won’t work
  2. He looks for certain personality attributes in the entrepreneur
    1. “Deeply authentic” with “reckless passion”
    2. Charismatic with an idea that destroys the linear mindset
  3. In the product he looks for
    1. Attributes that are radically differentiated
    2. Product fits the market and the founder fits the market
    3. A structure that is conducive to radical growth
  4. Open source as an investment has two business models
    1. Packaging model – e.g. Red Hat
    2. Open core – needs
      1. Product value
      2. A big enough market to gain platform status. If you want mass adoption, go as long as possible without considering monetization.
  5. Business models
    1. What is the sequence in your business – do you need mass adoption first?
    2. Once mass adoption is achieved you need to protect it.
    3. The biggest issue is how to deal with Amazon, who can take open source and offer it on their platform
    4. Need to license in the right way.
    5. Amazon just needs to be good enough
    6. To succeed in open source you need a direct relationship with developers to protect your business model
  6. Azure has some advantages
    1. New Nvidia GPUs, so they have a performance advantage
  7. Machine learning + big data = can create a new experience, but
    1. In the NBA there is no longer an advantage in the analytics, since everyone hires the same people. Data is the advantage.
    2. If you have a unique pool of information, you can add analytics.
  8. Capital is harder to get now.
    1. One of the main issues is that some businesses are so well capitalized that they currently don’t need to consider their burn rate. This mispricing forces all competitors into the same model of hypergrowth at the expense of sustainable growth and infrastructure improvement.
    2. Unlike 1999-2000, the bubble is driven by institutional money. Therefore it will take longer for the bubble to burst, since institutions can hide the losses for longer (they do not mark investments to the equity market prices).
    3. When valuations eventually drop, many private companies will not go public, but will be acquired by larger companies.
  9. It takes years for an opportunity to be exploited: it took 5 years to go from a camera phone to Instagram. It took years to go from GPS to Uber. It is unclear whether blockchain is in gestation and what it will give rise to (bitcoin is the best application now).


posted in: data analysis, Data Driven NYC, databases, Open source, startup

DataDrivenNYC: #Hadoop, #BigData, #B2B

Posted on December 15th, 2015


12/14/2015 @ Bloomberg, 731 Lexington Ave, NY


Four speakers talked about businesses that provide support for companies exploring large data sets.

M.C. Srivas, Co-Founder and CTO of MapR (Apache Hadoop distribution)

Nick Mehta, CEO of Gainsight (enterprise software for customer success)

Shant Hovsepian, Founder and CTO of Arcadia Data (visual analytics and business intelligence platform for Big Data)

Stefan Groschupf, Founder and CEO of Datameer (Big Data analytics application for Hadoop)


In the first presentation, Shant Hovsepian, CTO @ArcadiaData, spoke about how to most effectively use big data to answer business questions. He summarized his points in 10 commandments for BI on big data:

  1. Thou shalt not move big data – push computation close to data. Yarn, Mesos have made it possible to run a BI server next to the data. Use Mongo to create documents directly. Use native analysis engines.
  2. Thou shalt not steal or violate corporate security policies – use (RBAC) role based access control. Look for unified security models. Make sure there is an audit trail.
  3. Thou shalt not pay for every user or gigabyte – you need to plan for scalability, so be wary of pricing models that penalize you for increased adoption. Your data will grow quicker than you anticipate.
  4. Thou shalt covet thy neighbor’s visualization – publish pdf and png files. Collaboration is needed since no single person understands the entire data set.
  5. Thou shalt analyze thine data in its natural form – read free-form text, FIX format, JSON, tables,…
  6. Thou shalt not wait endlessly for thine results – understand and analyze your data quickly. Methods include building an OLAP cube, creating temp tables, and taking samples of the data.
  7. Thou shalt not build reports; instead build apps – users should interact with visual elements, not text boxes or dropdowns-templates. Components should be reusable. Decouple data from the app (as D3 does on web pages).
  8. Thou shalt use intelligent tools – look for tools with search built in. Automate what you can.
  9. Thou shalt go beyond the basics – make functionality available to users
  10. Thou shalt use Arcadia Data – a final pitch for his company.

Next, Stefan Groschupf @Datameer advocated data-driven decision making and talked about how business decision making should drive decisions made at the bottom of the hardware/software stack.

Stefan emphasized that businesses often do not know the analytics they will need to run, so designing a database schema is not a first priority (he views a database schema as a competitive disadvantage). Instead he advocated starting at the hardware (which is the hardest to change later) and designing up the stack.

He contrasted the processing models for YARN and Mesosphere as alternative ways of looking at processing either by grouping processes into hardware or grouping hardware to handle a single process.

Despite the advances in computer power and software, he views data as getting exponentially more complex, thereby making the analyst’s job even harder. Interactive tools are the short-term answer, with deep learning eventually becoming more prominent.

The internet of things will offer even more challenges with more data and data sources and a wider variety of inputs.

Stefan recommended the movie Personal Gold  as a demonstration of the power of analytics – Olympic cyclists use better analytics to overcome a lack of funding versus their better funded rivals.

In the third presentation, M.C. Srivas @MapR talked about the challenges and opportunities of handling large data sets from a hardware perspective.

MapR was born from a realization that companies such as Facebook, Linkedin, etc. were all using Hadoop and building extremely large data bases. These data bases created hardware challenges and the hardware needed to be designed with the flexibility to grow as the industry changed.

M.C. gave three examples to highlight the enormous size of big data

  1. Email. One of the largest email providers holds 1.5B accounts with 200k new accounts created daily. Reliability is targeted at six nines (99.9999%) uptime. Recent emails, those from the past 3 weeks, are put on flash storage. After that time they are put onto a hard disk. 95% of files are small, but the other 5% often include megabyte images and documents.
  2. Personal Records. 70% of Indians don’t have birth certificates, so these people are off the grid and as a result are open to abuse and financial exploitation. With the encouragement of the Indian government, MapR has built a system in India based on biometrics. They add 1.5 million people per day, so in 2.5 years they expect everyone in India will be on the system.

The system is used for filing tax returns, airport security, access to ATMs, etc. The system replaces visas and reduces corruption.

  3. Streams. Self-driving cars generate 1 terabyte of data per hour per car. Keeping this data requires massive onboard storage and edge processing along with data streamed. Long-term storage for purposes of accident reporting etc. will require large amounts of storage.

He concluded by talking about MapR’s views on open source software. They use open source code and will create open source APIs to fill gaps that they see. They see good open source products, but advise considering whether an open source project is primarily controlled by a single company, to avoid dependency on that company.

The presentations were concluded by Nick Mehta @Gainsight talking about providing value to B2B customers, each of whom has slightly different needs.

As a service provider to customers, there are 5 challenges

  1. Garbage in, garbage out – “what is going on before the customer leaves” – schemas change and keys are different for different departments, which part of the company are you selling to, new products and other changes over time.
  2. All customers are different – buy different products, different use cases, different amounts spent, customization
  3. Small sample size – too few data points for what you actually want to study
  4. How do you explain what the model does? vs “what percent of the time is this right?”
  5. Ease of driving change? – How do you get them to change their approach? There is a tendency for clients to look for the examples where the machine is wrong rather than looking at the bigger picture.

In response to these challenges, he made 5 suggestions

  1. Data quality – focus on recent data
  2. Data variance – analyze by segment
  3. Data points (small data) – use leading metrics. Use intuition on possible drivers: consider when user engagement drops off as a warning sign that the user might exit
  4. Model your understanding – compare proposals against random actions
  5. Drive change – track and automate actions
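
The third suggestion above (treat a drop in user engagement as a leading warning sign) could be sketched as follows in Python. This is a hedged illustration rather than Gainsight's method: weekly login counts as the engagement metric, the four-week window, and the 50% drop ratio are all hypothetical choices.

```python
def engagement_dropped(weekly_logins, recent_weeks=4, drop_ratio=0.5):
    """Flag an account whose recent average engagement falls below
    drop_ratio times its earlier average.

    weekly_logins is a list of counts, oldest first (assumed metric).
    """
    if len(weekly_logins) <= recent_weeks:
        return False  # not enough history to form a baseline
    earlier = weekly_logins[:-recent_weeks]
    recent = weekly_logins[-recent_weeks:]
    baseline = sum(earlier) / len(earlier)
    if baseline == 0:
        return False  # never engaged, so there is nothing to drop from
    return (sum(recent) / len(recent)) < drop_ratio * baseline
```

Because the signal needs only a few weeks of data per account, it can be used even when the number of actual customer exits is too small to model directly.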

Nick concluded by saying that they use a wide variety of tools depending on the customer needs. Tools include Postgres, Mongo, Salesforce, Amazon Redshift, etc.


posted in: Big data, data, data analysis, Data Driven NYC, databases

DataDrivenNYC: computer #security, #DeepLearning, distributing #tweets to web sites, customizing user experience

Posted on October 13th, 2015


10/12/2015 @Bloomberg, 731 Lexington Ave, NY


This month’s speakers were

Liz Crawford, CTO of Birchbox (discovery commerce platform for beauty products)

Richard Socher, Founder and CEO of MetaMind (artificial intelligence enterprise platform for automating image recognition and language understanding)

Ramana Rao, CTO of Livefyre (real-time content marketing and engagement platform)

Oren Falkowitz, Founder and CEO of Area 1 Security (provides visibility into the next generation of unknown, sophisticated targeted attacks)


Oren Falkowitz @Area1Security talked about his company’s approach to computer security. The traditional approach to network/computer security uses layers of defense including firewalls, passwords, etc.

Area 1 Security starts from the premise that 97% of all attacks originate through phishing and that these attacks can circumvent most traditional defenses. In addition, these attacks can mine your data for months until they are discovered. In contrast to the traditional approach, they strive to better understand the attacker’s behaviors and look for the telltale network and external usage patterns that indicate attempts to probe your network.

Oren talked about how intrusions often are periodic and from specific parts of the world wide web. By looking within the network and for usage patterns across the entire www, they hope to detect intrusions quickly using

  1. Visibility (big data)
  2. Detection (deep learning)

He closed by displaying a map showing a snapshot of all activity world-wide on the web.

Next, Richard Socher @MetaMind spoke about how MetaMind brings deep learning tools to companies. MetaMind applies deep learning to vision and language. He demonstrated how the tool classifies images based on a standard vocabulary, but can also be taught new classifiers using a drag and drop web interface.

For instance, to create a classifier for BMW vs Audi vs Tesla, he dropped three sets of images into slots on the interface and the classifier (powered by GPUs) used these images to evaluate new images.

Other uses of this technology include:

  1. Examine each frame of a game video and find your company logo.
  2. Find your company logo on social media
  3. In diabetic retinopathy, classify images without needing people who have spent years learning how to read the scans

They also have tools that do natural language understanding using deep learning algorithms. These can predict the sentiment of a sentence and extract the main actors and themes in that sentence.

Richard addressed the question of why deep learning has only recently gained traction after existing for decades with little commercial interest:

  1. Enough large data sets are now available
  2. Larger models can be created due to faster machines – GPUs, multi-core CPUs
  3. Lots of small algorithmic advances over the years

Ramana Rao @Livefyre talked about his company, which provides real-time tweets (and other internet updates) to web sites. They capture social media, organize it, and publish it.

They create walls showing the tweets that are being sent during a conference. They insert tweets to actively change web pages. These live updates encourage viewers to stay longer on the site which increases overall engagement and develops greater affinity for the brand.

Ramana illustrated the uses of these updates to create an “earnings wall” for Wall Street pages, show social media comments during a TV show, or display tweets during presidential debates.

Analytics tools show the total amount of time spent on a site. They also show types of behavior as users interact in different ways by noting likes, or posting comments.

The large volume of data (currently 350 million global uniques each month) requires a large system which they run on Amazon Web Services.

The large volume of communications also requires them to employ various filters. Their spam/abuse filters use word lists and regular expressions to detect specific patterns.  They also detect spam using bulk detection to look for multiple repeated messages. They also filter out nudity.
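
The three filtering ideas in the paragraph above (word lists, regular expressions, and bulk detection of repeated messages) can be combined in a short Python sketch. This is not Livefyre's filter: the blocked words, the repeated-character pattern, and the bulk threshold are hypothetical values chosen only to make the example concrete.

```python
import re
from collections import Counter

BLOCKED_WORDS = {"freemoney", "clickhere"}      # hypothetical word list
BLOCKED_PATTERNS = [re.compile(r"(.)\1{5,}")]   # hypothetical: 6+ repeats of one character
BULK_THRESHOLD = 3                              # hypothetical: same text seen 3+ times


def normalize(message):
    """Lower-case and collapse whitespace so near-identical copies compare equal."""
    return " ".join(message.lower().split())


def filter_spam(messages):
    """Return the messages that pass the word-list, regex, and bulk checks."""
    counts = Counter(normalize(m) for m in messages)
    kept = []
    for message in messages:
        norm = normalize(message)
        tokens = set(re.findall(r"[a-z0-9]+", norm))
        if tokens & BLOCKED_WORDS:
            continue  # word-list hit
        if any(p.search(norm) for p in BLOCKED_PATTERNS):
            continue  # regular-expression hit
        if counts[norm] >= BULK_THRESHOLD:
            continue  # bulk detection: many copies of the same message
        kept.append(message)
    return kept
```

Normalizing before counting is what lets the bulk check catch copies that differ only in case or spacing.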

The fourth speaker, Liz Crawford @Birchbox spoke about Birchbox’s use of analytics. Birchbox is a beauty retailer which has a retail and online presence, but specializes in a personalized package of items they send to subscribers every month.

To create the personalized package and better market to their customers, they have a staff of data scientists and statistical analysts:

  1. Data scientist = Ph.D. who can write production code, embed into product development teams
  2. Statistical analysts = focus on analytics. Specialized into business concerns.
  3. Analysts = junior statistical analysts
  4. Data engineering (warehouse) embedded with platform engineering.

In this structure, data scientists work on methods to better use the user profiles to personalize the box subscribers receive each month. The problem becomes an integer programming optimization problem since subscribers never receive the same thing twice and there are a limited number of specific samples available each month. To further complicate the optimization, customers can request one item each month. The data scientists use the Gurobi solver.
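
To make the two constraints concrete (no repeat samples, limited stock), here is a toy Python sketch. It is heavily simplified and is not Birchbox's formulation: each subscriber gets a single sample, the preference scores are invented inputs, and a brute-force search over all assignments stands in for an integer programming solver such as Gurobi, so it is only practical for tiny instances.

```python
from itertools import product


def best_assignment(scores, stock, history):
    """Pick one sample per subscriber to maximize total preference score.

    scores[sub][sample] -> preference score (hypothetical input)
    stock[sample]       -> units available this month
    history[sub]        -> set of samples the subscriber already received
    """
    subscribers = sorted(scores)
    samples = sorted(stock)
    best_total, best_plan = None, None
    for choice in product(samples, repeat=len(subscribers)):
        pairs = list(zip(subscribers, choice))
        # Constraint 1: a subscriber never receives the same sample twice.
        if any(s in history.get(sub, set()) for sub, s in pairs):
            continue
        # Constraint 2: do not hand out more units than are in stock.
        if any(choice.count(s) > stock[s] for s in samples):
            continue
        total = sum(scores[sub][s] for sub, s in pairs)
        if best_total is None or total > best_total:
            best_total, best_plan = total, dict(pairs)
    return best_plan, best_total
```

The search space grows exponentially with the number of subscribers, which is why a dedicated solver is the practical choice at production scale.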

They also make customized recommendations online.

They have outlets on web, offline, brick & mortar, apps, social media. They track each customer’s touches on all outlets to better understand customer preferences to get the right message to the right customer.


  1. You can build up your data science practice over time
  2. Data science and product development work well when organically partnered
  3. Don’t be limited by the data you have today – get the data you need

They use other standard market analytics, but they built their internal analytics so they can focus on their core competency. Early in their growth process they taught all their staff to use SQL to better understand their data. She noted that the U.S. rollout of SQL went well, but the European rollout had its challenges.

posted in: applications, data, data analysis, Data Driven NYC, Media, startup

Data Driven NYC: improving #CustomerService, #pro-bono data analysis, analyzing #video

Posted on May 27th, 2015


05/26/2015 @ Bloomberg, Lexington Ave, NY


Four speakers talked about their approaches to data:

The first speaker, David Luan @ Dextro described his company which uses deep learning techniques to summarize and search videos by categories and items appearing in the videos.  The goal is to create a real-time automated method for understanding how consumers view videos.

They describe the video by a Salience graph showing the important themes in the video and a time line of when concepts/items are displayed.

Analysis of video is complicated as items are embedded in a context and information needs to be summarized at the correct level (not too low, such as there are ice skates, seats, lights, etc., but at the level of understanding that this is a specific hockey game). They also aim to use motion cues to give context to the items and segment the video into meaningful chunks.

They work with a taxonomy provided by the customer to create models based on the units wanted by the customer.

David talked about the challenges of speeding the computation using GPUs and how they eventually will incorporate metadata and the sound track.

The second speaker, Sameer Maskey @FuseMachines talked about how they use data science analysis to improve customer service.

He talked about the treasure trove of data generated in prior customer service interactions. These can be analyzed to improve the customer experience by

  1. Improving the ability of customers to find solutions using self service
  2. Empowering customer service reps with tools that anticipate the flow of the conversation

Sameer mentioned several ways that this information can assist in these tasks:

  1. Expose information embedded in documents
  2. Consider what the user is looking at and predict the types of questions that the user will ask.
  3. Train customer service reps using previous conversations. New reps talk to the system and see how the system responds.
  4. On a call, the system automatically brings up documents that might be needed.

Three fundamental problems are important

  1. Data to score – ranks answers
  2. Data to classes/labels – predict answer type
  3. Data to cluster – cluster topics

They currently do not have the sophistication to ask for further clarification or start a dialog. For example, “when is my next garbage collection?” should be answered with the question, “what is your location within the city?”

Jake Porway @DataKind spoke about his program to use data for the greater good.

DataKind brings in pro bono data scientists to improve non-profits’ understanding of their data. They have had 10,000 analysts working on hundreds of projects. Projects include:

  1. org – kick starter for NYC public schools soliciting online donations. Applied semantics3 to automate the taxonomy. Can determine which types of schools ask for what types of categories of goods/services.
  2. Crisis Text Line – teens text if they are in need – note that 5% of users take up 40% of all services. Created a predictive model of when someone will become a repeat texter so they can intervene more quickly.
  3. GiveDirectly – money to the poorest places in Kenya & Uganda. Check thatch vs. iron roofs to determine which communities are the poorest – build a map of types of roofs in different communities by analyzing satellite imagery. Jake talked about the limitations of this method and how refining the specifications is part of the process.

Jake said they have recently set up a centralized project group that can initiate its own projects.

The last speaker, Mahamoud El-Assir @Verizon, talked in very general terms about how Verizon is leveraging data analysis to improve customer experience. He talked about how information about the various channels and services can be used to better match advertising and advice to customer needs:

  1. Talking to customers – rep can consider the TV, Data and equipment usage.
  2. Supervisors coach their agents in real time – types of calls and the resolution on calls
  3. Shift to cross-channel analysis (from silos for particular products)

posted in: Data Driven NYC, video

DataDrivenNYC: #LawEnforcement, #DataWarehouse on the #Cloud, #GraphDatabase, #Productivity tool

Posted on April 14th, 2015


04/14/2015 @Bloomberg, 731 Lexington Av. 7th Floor, New York, NY

The four speakers were


The first speaker, Howie Liu @Airtable, previously founded a CRM company that was eventually sold to Salesforce, but he realized that he still needed to supplement the CRM with spreadsheets used as makeshift databases. Airtable has the goal of creating a spreadsheet that can be accessed in real time from desktop and mobile devices.

The emphasis is on the flexibility of the spreadsheet as database as opposed to calculations. Much of the engineering has gone into giving real-time updates to everyone viewing the sheet and making the sheet so it looks like a series of linked cards when viewed on a mobile device. They also built facilities to easily create links between fields in the spreadsheet as one would fields in a database.

He urged everyone to try Airtable since it is free up to a given number of records.


Next, Scott Crouch & EJ Bensing talked about their experience cleaning crime databases for police departments. Scott talked about the complexity of the problem

  1. Large number of fields in each arrest record: e.g. a police officer might need to fill in 344 fields for a domestic assault with a gun. More complicated arrests can have up to 3000 fields.
  2. Numerous records are duplicated, making it difficult to assemble a suspect’s complete record while avoiding over-reporting. As many as 40% of the 5 million people in Washington D.C. arrest records may be duplicates.
  3. The data must be merged carefully to avoid falsely assigning arrests to the wrong individual. This requires
    1. A full audit trail of all record mergers
    2. Oversight and approval of the police department
    3. The ability to reverse an incorrect merger
  4. Data security is paramount, so they rely on the work done at Amazon to create a U.S. government compliant database

EJ talked about how they created quasi-identifiers to group records and how they use feedback loops to improve the fields they use in the quasi-identifiers and their computation of identifier-to-identifier distance measures.
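
The quasi-identifier idea can be illustrated with a small Python sketch: group records by a coarse key, then compare records only within each group using a distance-like score. This is an assumption-laden illustration, not the speakers' system: the field names, the choice of surname initial plus birth year as the quasi-identifier, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def quasi_identifier(record):
    """Hypothetical blocking key: surname initial plus birth year."""
    return (record["last_name"][:1].upper(), record["birth_date"][:4])


def similarity(a, b):
    """Name similarity in [0, 1]; higher means more alike."""
    name_a = f'{a["first_name"]} {a["last_name"]}'.lower()
    name_b = f'{b["first_name"]} {b["last_name"]}'.lower()
    return SequenceMatcher(None, name_a, name_b).ratio()


def candidate_duplicates(records, threshold=0.85):
    """Return pairs of record ids that look like the same person.

    The pairs are only candidates for human review, not automatic merges.
    """
    blocks = defaultdict(list)
    for record in records:
        blocks[quasi_identifier(record)].append(record)
    pairs = []
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if similarity(group[i], group[j]) >= threshold:
                    pairs.append((group[i]["id"], group[j]["id"]))
    return pairs
```

Returning candidates rather than merging directly fits the requirements listed above: every merge still needs approval, an audit trail, and a way to be reversed.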

The next speaker was Bob Muglia @Snowflake, which creates a data warehouse on the public cloud. In this way, companies can continue to grow their warehouses while taking full advantage of the cloud:

  1. Separation of storage and computation
  2. Data can be either structured (typical for data warehouses) or semi-structured (machine-generated data which may have a list structure)
  3. Incorporation of best practices for data security
    1. 2-factor authentication
    2. data encryption
    3. granular access control
    4. process & procedure
    5. audit and certification
  4. Advantages of working in a public cloud
    1. Software is up to date
    2. Online upgrade
    3. Scalable logging and audit services
  5. Faster warehouse setup
    1. Data are stored in standard S3 Amazon storage
    2. The warehouse is spun up only on demand


Emil Eifrem @Neo4J & NeoTechnology was the concluding speaker. Neo4J replaces the idea of a database filled with tables with a model in which nodes are connected by edges in a graph.

Emil described how thinking about the interconnections of the data points is more powerful than just looking at the points in isolation. He cited this as the crucial advantage that allowed some companies to overtake and dominate other companies with similar data:

  1. First generation search engines such as Excite, Altavista and Lycos collected data from all web sites, but Google also considered the linkages across web sites
  2. collected resumes, but LinkedIn also collected your professional network
  3. Banks collect consumer credit information, but PayPal also collects information on your network and the creditworthiness of individuals in that network.

He then showed how using linkages can lead to a competitive advantage in other industries such as telecommunications, package delivery, etc.

Emil then gave a short demonstration of the power of a graph database on a small database using the Cypher query language.

He concluded by saying that graphs consisting of nodes (e.g. people) connected by edges (e.g. “loves”) are easy to conceptualize, yet are a powerful tool for improving business efficiency.
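To make the node-and-edge model concrete, here is a minimal sketch in plain Python. It is not Neo4J's API and not the Cypher demo from the talk; the `Graph` class, the "knows" label, and the names are all invented for illustration. It shows the kind of question (who is two hops away from me?) that is awkward with tables but natural on a graph.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: nodes joined by labeled, directed edges."""

    def __init__(self):
        self._out = defaultdict(list)  # node -> [(label, neighbor), ...]

    def add_edge(self, source, label, target):
        self._out[source].append((label, target))

    def neighbors(self, node, label):
        """Nodes reached from `node` by one edge carrying `label`."""
        return [t for (edge_label, t) in self._out.get(node, []) if edge_label == label]

    def two_hops(self, node, label):
        """Nodes exactly two `label` hops away, excluding the start and direct neighbors."""
        direct = set(self.neighbors(node, label))
        found = set()
        for middle in direct:
            for far in self.neighbors(middle, label):
                if far != node and far not in direct:
                    found.add(far)
        return sorted(found)


g = Graph()
g.add_edge("Ann", "knows", "Bob")
g.add_edge("Ann", "knows", "Dan")
g.add_edge("Bob", "knows", "Cara")
g.add_edge("Dan", "knows", "Cara")
g.add_edge("Cara", "knows", "Ann")
second_degree = g.two_hops("Ann", "knows")  # people Ann might want to meet
```

A graph database stores and indexes these relationships directly, so traversals like `two_hops` do not require joining tables.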

posted in:  applications, Data Driven NYC, databases    / leave comments:   1 comment

DataDrivenNYC: making #BigData and #SmartData easier

Posted on March 18th, 2015

#DataDriven NYC

03/17/2015 @Bloomberg, 731 Lexington Ave. 7th Floor, New York, NY

The four presenters were

Paul Dix @influxDB first talked about why time series data, such as sensor data, dev ops and financial time series, are best held in a database specially designed for such data. The main argument is that general-purpose databases can handle time series data, but don’t scale as well as a specialized database. As the data grows in generic databases, the user either faces slower queries or is limited to the analysis of summary data. The alternative is to greatly increase the complexity of the supporting code to run the infrastructure.

InfluxDB provides the needed specialized infrastructure. The interface is a SQL-like query language which returns JSON format results. The database handles regular and irregular time series. It also allows you to write queries to determine the structure of the data.

InfluxDB is written in Go and, unlike KDB, is open source (as opposed to just being free).

Next, Ben Medlock @SwiftKey talked about how SwiftKey built the world’s smartest keyboard, which is used on many smartphones and tablets. The goal is to improve the accuracy of keyboarding by using AI. Ben split the processes into three categories:

  1. context – next word prediction, language detection
  2. input – error correction using the proximity of keys on the keyboard
  3. prior knowledge

Language detection and next word prediction were originally based on a one-trillion-word, multi-language database. SwiftKey created several independent models of word distribution which are amalgamated to create a set of best guesses for the next word.

The keyboard model is based on the probability that a neighboring key is struck in error. Over time, the parameters of the Gaussian distribution of errors around each key become unique to each user.
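A rough sketch of how such a keyboard model could work is shown below. It is not SwiftKey's implementation: the key coordinates, the single shared `sigma`, and the idea of multiplying the touch likelihood by a next-letter prior are assumptions chosen to illustrate a Gaussian error model around each key. A per-user model would learn a separate mean and spread for each key.

```python
import math

# Hypothetical key centers on a unit grid (x, y); a real layout has many more keys.
KEY_CENTERS = {"q": (0.0, 0.0), "w": (1.0, 0.0), "e": (2.0, 0.0)}


def gaussian_likelihood(touch, center, sigma):
    """Isotropic 2-D Gaussian density of a touch point around a key center."""
    dx = touch[0] - center[0]
    dy = touch[1] - center[1]
    norm = 2 * math.pi * sigma ** 2
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2)) / norm


def key_posterior(touch, prior, sigma=0.5):
    """Combine the touch likelihood with a prior over keys (e.g. from a language model)."""
    scores = {
        key: gaussian_likelihood(touch, center, sigma) * prior.get(key, 0.0)
        for key, center in KEY_CENTERS.items()
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("prior assigns zero probability to every key")
    return {key: score / total for key, score in scores.items()}


uniform = {"q": 1 / 3, "w": 1 / 3, "e": 1 / 3}
context = {"q": 0.90, "w": 0.05, "e": 0.05}  # context strongly suggests "q"
touch = (0.6, 0.0)  # slightly closer to "w" than to "q"
by_position = key_posterior(touch, uniform)
by_context = key_posterior(touch, context)
```

With a uniform prior the nearest key wins, while a strong contextual prior can override a slightly misplaced touch, which is the error-correction behavior described above.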

Ben also talked about the need to place a profanity/obscenity filter on the vocabulary and how SwiftKey had recently built the new interface for Stephen Hawking’s speech generation tool. He also noted that his original work was on spam filtering software.

Next, Ion Stoica @databricks was interviewed by Matt Turck. Ion talked about how his initial work on distributing video over the internet at @Conviva led to his search for real-time analysis tools for #BigData. In 2007 Hadoop had just come out, but it was a batch process and therefore not suitable for real-time work. He felt that Hadoop set a standard for handling storage and resources, but the computation layer could be improved. From this, Spark was developed. Spark brings the following advantages to the analysis of very large databases:

  1. Faster – optimized in-memory processing
  2. Easy to use API – easier to write applications – has 100 operators beyond just map and reduce
  3. Unified platform – supports streaming, interactive, query processing, graph based computation

Furthermore, analysis of streaming data can be done using a sequence of micro-batch jobs.

Spark also provides out-of-the-box fault tolerance from the micro-batch framework with minimum latencies in the hundredths of seconds. (But its architecture makes it hard to get the single-digit millisecond latency needed by certain applications such as high frequency trading).
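The micro-batch idea can be illustrated with a few lines of plain Python. This is a conceptual sketch only and does not use Spark's API: it cuts batches by event count, whereas a real streaming engine would typically cut them by time interval, and the function names are invented for the example.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to `batch_size` events from an iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_word_counts(stream, batch_size=3):
    """Treat each micro-batch as a small batch job and update running totals."""
    totals = {}
    for batch in micro_batches(stream, batch_size):
        for word in batch:
            totals[word] = totals.get(word, 0) + 1
        yield dict(totals)  # snapshot of the totals after each batch


snapshots = list(running_word_counts(["a", "b", "a", "c", "a"], batch_size=2))
```

Because results only appear once a batch closes, latency cannot drop below the batch interval, which is why this design struggles with single-digit millisecond requirements.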

Databricks is a cloud-based service built on top of Spark which adds cluster management tools, organization of work spaces, and dashboards.

Lastly, Ryan Smith @Qualtrics was interviewed by Matt Turck. Qualtrics was recently valued in excess of $1B and was started in Provo, Utah by Ryan and his father. The goal was to make experimental design as easy as possible.

Initially the product was marketed to academics. As the students moved to the work force, they introduced their companies to the tool. The low-cost model was originally used to entice academic usage, but has proven important as middle level corporate users in research could easily budget for it and spread its use.

Ryan talked about how their product fits well in the current business models which stress customer feedback, empowerment of middle level corporate users to use new technology, and the search for efficient ways to analyze data with fewer analysts.

He also talked about how Qualtrics is building an ecosystem consisting of new ways to use their questionnaires and better ways to analyze and present results.

posted in:  AI, Big data, Data Driven NYC    / leave comments:   No comments yet

DataDrivenNYC: #BigData and field-specific knowledge

Posted on January 14th, 2015


01/13/2015 @Bloomberg

Five presenters talked about a variety of topics with an emphasis on how large data sets require #DataScientists using big data techniques working in conjunction with experts in the specific field of study.

In the first presentation, Shankar Mishra spoke about the steps used by @TheLadders to efficiently match candidates with job openings.  Candidate resumes are split into

  1. Profile – who I am
  2. Preferences & Activities – what do I want

These characteristics are matched against job descriptions (such as salary and title) to compute the distance between candidates and each job opening. He showed how the candidate self-description and job openings change over time. For instance, the number of job descriptions containing the word “subprime” peaked in 2007 while the number of resumes containing the same word peaked in 2008.

Michael Karasick @IBMWatson talked about some of the initiatives taken by IBM to commercialize the technology behind its Jeopardy-conquering Watson engine. These include:

  1. January 2014: created a separate entity to market Watson
  2. Created a $100M investment fund – startups apply to enroll in the Watson Ecosystem
  3. 3 minority investments in startups
  4. Sponsored curriculums (and contests) in 10 schools to use the tool

The Watson product concentrates on IBM’s 40 years of work on Natural Language Processing and Machine learning. NLP is emphasized since IBM has experience in machine translation, Turing task, open domain Q&A in real time. Also, most of the world’s data is unstructured.

Watson/Jeopardy has been expanded from lookup of factoids to a general linguistic pipeline for other types of Q & A. Applications so far have been concentrated in a limited number of fields such as life sciences, homeland security, oncology, finance, and recipe invention. In many of these fields, the Q&A process has been expanded into discovery methods.

Cancer research has been conducted at:

  1. M.D. Anderson in Texas – clinical support for cancer patients. Patient records are fed in and matched to the literature and trained cases.
  2. Sloan Kettering – examine cases, both synthetic and published, to identify the evidence. In this case, explanations are presented along with search results, as people want to know why the system drilled down into the particular evidence reported.

Alexander Gillet @HowGoodRatings spoke about his company’s efforts to rate the sustainability of food products on grocery shelves. Ratings are based on 60 factors including environmental and social impacts of creating, processing and delivering particular food brands to the store. At several target supermarkets, ratings of up to three “stars” are placed on the tags below the item shelf.

By examining loyalty card data, they show that a great rating tends to increase dollar sales. He did not go into whether the increased revenues resulted from a higher volume of product sold or increased prices charged by the retailers for a product that consumers would consider a premium product.

Hanna Wallach @MsftResearch and UMass Amherst spoke about her research in computational social science to study social processes. She said that people are concerned about their privacy when big data is analyzed, since it is based on human behavior observed at a very granular level. Her comments have always been true of any social research, even before the wide availability of big data, but bear emphasizing as more researchers and companies make use of these approaches.

She split the research into four parts: Data, questions, methods, findings. Within each of these parts she emphasized the need to have social scientists involved in the analysis as well as experts in computer science and math/statistics. These disciplines are all needed in good research to avoid emphasis on narratives and intuitions that may not be backed by previous research.

She also talked about the random nature of the behavior sampled and recommended presenting the uncertainty of the conclusions (such as confidence intervals) as well as the conclusions themselves.
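As a simple illustration of reporting uncertainty alongside a conclusion, the sketch below computes a normal-approximation confidence interval for a sample mean. It is not taken from the talk; the function name and the sample values are made up, and the approximation assumes a reasonably large, independent sample.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Normal-approximation interval for the mean (z=1.96 gives roughly 95% coverage)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    return mean - z * std_err, mean + z * std_err


# Hypothetical survey scores on a 1-7 scale.
scores = [4, 5, 6, 5, 4, 6, 5, 5]
low, high = mean_confidence_interval(scores)
```

Reporting the pair `(low, high)` together with the mean tells the reader how much the estimate could move if the behavior were sampled again.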

Finally, Chris Wiggins, chief data scientist at NYTimes and @Columbia U., talked about his experience bringing new data science methods into a journalistic environment. He talked about needing to respect the craft of journalism and that data scientists need to be good listeners. He mentioned the challenges of bringing DS methods to the newsroom as highlighted by the BuzzFeed release of an internal evaluation of the NY Times and the friction due to the change in ownership of the New Republic.

He also talked about the many ways that people think about the concept of ‘computer literacy’ and how it boils down to how one develops a critical understanding of the limits and beneficial uses of computers in specific contexts.

posted in:  Big data, data, data analysis, Data Driven NYC    / leave comments:   No comments yet

Data Driven NYC: the future of #DataScience

Posted on December 17th, 2014


12/16/2014 @Bloomberg, 731 Lexington Ave, NY

After Michael Li @TheDataIncubator talked about the skills needed by data scientists to transition from academia to industry, Matt Turck interviewed two pioneers in the use of big data and deep learning:

Professor Yann LeCun, Director of AI Research at Facebook (Professor LeCun is world-renowned for his work in machine learning and computer vision)

Mike Olson, Co-Founder and Chairman of Cloudera (one of the key companies of the Hadoop/Big Data ecosystem; has raised $1.2 Billion to date)


Michael Li runs The Data Incubator, a program that trains scientists and engineers with advanced degrees to work as data scientists.  A screening exam is part of the admissions process and he showed findings from these tests. Among his findings were

  1. Many students who claim to know Python programming or linear regression struggle to correctly solve the exam problems
  2. Graduates of name universities only marginally outperform graduates from “lesser” schools.
  3. Math and economics majors performed best on the exam.

Mike Olson @Cloudera talked about his experiences prior to founding Cloudera, which included dropping out of Berkeley several times, working for Mike Stonebraker, working on Berkeley DB, and being the CEO of Sleepycat.

Mike spoke about the initial industry misunderstandings about the purpose of map-reduce until it was realized that Hadoop was not built to solve the problems that databases were solving.

He next spoke about how map-reduce is evolving and how Cloudera is positioned as the technology develops. Key to their positioning is that map-reduce is open source, but Cloudera retains proprietary tools to manage the process. This means that the underlying big-data technology will advance quickly from contributions throughout the industry. The admin tools give them a revenue stream, thereby avoiding the price compression faced by a company with a completely open-source system (such as Sleepycat).

In addition, Cloudera sees itself as providing the platform whose usage will grow as new applications are created. Therefore it is in their interest to encourage this application layer and not compete with the application developers. By encouraging a rich ecosystem of applications and an industry-wide foundation, Cloudera grows the application of big-data which increases the demand for Cloudera’s services.

Mike also mentioned why they sold a meaningful stake of the company to Intel. This sale creates a mutually beneficial partnership as well as providing additional defenses against potential acquirers such as IBM.

He sees the demand for big-data analytics exploding with the biggest increase coming in the demand for new applications.

Yann LeCun @Facebook spoke about his work on neural nets over the past 30 years. He described how neural nets evolved as new hardware and data allowed convolutional networks to first handle handwriting recognition and eventually visual and speech recognition to reach the current successes achieved by Deep Learning algorithms.

Going forward, he sees more and more of our interactions mediated by intelligent processes. At Facebook, algorithms decide which items will be most interesting for us to read. Some of the challenges he sees are creating algorithms to understand the structure of language. Unsupervised learning is also a challenge.

His closing observations were about whether there should be some limits on the usage of artificial intelligence and the autonomy of AI systems to initiate actions.

posted in:  data analysis, Data Driven NYC, databases    / leave comments:   1 comment