New York Tech Journal
Tech news from the Big Apple

DataDrivenNYC: #LawEnforcement, #DataWarehouse on the #Cloud, #GraphDatabase, #Productivity tool

Posted on April 14th, 2015

DataDrivenNYC

04/14/2015 @Bloomberg, 731 Lexington Av. 7th Floor, New York, NY

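The four talks were given by Howie Liu (Airtable), Scott Crouch and EJ Bensing (on cleaning crime databases for police departments), Bob Muglia (Snowflake), and Emil Eifrem (Neo4j).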

The first speaker, Howie Liu @Airtable, previously founded a CRM company that was eventually sold to Salesforce. Even so, he found that he still needed to supplement the CRM with spreadsheets used as makeshift databases. Airtable aims to create a spreadsheet that can be accessed in real time from desktop and mobile devices.

The emphasis is on the flexibility of the spreadsheet as a database rather than on calculations. Much of the engineering has gone into delivering real-time updates to everyone viewing the sheet and into rendering the sheet as a series of linked cards on a mobile device. They also built facilities to easily create links between fields in the spreadsheet, as one would link fields in a database.
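
To make the idea of linking fields between sheets concrete, here is a minimal sketch in plain Python. It is not Airtable's API or data model; the `Table` class, the `resolve` helper, and the sample rows are all hypothetical and only illustrate how a cell in one sheet can hold references to rows in another.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """A sheet whose rows are records keyed by a record id (hypothetical model)."""
    name: str
    records: dict = field(default_factory=dict)

    def add(self, record_id, **fields):
        # Store the row's cells as a plain dict of field name -> value.
        self.records[record_id] = fields


def resolve(table, record_id, link_field, target):
    """Follow a link field from one record to the records it points at."""
    linked_ids = table.records[record_id].get(link_field, [])
    return [target.records[i] for i in linked_ids]


# Hypothetical data: each contact row links to rows in a companies sheet.
companies = Table("Companies")
companies.add("c1", name="Acme", city="New York")

contacts = Table("Contacts")
contacts.add("p1", name="Dana", company=["c1"])

linked = resolve(contacts, "p1", "company", companies)
```

Storing record ids rather than copies of the linked row means an update to the company row is visible from every contact that links to it.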

He urged everyone to try Airtable since it is free up to a given number of records.

Next, Scott Crouch & EJ Bensing talked about their experience cleaning crime databases for police departments. Scott described the complexity of the problem:

  1. Large number of fields in each arrest record: e.g. a police officer might need to fill in 344 fields for a domestic assault with a gun. More complicated arrests can have up to 3000 fields.
  2. Numerous records are duplicated, making it difficult to assemble a suspect’s complete record while avoiding over-reporting. As many as 40% of the 5 million people in Washington D.C. arrest records may be duplicates.
  3. The data must be merged carefully to avoid falsely assigning arrests to the wrong individual. This requires
    1. A full audit trail of all record mergers
    2. Oversight and approval of the police department
    3. The ability to reverse an incorrect merger
  4. Data security is paramount, so they rely on the work done at Amazon to create a U.S. government-compliant database.

EJ talked about how they created quasi-identifiers to group records, and how they use feedback loops to improve both the fields used in the quasi-identifiers and the computation of identifier-to-identifier distance measures.
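
The grouping and merging steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the speakers' actual method: the choice of quasi-identifier (last-name initial plus birth year), the name-similarity distance, the `0.2` threshold, the `MergeLog` class, and the sample records are all hypothetical. The feedback loop that tunes the fields and distances is not shown.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def quasi_identifier(record):
    """Coarse grouping key: last-name initial plus birth year (an assumed choice)."""
    return (record["last"][:1].upper(), record["dob"][:4])


def distance(a, b):
    """0.0 means identical full names, 1.0 means nothing in common."""
    name_a = f'{a["first"]} {a["last"]}'.lower()
    name_b = f'{b["first"]} {b["last"]}'.lower()
    return 1.0 - SequenceMatcher(None, name_a, name_b).ratio()


def candidate_pairs(records, threshold=0.2):
    """Compare only records that share a quasi-identifier."""
    groups = defaultdict(list)
    for rec in records:
        groups[quasi_identifier(rec)].append(rec)
    pairs = []
    for group in groups.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if distance(a, b) <= threshold:
                    pairs.append((a["id"], b["id"]))
    return pairs


class MergeLog:
    """Audit trail: every merge names its approver and can be reversed."""

    def __init__(self):
        self.entries = []       # append-only history of actions
        self.merged_into = {}   # duplicate id -> surviving id

    def merge(self, keep_id, dup_id, approved_by):
        self.entries.append(("merge", keep_id, dup_id, approved_by))
        self.merged_into[dup_id] = keep_id

    def undo(self, dup_id, approved_by):
        keep_id = self.merged_into.pop(dup_id)
        self.entries.append(("undo", keep_id, dup_id, approved_by))


# Hypothetical records; ids 1 and 2 differ only by a spelling variant.
records = [
    {"id": 1, "first": "Jon", "last": "Smith", "dob": "1980-02-01"},
    {"id": 2, "first": "John", "last": "Smith", "dob": "1980-02-01"},
    {"id": 3, "first": "Jane", "last": "Stone", "dob": "1975-07-09"},
]
pairs = candidate_pairs(records)
```

Grouping first keeps the number of pairwise comparisons manageable, and keeping the log append-only means an undone merge still leaves a trace for later review.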

The next speaker was Bob Muglia @Snowflake, which builds a data warehouse on the public cloud. In this way, companies can continue to grow their warehouses while taking full advantage of the cloud:

  1. Separation of storage and computation
  2. Data can be either structured (typical for data warehouses) or semi-structured (machine-generated data, which may have a list structure)
  3. Incorporation of best practices for data security
    1. 2-factor authentication
    2. data encryption
    3. granular access control
    4. process & procedure
    5. audit and certification
  4. Advantages of working in a public cloud
    1. Software is up to date
    2. Online upgrade
    3. Scalable logging and audit services
  5. Faster warehouse setup
    1. Data are stored in standard Amazon S3 storage
    2. The warehouse is spun up only on demand
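
As an illustration of what semi-structured data with a list structure looks like next to ordinary table rows, here is a small Python sketch. It does not use Snowflake's API; the `flatten` helper, the `items` key, and the sample log line are hypothetical and only show how one nested, machine-generated record might be turned into flat rows.

```python
import json


def flatten(event):
    """Turn one nested event into flat rows, one per entry in its "items" list."""
    base = {k: v for k, v in event.items() if k != "items"}
    return [{**base, **item} for item in event.get("items", [])]


# Hypothetical machine-generated log line containing a nested list.
raw = (
    '{"device": "d-17", "ts": "2015-04-14T18:00:00Z", '
    '"items": [{"sensor": "temp", "value": 21.5}, '
    '{"sensor": "rpm", "value": 900}]}'
)

rows = flatten(json.loads(raw))
```

Each resulting row repeats the device and timestamp and carries one sensor reading, which is the shape a structured warehouse table expects.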

Emil Eifrem @Neo4j & Neo Technology was the concluding speaker. Neo4j replaces the idea of a database filled with tables with a model in which nodes are connected by edges in a graph.

Emil described how thinking about the connections among data points is more powerful than looking at the points in isolation. He cited this as the crucial advantage that allowed some companies to overtake and dominate other companies with similar data:

  1. First generation search engines such as Excite, Altavista and Lycos collected data from all web sites, but Google also considered the linkages across web sites
  2. Monster.com collected resumes, but LinkedIn also collected your professional network
  3. Banks collect consumer credit information, but PayPal also collects information on your network and the creditworthiness of individuals in that network.

He then showed how using linkages can lead to a competitive advantage in other industries such as telecommunications, package delivery, etc.

Emil then gave a short demonstration of the power of a graph database on a small data set using the Cypher query language.

He concluded by saying that graphs consisting of nodes (e.g. people) connected by edges (e.g. “loves”) are easy to conceptualize, yet they are a powerful tool for improving business efficiency.
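
The node-and-edge model can be sketched in a few lines of plain Python. This is not Neo4j or Cypher, and it is not the demonstration from the talk; the `Graph` class, its method names, and the sample people are hypothetical. It only shows why questions about connections, such as "who do my contacts know?", become simple traversals once edges are stored explicitly.

```python
from collections import defaultdict


class Graph:
    """Nodes connected by labeled, directed edges (hypothetical toy model)."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(label, other node)]

    def connect(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbors(self, node, label):
        """Nodes reached from `node` by one edge carrying `label`."""
        return [t for (lbl, t) in self.edges[node] if lbl == label]

    def two_hops(self, node, label):
        """Nodes reachable in exactly two steps along edges with this label."""
        found = []
        for mid in self.neighbors(node, label):
            for far in self.neighbors(mid, label):
                if far != node and far not in found:
                    found.append(far)
        return found


# Hypothetical people and relationships.
g = Graph()
g.connect("Ann", "knows", "Bob")
g.connect("Bob", "knows", "Cat")
g.connect("Bob", "knows", "Ann")
g.connect("Ann", "loves", "Dan")

friends_of_friends = g.two_hops("Ann", "knows")
```

In a table-based model the same question typically needs a self-join per hop, whereas here each hop is one lookup in the edge list.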
