New York Tech Journal
Tech news from the Big Apple

DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.

Posted on June 18th, 2016

#DataDrivenNYC

06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)

The four speakers were

Adam @NarrativeScience talked about how people with different personalities and jobs may require/prefer different takes on the same data. His firm ingests data and has systems to generate natural language reports customized to the subject area and the reader’s needs.

They currently develop stories with the guidance of experts, but will eventually move to machine learning to automate new subject areas.

Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform which collects data and allows access to these data in real time.
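
Kafka's model is easy to see from its Java client: producers append records to a named topic, and any number of consumers can read that topic independently and in real time. Below is a minimal producer sketch; the broker address, topic name, and payload are placeholders rather than anything Confluent showed:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // Kafka broker (placeholder)
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Append one event to the "page-views" topic; consumers see it almost immediately.
                producer.send(new ProducerRecord<>("page-views", "user-42", "{\"page\": \"/home\"}"));
            }
        }
    }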

Read more…

posted in:  data, data analysis, Data Driven NYC, Data science, databases, Open source    / leave comments:   No comments yet

Evolving from #RDBMS to #NoSQL + #SQL

Posted on May 3rd, 2016

#SQLNYC

05/03/2016 @Thoughtworks, 99 Madison Ave, 15th floor, NY

Jim Scott @MAPR spoke about #ApacheDrill, which has a query language that extends ANSI SQL. Drill provides an interface that uses this SQL extension to access data in underlying stores, whether SQL, noSQL, CSV files, etc.

The OJAI API has the following advantages:

  1. Gson (in #Java) needs only two lines of code to serialize #JSON for insertion into the database, and one line to deserialize it.
  2. Idempotent – no need to worry about replaying actions twice if there is an issue.
  3. Drill requires Java but not Hadoop, so it can run on a desktop.
  4. Schema on the fly – it will take different data formats and join them together, e.g. csv + JSON.
  5. Data is accessed directly from the underlying databases without first needing to transform it into a metastore.
  6. Security – plugs into the authentication mechanism of the underlying dbs. Mechanisms can go through multiple chains of ownership. Security can be applied at the row level and column level.
  7. Commands extend SQL to allow access to lists in a JSON structure (see the sketch after this list)
    1. CONTAINS
    2. SUM
    3. Can create views to output to parquet, csv, json formats
    4. FLATTEN – explode an array in a JSON structure to display as multiple rows with all other fields duplicated
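
As a concrete illustration of these extensions, the sketch below runs a FLATTEN query over a raw JSON file through Drill's JDBC driver. The file path, field names, and connection URL are placeholders, not anything shown in the talk:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DrillFlattenSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a local Drillbit; no Hadoop cluster is required.
            try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
                 Statement stmt = conn.createStatement()) {
                // FLATTEN explodes the nested "orders" array into one row per element,
                // duplicating the other fields of the parent record.
                ResultSet rs = stmt.executeQuery(
                    "SELECT t.name, FLATTEN(t.orders) AS ord " +
                    "FROM dfs.`/data/customers.json` t");
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " -> " + rs.getString("ord"));
                }
            }
        }
    }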

posted in:  data, data analysis, databases, Open source, Programming    / leave comments:   No comments yet

#NoSQL Databases & #Docker #Containers: From Development to Deployment

Posted on April 26th, 2016

#SQLNYC

04/26/2016 @ThoughtWorks 99 Madison Ave., 15th Floor, New York, NY

Alvin Richards, VP of Product, @Aerospike spoke about employing Aerospike in Docker containers.

He started by saying that database performance demands, including caches and data lakes, have made deployment complex and inefficient. Containers were developed to simplify deployment. They are similar to virtual machines, but describe the OS, programs, and environmental dependencies in a standard format file. The components are:

  1. Dockerfile – names, directories, and the processes to run for setup. OCI is the open container standard.
  2. Docker Compose orchestrates containers
  3. Docker Swarm orchestrates clustering of machines
  4. Docker Machine provisions machines.

Containers share root images (such as the Python image file).

Aerospike is a key value store which is built on the bare hardware (does not call the OS) for speed. It also automates data replication across nodes.
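
The key-value model is visible directly in Aerospike's Java client: a record is addressed by (namespace, set, key) and holds named bins. A minimal read/write sketch, with the host, namespace, set, and bin names as placeholders:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;

    public class AerospikeSketch {
        public static void main(String[] args) {
            // Connect to one seed node; the client discovers the rest of the cluster.
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
            try {
                Key key = new Key("test", "users", "user-42");   // namespace, set, user key
                client.put(null, key, new Bin("name", "Ada"), new Bin("visits", 17));

                Record record = client.get(null, key);           // null = default policy
                System.out.println(record.getString("name") + " " + record.getInt("visits"));
            } finally {
                client.close();
            }
        }
    }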

When Aerospike is run in Docker containers

  1. All nodes perform the same function – automated replication.
  2. The nodes self-discover other nodes to balance the load & replication
  3. The application needs to understand the topology as it changes

In development, the data are often kept in the container since one usually wants to delete the development data when the development server is decommissioned. However, production servers usually don’t hold the data since these servers may be brought up and down, but the data is always retained.

posted in:  databases, Open source, Programming    / leave comments:   No comments yet

#PostgreSQL conf 2016: #InternetOfThings in industry

Posted on April 19th, 2016

#PGConfus2016  #pgconfus  #postgresql

04/19/2016 @ New York Marriott Brooklyn Bridge

Parag Goradia @GE talked about the transformation of industrial companies into software and analytics companies. He talked about three ingredients of this transformation

  1. Brilliant machines – reduce downtime, make more efficient = asset performance management
  2. Industrial big data
  3. People & work – the field force and service engineers

Unlike the non-industrial internet of things, sensors are already in place within machinery such as jet engines. However, until recently, only summary data were uploaded into the cloud. Industrial users are moving toward storing a much fuller data set. For instance, a single airline flight may yield 500 GB of data.

The advantages of improved performance and maintenance are huge. He estimates over $1 trillion per year in savings. With this data collection goal, GE has created Predix, a cloud store using Postgres. Key features are:

  1. Automated health checks/backups
  2. 1 touch scalability
  3. No unscheduled downtime
  4. Protected data sets

Parag talked about selected applications of this technology

  1. Intelligent cities – adding sensors to existing infrastructure such as street lights
  2. Health cloud advanced visualization – putting scans on the cloud for 3-d visualization and analytics

posted in:  Big data, databases, Internet of Things, Open source    / leave comments:   No comments yet

#PostgreSQL conf 2016: #SQL vs #noSQL

Posted on April 18th, 2016

#PGConf2016

04/18/2016 @ New York Marriott Brooklyn Bridge

The afternoon panel was composed of vendors providing both SQL and noSQL database access. The discussion emphasized that the use of a #SQL vs #noSQL database is primarily driven by

  1. The level of comfort developers/managers have with SQL or noSQL
  2. Whether the discipline of SQL rows and fields is useful in creating applications
  3. Whether applications use JSON structures which are easily saved in a noSQL database
  4. Linking to existing applications can be done either using a SQL database or using an ORM to a noSQL database.

The availability of ORM (object-relational mapping) software blurs the lines between SQL and noSQL databases. However, one is advised to avoid using an ORM when first working with a noSQL database, so one can become familiar with the differences between SQL and noSQL.

Both db types need planning to avoid problems, and the right approach depends on the situation. For instance, sharding might be best done late when there is only a single application being developed. However, if one has many applications using the same infrastructure, one should consider specifying a sharding policy early.

People want to avoid complexity, but don’t want to delegate setup to a standard default or procedure.

Controlling access can be done (even if there is no data in the object) by creating views which are accessible only by some users.
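
In PostgreSQL terms this is typically a view plus a GRANT. A minimal sketch over JDBC; the table, view, and role names are made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RestrictedViewSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/appdb", "admin", "secret");
                 Statement stmt = conn.createStatement()) {
                // Expose only the non-sensitive columns through a view...
                stmt.execute("CREATE VIEW public_orders AS " +
                             "SELECT id, status, created_at FROM orders");
                // ...and grant access to the view alone, not the underlying table.
                stmt.execute("GRANT SELECT ON public_orders TO analyst_role");
            }
        }
    }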

Geolocation data is best handled by specialized db’s like CartoDB: the coordinate system is not rectangular, and the data can be handled by sampling and aggregation.

posted in:  databases, Geolocation, Open source    / leave comments:   No comments yet

#PostgreSQL conf 2016: #DataSecurity

Posted on April 18th, 2016

#PGConf2016

04/18/2016 @ New York Marriott Brooklyn Bridge

Several morning presentations concentrated on data security. Secure data sharing has several conflicting goals including

  1. Encourage data sharing
  2. Restricting who can see the data and who can pass on the rights to see the data
  3. Integrity of the original data including verification of the sender/originator
  4. Making protection transparent to users

Traditional approaches have concentrated on encryption that cannot be broken by outsiders and schemes so users can validate content and ensure confidential data transmission.

The following additional approaches/ideas were discussed:

  1. Create a system where the data are self-protecting
  2. Enforce encryption at the object or attribute level?
  3. No security system will be water tight (except no access), so one needs to understand what are the major threats and what are the best solutions and the tradeoffs. Who can you trust?
  4. Can control be extended beyond the context of the system?
  5. Hardware Security Modules may offer more security but the raw key cannot be exported and could be hacked by the OS
  6. The use of point-to-point security systems so data is only available when needed.
  7. Point-to-point key management systems offer the possibility of understanding the usage patterns to identify security breaches

The problem is not software, it is the requirements. One wants to trust the user as much as possible and not rely on the service provider. However, in a big organization, one needs a recovery key since there is a premium on recovering lost data.

posted in:  databases, Open source, security    / leave comments:   No comments yet

DataDrivenNYC: #OpenSource business models, #databases, man-machine collaboration, text markup

Posted on March 16th, 2016

#DataDrivenNYC

03/16/2016 @AXA Equitable Center, 787 7th ave, NY

The two speakers and two interviewees were

Eric Colson @StitchFix (provides fashion to you, but you don’t pick anything) spoke about their hybrid system of selecting fashion for customers. In their business

  1. You create a profile
  2. You get five hand-picked items
  3. You keep what you like, send back the rest

This means that they need to get it right the first time. Computers do preference modeling better than humans, and humans can better understand the idiosyncrasies of other humans, so StitchFix uses both.

They first send the customer’s request for clothes to a recommender system and give the output of the recommender system to a human to provide a check and customize the offering.

Next, Matt Turck interviewed Eliot Horowitz @MongoDB. Mongo was started in 2007 as the founders struggled with the database needs for their new venture. Eventually they decided that the db was the most interesting part of the platform, so they made it open source, sharing it with friends, friends of friends, etc.

It was initially implemented using a simple storage engine, but as the user base grew they needed a more efficient one. So in the summer of 2014 they acquired WiredTiger (whose developers formerly worked on Berkeley DB) to create the new storage engine. They released it in March 2015 as version 3.0.

In Dec 2015, they released version 3.2 which added encryption, an improved management deployment tool and a join operator.
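
The 3.2 join surfaces as the $lookup aggregation stage. A minimal sketch with the MongoDB Java driver; the database, collection, and field names are hypothetical:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Aggregates;
    import org.bson.Document;

    import java.util.Collections;

    public class LookupSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");
                // Left-outer join each order with its customer document ($lookup, added in 3.2).
                for (Document doc : orders.aggregate(Collections.singletonList(
                        Aggregates.lookup("customers", "customerId", "_id", "customer")))) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }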

In Dec 2015 they also started to sell packages such as a BI connector (e.g. connect to Tableau) and Compass (graphical viewer of the data structure).

In 2016 they will add a graph algorithm in version 3.4.

Eliot then talked about ways to monetize open source:

  1. Consulting and support
  2. Tools – e.g. the BI connector or Compass – developers are not interested, but businesses are
  3. Cloud services – manage DB, backups, upgrades, etc.

He also mentioned that the open source license matters (they release under the AGPL), so Amazon can’t resell it as a supported service.

He said that their priorities are

  1. Make current users successful
  2. Make it easy for people to migrate to Mongo
  3. Develop products for the cloud, which is the future of databases

Next, Kieran Snyder @textio talked about how their product helps users create better recruiting notices. She noted that the effectiveness of staff searches often depends on the effectiveness of emails and broadcast job descriptions. Their tool highlights words that either help or hurt the description.

They tagged 15 million job listings according to how many people applied, how long the role took to fill, what kinds of people applied, the demographic mix, etc. They looked for patterns in the text and developed a scoring system within a word processor that evaluates the text and marks it up:

  1. A score from 0 to 100: in this market, for a similar job title, how fast will this role be filled?
  2. Green = phrases that drive success up
  3. Red = phrases that drive success down – e.g. an overused term
  4. Structural features – e.g. formatting (best if about 1/3 of the content is bulleted) and the percentage of “you” vs “we” language (a toy illustration follows this list)
  5. Model gender tone – blue vs purple highlighting
  6. Look for boilerplate
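
Textio's model is proprietary, but the structural checks above are easy to picture. A toy illustration that measures the "you" vs "we" balance and the fraction of bulleted lines; the word lists and sample posting are invented, not Textio's:

    import java.util.Arrays;

    public class JobPostChecks {
        // Ratio of "you/your" words to "we/our" words in the posting.
        static double youWeRatio(String text) {
            String[] words = text.toLowerCase().split("\\W+");
            long you = Arrays.stream(words).filter(w -> w.equals("you") || w.equals("your")).count();
            long we  = Arrays.stream(words).filter(w -> w.equals("we") || w.equals("our")).count();
            return we == 0 ? you : (double) you / we;
        }

        // Fraction of lines that are bullet points.
        static double bulletedFraction(String text) {
            String[] lines = text.split("\n");
            long bullets = Arrays.stream(lines).filter(l -> l.trim().startsWith("-")).count();
            return (double) bullets / lines.length;
        }

        public static void main(String[] args) {
            String posting = "You will own our data pipeline.\n- You ship weekly\n- We value curiosity";
            System.out.printf("you/we ratio: %.2f, bulleted fraction: %.2f%n",
                    youWeRatio(posting), bulletedFraction(posting));
        }
    }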

Textio has also been used to detect gender bias in other texts.

The best feedback is from their current clients.

They retrain the model every week since language evolves over time: phrases like ‘big data’ may be positive initially and eventually become negative as everyone uses the term.

Lastly, Matt Turck interviewed Peter Fenton @Benchmark, a west coast venture firm that has backed many winners.

In a wide-ranging interview, Peter made many points, including:

  1. He looks for the entrepreneur to have a “dream about the possible world”. This requires the investor to naively suppress the many reasons why it won’t work.
  2. He looks for certain personality attributes in the entrepreneur:
    1. “Deeply authentic”, with “reckless passion”
    2. Charismatic, with an idea that destroys the linear mindset
  3. In the product he looks for:
    1. Attributes that are radically differentiated
    2. A product that fits the market and a founder who fits the market
    3. A structure conducive to radical growth
  4. Open source as an investment has two business models:
    1. The packaging model – e.g. RedHat
    2. Open core – which needs product value and a big enough market to gain platform status. If you want mass adoption, go as long as possible without considering monetization.
  5. On business models:
    1. What is the sequence in your business – do you need mass adoption first?
    2. Once mass adoption is achieved, you need to protect it.
    3. The biggest issue is how to deal with Amazon, which can take open source and offer it on its platform, so you need to license in the right way; Amazon just needs to be good enough.
    4. To succeed in open source you need a direct relationship with developers to protect your business model.
  6. Azure has some advantages: new Nvidia GPUs give it a performance edge.
  7. Machine learning + big data can create a new experience, but
    1. In the NBA there is no longer an advantage in the analytics, since everyone hires the same people. Data is the advantage.
    2. If you have a unique pool of information, you can add analytics.
  8. Capital is harder to get now.
    1. One of the main issues is that some businesses are so well capitalized that they currently don’t need to consider their burn rate. This mispricing forces all competitors into the same model of hypergrowth at the expense of sustainable growth and infrastructure improvement.
    2. Unlike 1999-2000, the bubble is driven by institutional money. Therefore it will take longer for the bubble to burst, since institutions can hide the losses for longer (they do not mark investments to equity market prices).
    3. When valuations eventually drop, many private companies will not go public, but will be acquired by larger companies.
  9. It takes years for an opportunity to be exploited: it took 5 years to go from the camera phone to Instagram, and years to go from GPS to Uber. It is unclear whether blockchain is in gestation and what it will give rise to (bitcoin is the best application now).

 

posted in:  data analysis, Data Driven NYC, databases, Open source, startup    / leave comments:   No comments yet

DataDrivenNYC: #Hadoop, #BigData, #B2B

Posted on December 15th, 2015

#DataDrivenNYC

12/14/2015 @ Bloomberg, 731 Lexington Ave, NY

Four speakers talked about businesses that provide support for companies exploring large data sets.

M.C. Srivas, Co-Founder and CTO of MapR (Apache Hadoop distribution)

Nick Mehta, CEO of Gainsight (enterprise software for customer success)

Shant Hovsepian, Founder and CTO of Arcadia Data (visual analytics and business intelligence platform for Big Data)

Stefan Groschupf, Founder and CEO of Datameer (Big Data analytics application for Hadoop)

 

In the first presentation, Shant Hovsepian, CTO @ArcadiaData, spoke about how to most effectively use big data to answer business questions. He summarized his points in 10 commandments for BI on big data:

  1. Thou shalt not move big data – push computation close to data. Yarn, Mesos have made it possible to run a BI server next to the data. Use Mongo to create documents directly. Use native analysis engines.
  2. Thou shalt not steal or violate corporate security policies – use (RBAC) role based access control. Look for unified security models. Make sure there is an audit trail.
  3. Thou shalt not pay for every user or gigabyte – you need to plan for scalability, so be wary of pricing models that penalize you for increased adoption. Your data will grow quicker than you anticipate.
  4. Thou shalt covet thy neighbor’s visualization – publish pdf and png files. Collaboration is needed since no single person understands the entire data set.
  5. Thou shalt analyze thine data in its natural form – read free form. FIX format, JSON, tables,…
  6. Thou shalt not wait endlessly for thine results – understand and analyze your data quickly. Methods include building an OLAP cube, creating temp tables, and taking samples of the data.
  7. Thou shalt not build reports, instead build apps – users should interact with visual elements, not text boxes or dropdown templates. Components should be reusable. Decouple data from the app (as D3 does on web pages).
  8. Thou shalt use intelligent tools – look for tools with search built in. Automate what you can.
  9. Thou shalt go beyond the basics – make functionality available to users
  10. Thou shalt use Arcadia Data – a final pitch for his company.

Next, Stefan Groschupf @Datameer advocated data-driven decision making and talked about how business decisions should drive the decisions made at the bottom of the hardware/software stack.

Stefan emphasized that businesses often do not know the analytics they will need to run, so designing a database schema is not a first priority (he views a database schema as a competitive disadvantage). Instead he advocated starting at the hardware (which is the hardest to change later) and designing up the stack.

He contrasted the processing models for YARN and Mesosphere as alternative ways of looking at processing either by grouping processes into hardware or grouping hardware to handle a single process.

Despite the advances in computer power and software, he views data as getting exponentially more complex, thereby making the analyst’s job even harder. Interactive tools are the short-term answer, with deep learning eventually becoming more prominent.

The internet of things will offer even more challenges with more data and data sources and a wider variety of inputs.

Stefan recommended the movie Personal Gold as a demonstration of the power of analytics – Olympic cyclists used better analytics to overcome a lack of funding relative to their better-funded rivals.

In the third presentation, M.C. Srivas @MapR talked about the challenges and opportunities of handing large data sets from a hardware perspective.

MapR was born from a realization that companies such as Facebook, LinkedIn, etc. were all using Hadoop and building extremely large databases. These databases created hardware challenges, and the hardware needed to be designed with the flexibility to grow as the industry changed.

M.C. gave three examples to highlight the enormous size of big data

  1. Email. One of the largest email providers holds 1.5B accounts, with 200k new accounts created daily. Reliability is targeted at six-nines (99.9999%) uptime. Recent emails, those from the past 3 weeks, are put on flash storage; after that they are moved to hard disk. 95% of files are small, but the other 5% often include megabyte images and documents.
  2. Personal Records. 70% of Indians don’t have birth certificates – these people are off the grid and as a result are open to abuse and financial exploitation. With the encouragement of the Indian government, MapR has built a system in India based on biometrics. They add 1.5 million people per day, so in 2.5 years they expect everyone in India will be on the system.

The system is used for filing tax returns, airport security, access to ATMs, etc. The system replaces visas and reduces corruption.

  3. Streams. Self-driving cars generate 1 terabyte of data per hour per car. Keeping these data requires massive onboard storage and edge processing, along with streaming data off the vehicle. Long-term storage for purposes such as accident reporting will require large amounts of storage.

He concluded by talking about MapR’s views on open source software. They use open source code and will create open source APIs to fill gaps that they see. They see good open source products, but advise considering whether an open source project is primarily controlled by a single company, to avoid dependency on that company.

The presentations were concluded by Nick Mehta @Gainsight talking about providing value to B2B customers, each of whom has slightly different needs.

As a service provider to customers, they face 5 challenges:

  1. Garbage in, garbage out – “what is going on before the customer leaves” – schemas change and keys are different for different departments, which part of the company are you selling to, new products and other changes over time.
  2. All customers are different – buy different products, different use cases, different amounts spent, customization
  3. Small sample size – too few data points for what you actually want to study
  4. How do you explain what the model does? vs “what percent of the time is this right?”
  5. Ease of driving change? – How do you get them to change their approach? There is a tendency for clients to look for the examples where the machine is wrong rather than looking at the bigger picture.

In response to these challenges, he made 5 suggestions

  1. Data quality – focus on recent data
  2. Data variance – analyze by segment
  3. Data points (small data) – use leading metrics. Use intuition on possible drivers: consider when user engagement drops off as a warning sign that the user might exit
  4. Model your understanding – compare proposals versus random actions
  5. Drive change – track and automate actions

Nick concluded by saying that they use a wide variety of tools depending on the customer needs. Tools include Postgres, Mongo, Salesforce, Amazon Redshift, etc.

 

posted in:  Big data, data, data analysis, Data Driven NYC, databases    / leave comments:   No comments yet

CodeDrivenNYC: bridging the #Culture gap between #software and #hardware, modern #SQL, #managing engineers

Posted on October 28th, 2015

#CodeDrivenNYC

10/28/2015 @FirstMark, 100 5th Ave, NY

Three speakers spoke

Colin Vernon @littleBits spoke about bridging the cultural gap between software and hardware engineers. This is especially important at a company that creates modules for users to assemble into standalone prototype devices (Internet of Things). From his position as Director of Platform, Colin talked about the wide diversity of skills needed to manage the hardware/software stack.

This diversity also leads to differences:

  1. Two cultures: software – agile, hardware = not agile
  2. Two paradigms: software – abstractions; hardware – simplest method is best.
  3. Communication styles: software – chatty; hardware – brevity and clarity

He recommends letting the cultures be different and concentrating on touch points:

  1. Identify what you have in common and strengthen it.
  2. Compromise; don’t pick anyone’s last choice – e.g. they are currently doing things in Go
  3. Don’t meet in the middle (no overlap). Have both sides stretch to create an overlap.

Next, Spencer Kimball @Cockroach Labs compared their database to other popular databases. Cockroach aims to combine the best characteristics of SQL databases with the advantages of replication across many nodes, such as survivability, consistency, and ease of deployment.

They like SQL since it is widely adopted, schemas are useful to clarify your thinking, and the relational structure can also be used for complex data analytics. They have also extended SQL by allowing their databases to scale out across servers, adding hierarchical tables and other modern features, and not locking the database when the schema changes.

Spencer next talked about the key foundational ideas they have incorporated (a usage sketch follows the list):

  1. Started by considering Transactional Key Value store as a foundational building block.
  2. Provide fully distributed transactions.
    1. serializable by default.
    2. No strict locking – makes things faster, but increases the chance of conflicts on shared resources.
  3. Others use a consistent hashing scheme to locate where data is stored, but this slows sorting, which makes it problematic for relational databases.
  4. Use a bi-level index to get the best of range-segmented key space yet allow the db to expand
  5. Raft is a consensus algorithm that is simpler to understand than Paxos. It can replicate data which makes it robust and gives consistent answers. It is designed for strong consistency (as opposed to eventual consistency).
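
Because CockroachDB speaks the PostgreSQL wire protocol, the standard Postgres JDBC driver is enough to try the transactional model. A minimal sketch assuming a local, insecure single-node cluster; the table and values are made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class CockroachTransferSketch {
        public static void main(String[] args) throws Exception {
            // CockroachDB listens on the Postgres wire protocol, port 26257 by default.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:26257/bank", "root", "")) {
                conn.setAutoCommit(false);   // group both updates into one distributed transaction
                try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - 10 WHERE id = ?");
                     PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + 10 WHERE id = ?")) {
                    debit.setInt(1, 1);
                    debit.executeUpdate();
                    credit.setInt(1, 2);
                    credit.executeUpdate();
                    conn.commit();           // serializable by default, per the talk
                } catch (Exception e) {
                    conn.rollback();
                    throw e;
                }
            }
        }
    }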

Finally, Duncan Glazier @Shopkeep talked about his methods to improve organizational efficiency and produce happy engineers. His main point was that the goals of engineers and managers should be aligned.

Everyone in an organization should have goals including challenges & a metric of success. By making these goals visible to all others in the company, everyone can see how their goals match those of management and others in the firm. He also feels that it is important to get feedback from managers and peers.

posted in:  Code Driven NYC, databases, Programming    / leave comments:   No comments yet

Android apps: #SQL and #Java, #Designers and #Developers, #Information hierarchies, #Translation

Posted on October 23rd, 2015

New York Android Developers

10/19/2015 @Facebook, 770 Broadway, NY

Four speakers talked about different applications and challenges of implementing apps for Android

Kevin Galligan @Touchlab spoke about accessing SQL databases using Java while retaining the best characteristics of object-oriented programming with relational databases.

Kevin talked about the family of Object Relational Mapping (ORM) utilities to perform this linkage. Comparing the offerings is based on performance and structural features. Performance issues revolve around the following questions:

  1. Handling hibernation
  2. Understanding the db structure
  3. Foreign references (parent and child)

However, amongst Android ORMs, there is not much difference on simple tables. For more complicated queries, source-gen is measurably faster than reflection-based. However, Kevin warned the audience not to trust the published benchmark performances.

He then offered a high-level comparison of the main ORMs (a small mapping sketch follows the list):

  1. ORMLite – lots of apps run this, however it is slow-ish
  2. ActiveAndroid – looks like support is dropping off
  3. GreenDAO – external model intrusion, fast (source-gen), single primary key
  4. DBFlow – source-gen, fast, poor foreign relations, support multiple primary keys
  5. Squeaky – APT port of ORMlite, source-gen, single primary key,
  6. Realm – not like SQLite, column oriented, single-field primary key, queries tied to a thread, fast, foreign calls simple, but it’s under development and so is not entirely stable
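
For context, most of these libraries map a plain Java class to a SQLite table through annotations and a DAO. An ORMLite-style sketch; the entity and connection details are made up, and plain JDBC/H2 is used here for brevity where an Android app would use the Android-specific helper classes instead:

    import com.j256.ormlite.dao.Dao;
    import com.j256.ormlite.dao.DaoManager;
    import com.j256.ormlite.field.DatabaseField;
    import com.j256.ormlite.jdbc.JdbcConnectionSource;
    import com.j256.ormlite.support.ConnectionSource;
    import com.j256.ormlite.table.DatabaseTable;
    import com.j256.ormlite.table.TableUtils;

    public class OrmLiteSketch {
        @DatabaseTable(tableName = "members")
        public static class Member {
            @DatabaseField(generatedId = true) int id;   // single primary key
            @DatabaseField String name;
            public Member() {}                           // ORMLite needs a no-arg constructor
            public Member(String name) { this.name = name; }
        }

        public static void main(String[] args) throws Exception {
            ConnectionSource source = new JdbcConnectionSource("jdbc:h2:mem:members");
            TableUtils.createTableIfNotExists(source, Member.class);
            Dao<Member, Integer> dao = DaoManager.createDao(source, Member.class);
            dao.create(new Member("Ada"));
            for (Member m : dao.queryForAll()) {
                System.out.println(m.id + " " + m.name);
            }
            source.close();
        }
    }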

The second speaker was Max Ignatyev @Sympli.io, which is a collaboration tool to improve communications between designers and developers. It offers a single platform for design changes, thereby eliminating confusion over whether communication happened via Dropbox, email, text, etc.

The designer uses Sketch or Photoshop and the developer sees specs in dp and a common color palette, which is integrated as ready-to-use assets in the IDE.

Sympli.io is currently a free offering in beta test.

Next, Liam Spradlin @Touchlab spoke about how users navigate through applications and how to make interfaces so users know what to do next as they complete tasks. He proposed an information hierarchy in which users first see what needs to be done immediately, as an inverted pyramid:

  1. Primary information and actions
  2. Important details
  3. Background information

The last speaker, Mike Castleman @Meetup spoke about the challenges of making Meetup.com more accessible to non-English speakers.

The first challenge is to translate all strings, with sentence-level translation being especially important to avoid problems with gender and plurals across multiple words in a sentence. They considered third-party translators such as Google, but decided to use in-house native speakers as translators, as they know the product. They also provide context to translators by uploading screen shots. They use PhraseApp as their management tool to organize their translations.

Once translated, the layout needs to be altered as strings often become longer in other languages, such as German.

Dates and times have different forms and punctuation, so they use tools such as CLDR (the Common Locale Data Repository) and libcore.icu.ICU to get the conventions correct.

Sorting strings and search can also be challenging, as some languages, such as Japanese, are sorted by the sound of the words, not the actual lexical representation of the spoken words.
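
Much of this handling is available through the platform's locale-aware classes (backed by ICU data on Android). A small sketch of locale-aware date formatting and string sorting; the sample strings are arbitrary:

    import java.text.Collator;
    import java.text.DateFormat;
    import java.util.Arrays;
    import java.util.Date;
    import java.util.Locale;

    public class LocaleSketch {
        public static void main(String[] args) {
            Date now = new Date();
            // Date formats and punctuation differ per locale; never assemble them by hand.
            for (Locale locale : new Locale[] { Locale.US, Locale.GERMANY, Locale.JAPAN }) {
                System.out.println(locale + ": " +
                    DateFormat.getDateInstance(DateFormat.LONG, locale).format(now));
            }
            // Collator sorts by the locale's rules rather than by raw code points.
            String[] names = { "Özil", "Zimmer", "Ober" };
            Arrays.sort(names, Collator.getInstance(Locale.GERMANY));
            System.out.println(Arrays.toString(names));
        }
    }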

posted in:  Android, databases, New York Android Developers    / leave comments:   No comments yet