New York Tech Journal
Tech news from the Big Apple

Beyond Big: Merging Streaming & #Database Ops into a Next-Gen #BigData Platform

Posted on April 13th, 2017


04/13/2017 @Thoughtworks, 99 Madison Ave, New York, 15th floor

Amir Halfon, VP of Strategic Solutions @iguazio, talked about methods for speeding up analytics linked to a large database. He started by saying that a traditional software stack accessing a db was designed to minimize the time taken to access slow disk storage. This resulted in layers of software. Amir said that with modern data access and db architecture, processing is accelerated by a unified data engine that eliminates many of these layers. This also allows for generic access to data stored in many different formats, and a record-by-record security protocol.

To simplify development they only use AWS and only interface with Kafka, Hadoop, and Spark. They do not use virtualization (which eventually hits a speed limit); they implement the actual data store.

Another important method is “predicate pushdown” (as in ‘select … where … <predicate>’). Usually all data are retrieved and then culled; if the predicate is instead pushed down to the storage layer, only the relevant data are retrieved. This is also known as an “offload engine”.
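As a sketch of the idea (plain Python with invented data, not iguazio's actual engine), compare client-side filtering with a pushed-down predicate: in both cases the same rows come back, but pushdown transfers only the matches.

```python
# Illustrative sketch of predicate pushdown vs. client-side filtering.

ROWS = [
    {"id": 1, "city": "NY", "amount": 120},
    {"id": 2, "city": "SF", "amount": 80},
    {"id": 3, "city": "NY", "amount": 45},
]

def scan_all():
    """Naive engine: ship every row to the client, filter afterwards."""
    return list(ROWS)

def scan_pushdown(predicate):
    """Pushdown engine: the predicate runs inside the store, so only
    matching rows cross the wire."""
    return [r for r in ROWS if predicate(r)]

# Client-side filtering retrieves 3 rows, then culls to 2.
fetched = scan_all()
ny = [r for r in fetched if r["city"] == "NY"]

# Pushdown retrieves only the 2 relevant rows.
ny_pushed = scan_pushdown(lambda r: r["city"] == "NY")

assert ny == ny_pushed
print(len(fetched), len(ny_pushed))  # rows transferred: 3 vs 2
```

The results are identical; the difference is how much data crosses the wire, which is exactly what the offload engine avoids.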

MapR is a competitor that builds on the HDFS storage layer, as opposed to rebuilding the system from scratch.

posted in:  Big data, databases    / leave comments:   No comments yet

#Self-learned relevancy with Apache Solr

Posted on March 31st, 2017


03/30/2017 @ Architizer, 1 Whitehall Street, New York, NY, 10th Floor

Trey Grainger @ Lucidworks covered a wide range of topics involving search.

He first reviewed the concept of an inverted index in which terms are extracted from documents and placed in an index which points back to the documents. This allows for fast searches of single terms or combinations of terms.
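A toy version of such an index can be built in a few lines of Python (illustrative only; Lucene's actual implementation is far more elaborate): each term maps to the set of document ids containing it, and multi-term queries become set operations.

```python
# A minimal inverted index: map each term to the set of document ids
# that contain it, so combined-term queries become fast set operations.

from collections import defaultdict

docs = {
    0: "fast search of large data",
    1: "search terms in documents",
    2: "large scale data platform",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Single-term lookup.
print(sorted(index["search"]))                 # [0, 1]

# AND query: intersect the posting sets.
print(sorted(index["large"] & index["data"]))  # [0, 2]
```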

Next Trey covered classic relevancy scores emphasizing

tf-idf = how well a term describes the document * how important the term is overall

He noted, however, that tf-idf’s value may be limited since it does not make use of domain-specific knowledge.
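The formula above can be sketched directly (a simplified tf-idf on invented documents; Solr/Lucene use a tuned variant of this scoring):

```python
# tf-idf from the definition above: term frequency (how well the term
# describes the document) times inverse document frequency (how rare,
# and hence informative, the term is across the corpus).

import math

docs = [
    "solr search relevancy",
    "search engine basics",
    "solr solr tuning",
]
corpus = [d.split() for d in docs]
N = len(corpus)

def tf_idf(term, doc_terms):
    tf = doc_terms.count(term) / len(doc_terms)          # within-document weight
    df = sum(1 for d in corpus if term in d)             # documents containing term
    idf = math.log(N / df)                               # rarity across the corpus
    return tf * idf

# 'solr' appears in 2 of 3 documents; 'tuning' appears in only one,
# so 'tuning' scores higher in the third document.
print(round(tf_idf("tuning", corpus[2]), 3))  # 0.366
print(round(tf_idf("solr", corpus[2]), 3))    # 0.27
```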

Trey then talked about reflected intelligence = self-learning search, which uses

  1. Content
  2. Collaboration – how have others interacted with the system
  3. Context – information about the user

He said this method increases relevance by boosting items that are highly requested by others. Since the items boosted are those currently relevant to others, this allows the method to adapt quickly without need for manual curation of items.
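One simple way to realize such a boost (hypothetical scoring, not Lucidworks' actual formula) is to scale each item's base relevance by a damped count of recent interactions:

```python
# Sketch of the collaborative boost described above: multiply each
# item's base relevance by a factor derived from recent interactions,
# so items popular with other users rise without manual curation.

import math

base_scores = {"doc_a": 0.8, "doc_b": 0.7, "doc_c": 0.6}
recent_clicks = {"doc_a": 2, "doc_b": 40, "doc_c": 5}

def boosted(doc):
    # log damping keeps very popular items from dominating entirely
    return base_scores[doc] * (1 + math.log1p(recent_clicks[doc]))

ranking = sorted(base_scores, key=boosted, reverse=True)
print(ranking)  # doc_b overtakes doc_a on the collaborative signal
```

Because the boost is recomputed from current interaction data, the ranking adapts as soon as user behavior shifts.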

Next he talked about semantic search, which uses its understanding of terms in the domain.

(Solr can connect to an RDF database to leverage an ontology). For instance, one can run word2vec to extract terms and phrases from a query and then determine a set of keywords/phrases that best match the query to the contents of the db.

Also, querying a semantic knowledge graph can expand the search by traversing to other relevant terms in the db.
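A minimal sketch of that traversal, using an invented term graph (a real knowledge graph would be derived from the corpus or an ontology): start at the query term and collect neighbors as additional search terms.

```python
# Query expansion over a small semantic graph: traverse from the query
# term to related terms and add them to the search (invented data).

graph = {
    "laptop": ["notebook", "ultrabook"],
    "notebook": ["laptop"],
    "ultrabook": ["laptop"],
}

def expand(term, depth=1):
    terms = {term}
    frontier = {term}
    for _ in range(depth):
        # step to unseen neighbors of the current frontier
        frontier = {n for t in frontier for n in graph.get(t, [])} - terms
        terms |= frontier
    return terms

print(sorted(expand("laptop")))  # ['laptop', 'notebook', 'ultrabook']
```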

posted in:  Big data, databases, Open source    / leave comments:   No comments yet

DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.

Posted on June 18th, 2016


06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)


The four speakers were

Adam @NarrativeScience talked about how people with different personalities and jobs may require/prefer different takes on the same data. His firm ingests data and has systems to generate natural language reports customized to the subject area and the reader’s needs.

They currently develop stories with the guidance of experts, but eventually will move to machine learning to automate new subject areas.

Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform which collects data and allows access to these data in real time.


posted in:  data, data analysis, Data Driven NYC, Data science, databases, Open source    / leave comments:   No comments yet

Evolving from #RDBMS to #NoSQL + #SQL

Posted on May 3rd, 2016


05/03/2016 @Thoughtworks, 99 Madison Ave, 15th floor, NY


Jim Scott @MapR spoke about #ApacheDrill, which has a query language that extends ANSI SQL. Drill provides an interface that uses this SQL extension to access data in underlying stores that are SQL, noSQL, csv, etc.

The OJAI API has the following advantages:

  1. Gson (in #Java) needs only two lines of code to serialize #JSON into the data store, and one line to deserialize.
  2. Idempotent – so you don’t need to worry about replaying actions twice if there is an issue.
  3. Drill requires Java, but not Hadoop, so it can run on a desktop.
  4. Schema on the fly – it will take different data formats and join them together: e.g. csv + JSON.
  5. Data is accessed directly from the underlying databases without needing to first transform it into a metastore.
  6. Security – plugs into the authentication mechanisms of the underlying dbs. Mechanisms can go through multiple chains of ownership. Security can be applied at the row level and the column level.
  7. Commands extend SQL to allow access to lists in a JSON structure:
    1. SUM
    2. Views can be created to output to parquet, csv, and json formats.
    3. FLATTEN – explodes an array in a JSON structure to display as multiple rows with all other fields duplicated.
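The FLATTEN behavior can be emulated in plain Python to make the semantics concrete (this is an illustration with invented records, not Drill itself):

```python
# Emulating Drill's FLATTEN: explode an array field in a JSON record
# into one row per element, duplicating all other fields.

records = [
    {"name": "alice", "langs": ["sql", "json"]},
    {"name": "bob", "langs": ["csv"]},
]

def flatten(rows, array_field):
    for row in rows:
        for value in row[array_field]:
            out = {k: v for k, v in row.items() if k != array_field}
            out[array_field] = value  # one scalar per output row
            yield out

for r in flatten(records, "langs"):
    print(r)
# {'name': 'alice', 'langs': 'sql'}
# {'name': 'alice', 'langs': 'json'}
# {'name': 'bob', 'langs': 'csv'}
```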

posted in:  data, data analysis, databases, Open source, Programming    / leave comments:   No comments yet

#NoSQL Databases & #Docker #Containers: From Development to Deployment

Posted on April 26th, 2016


04/26/2016 @ThoughtWorks 99 Madison Ave., 15th Floor, New York, NY


Alvin Richards, VP of Product, @Aerospike spoke about employing Aerospike in Docker containers.

He started by saying that database performance demands, including caches and data lakes, have made deployment complex and inefficient. Containers were developed to simplify deployment. They are similar to virtual machines, but describe the OS, programs, and environment dependencies in a standard format file. The components are

  1. Dockerfile with names + directory + processes to run for setup. OCI is the open container standard.
  2. Docker Compose orchestrates containers
  3. Docker Swarm orchestrates clustering of machines
  4. Docker Machine provisions machines.

Containers share root images (such as the Python image file).

Aerospike is a key-value store built on the bare hardware (it bypasses the OS) for speed. It also automates data replication across nodes.

When Aerospike is run in Docker containers

  1. All nodes perform the same function – automated replication.
  2. The nodes self-discover other nodes to balance load and replication
  3. The application needs to understand the topology as it changes

In development, the data are often kept in the container since one usually wants to delete the development data when the development server is decommissioned. However, production servers usually don’t hold the data since these servers may be brought up and down, but the data is always retained.

posted in:  databases, Open source, Programming    / leave comments:   No comments yet

#PostgreSQL conf 2016: #InternetOfThings in industry

Posted on April 19th, 2016

#PGConfus2016  #pgconfus  #postgresql

04/19/2016 @ New York Marriott Brooklyn Bridge

Parag Goradia @GE talked about the transformation of industrial companies into software and analytics companies. He talked about three ingredients of this transformation

  1. Brilliant machines – reduce downtime, make more efficient = asset performance management
  2. Industrial big data
  3. People & work – the field force and service engineers

Unlike the non-industrial internet of things, sensors are already in place within machinery such as jet engines. However, until recently, only summary data was uploaded to the cloud. Industrial users are moving toward storing a much fuller data set. For instance, a single airline flight may yield 500 GB of data.

The advantages of improved performance and maintenance are huge: he estimates over $1 trillion per year in savings. With this data collection goal, GE has created Predix, a cloud store using PostgreSQL. Key features are

  1. Automated health checks/backups
  2. 1 touch scalability
  3. No unscheduled downtime
  4. Protected data sets

Parag talked about selected applications of this technology

  1. Intelligent cities – adding sensors to existing infrastructure such as street lights
  2. Health cloud advanced visualization – putting scans on the cloud for 3-d visualization and analytics

posted in:  Big data, databases, Internet of Things, Open source    / leave comments:   No comments yet

#PostgreSQL conf 2016: #SQL vs #noSQL

Posted on April 18th, 2016


04/18/2016 @ New York Marriott Brooklyn Bridge


The afternoon panel was composed of vendors providing both SQL and noSQL database access. The discussion emphasized that the use of a #SQL vs #noSQL database is primarily driven by

  1. The level of comfort developers/managers have with SQL or noSQL
  2. Whether the discipline of SQL rows and fields is useful in creating applications
  3. Whether applications use JSON structures which are easily saved in a noSQL database
  4. Linking to existing applications can be done either using a SQL database or using an ORM to a noSQL database.

The availability of ORM (object-relational mapping) software blurs the lines between SQL and noSQL databases. However, one is advised to avoid using an ORM when first adopting a noSQL database, so one can become familiar with the differences between SQL and noSQL.

Both db types need planning to avoid problems, and the right approach depends on the situation. For instance, sharding might best be done late when there is only a single application being developed. However, if many applications use the same infrastructure, one should consider specifying a sharding policy early.

People want to avoid complexity, but don’t want to delegate setup to a standard default or procedure.

Controlling access can be done (even if there is no data in the object) by creating views that are accessible only to some users.

Geolocation data is best handled by specialized db’s like CartoDB: the coordinate system is not rectangular, and the data can be handled by sampling and aggregation.

posted in:  databases, Geolocation, Open source    / leave comments:   No comments yet

#PostgreSQL conf 2016: #DataSecurity

Posted on April 18th, 2016


04/18/2016 @ New York Marriott Brooklyn Bridge


Several morning presentations concentrated on data security. Secure data sharing has several conflicting goals including

  1. Encouraging data sharing
  2. Restricting who can see the data and who can pass on the rights to see the data
  3. Integrity of the original data including verification of the sender/originator
  4. Making protection transparent to users

Traditional approaches have concentrated on encryption that cannot be broken by outsiders and schemes so users can validate content and ensure confidential data transmission.

The following additional approaches/ideas were discussed:

  1. Create a system in which the data are self-protecting
  2. Enforce encryption at the object or attribute level?
  3. No security system will be water tight (except no access), so one needs to understand what are the major threats and what are the best solutions and the tradeoffs. Who can you trust?
  4. Can control be extended beyond the context of the system?
  5. Hardware Security Modules may offer more security but the raw key cannot be exported and could be hacked by the OS
  6. The use of point-to-point security systems so data is only available when needed.
  7. Point-to-point key management systems offer the possibility of understanding the usage patterns to identify security breaches

The problem is not software, it is the requirements. One wants to trust the user as much as possible and not rely on the service provider. However, in big organizations, one needs a recovery key since there is a premium on recovering lost data.

posted in:  databases, Open source, security    / leave comments:   No comments yet

DataDrivenNYC: #OpenSource business models, #databases, man-machine collaboration, text markup

Posted on March 16th, 2016


03/16/2016 @AXA Equitable Center, 787 7th ave, NY


The two speakers and two interviewees were

Eric Colson @StitchFix (provides fashion to you, but you don’t pick anything) spoke about their hybrid system of selecting fashion for customers. In their business

  1. You create a profile
  2. You get five hand-picked items
  3. You keep what you like, send back the rest

This means that they need to get it right the first time. Computers do preference modeling better than humans, and humans can better understand the idiosyncrasies of other humans, so StitchFix uses both.

They first send the customer’s request for clothes to a recommender system and give the output of the recommender system to a human to provide a check and customize the offering.

Next, Matt Turck interviewed Eliot Horowitz @MongoDB. Mongo was started in 2007 as the founders struggled with the database needs of their new venture. Eventually they decided that the db was the most interesting part of the platform, so they made it open source, initially sharing it with friends, friends of friends, etc.

It was initially implemented using a simple storage engine. But as the user base grew they needed a more efficient one, so in Summer 2014 they acquired WiredTiger (whose developers came from Berkeley DB) to create the new storage engine. They released it in March 2015 as version 3.0.

In Dec 2015, they released version 3.2 which added encryption, an improved management deployment tool and a join operator.

In Dec 2015 they also started to sell packages such as a BI connector (e.g. connect to Tableau) and Compass (graphical viewer of the data structure).

In 2016 they will add a graph algorithm in version 3.4.

Eliot then talked about ways to monetize open source

  1. Consulting and support
  2. Tools – e.g. the BI connector or Compass – developers are not interested, but businesses are
  3. Cloud services – manage DB, backups, upgrades, etc.

He also mentioned that the open source license matters (they release under the AGPL), so Amazon can’t resell Mongo as a supported service.

He said that their priorities are

  1. Make current users successful
  2. Make it easy for people to migrate to Mongo
  3. Develop products for the cloud, which is the future of databases

Next, Kieran Snyder @textio talked about their product, which helps users create better recruiting notices. She noted that the effectiveness of staff searches often depends on the effectiveness of emails and broadcast job descriptions. Their tool highlights words that either help or hurt the description.

They tagged 15 million job listings according to how many people applied, how long they took to fill, what kinds of people applied, demographic mix, etc. They looked for patterns in the text and developed a scoring system within a word processor that evaluates the text and marks it up:

  1. Score 0-100: in this market, with a similar job title, how fast will this role be filled?
  2. Green = phrases that drive success up
  3. Red = phrases that drive success down – e.g. an overused term
  4. Look at structure – e.g. formatting (best if 1/3 of the content is bulleted), or the percent of “you” vs. “we” language
  5. Model gender tone – blue vs purple highlighting
  6. Look for boilerplate

Textio has also been used to detect gender bias in other texts.

The best feedback is from their current clients.

They retrain the model every week since language evolves over time: phrases like ‘big data’ may be positive initially and eventually become negative as everyone uses the term.

Lastly, Matt Turck interviewed Peter Fenton @Benchmark, a west coast venture firm that has backed many winners.

In a wide-ranging interview, Peter made many points including:

  1. He looks for the entrepreneur to have a “dream about the possible world”. This requires the investor to naively suppress the many reasons why it won’t work.
  2. He looks for certain personality attributes in the entrepreneur: “deeply authentic” with “reckless passion”; charismatic, with an idea that destroys the linear mindset.
  3. In the product he looks for: attributes that are radically differentiated; a product that fits the market and a founder who fits the market; a structure conducive to radical growth.
  4. Open source as an investment has two business models:
    1. The packaging model – e.g. RedHat
    2. Open core – which needs product value and a big enough market to gain platform status. If you want mass adoption, go as long as possible without considering monetization.
  5. On business models: what is the sequence in your business – do you need mass adoption first? Once mass adoption is achieved you need to protect it. The biggest issue is how to deal with Amazon, who can take open source and offer it on their platform; you need to license in the right way, since Amazon just needs to be good enough. To succeed in open source you need a direct relationship with developers to protect your business model.
  6. Azure has some advantages: new Nvidia GPUs give it a performance edge.
  7. Machine learning + big data can create a new experience, but:
    1. In the NBA there is no longer an advantage in the analytics, since everyone hires the same people. Data is the advantage.
    2. If you have a unique pool of information, you can add analytics.
  8. Capital is harder to get now.
    1. One of the main issues is that some businesses are so well capitalized that they currently don’t need to consider their burn rate. This mispricing forces all competitors into the same model of hypergrowth at the expense of sustainable growth and infrastructure improvement.
    2. Unlike 1999-2000, the bubble is driven by institutional money. Therefore it will take longer for the bubble to burst, since institutions can hide the losses for longer (they do not mark investments to equity market prices).
    3. When valuations eventually drop, many private companies will not go public, but will be acquired by larger companies.
  9. It takes years for an opportunity to be exploited: it took 5 years to go from the camera phone to Instagram. It took years to go from GPS to Uber. It is unclear whether blockchain is in gestation and what it will give rise to (bitcoin is the best application now).


posted in:  data analysis, Data Driven NYC, databases, Open source, startup    / leave comments:   No comments yet

DataDrivenNYC: #Hadoop, #BigData, #B2B

Posted on December 15th, 2015


12/14/2015 @ Bloomberg, 731 Lexington Ave, NY


Four speakers talked about businesses that provide support for companies exploring large data sets.

M.C. Srivas, Co-Founder and CTO of MapR (Apache Hadoop distribution)

Nick Mehta, CEO of Gainsight (enterprise software for customer success)

Shant Hovsepian, Founder and CTO of Arcadia Data (visual analytics and business intelligence platform for Big Data)

Stefan Groschupf, Founder and CEO of Datameer (Big Data analytics application for Hadoop)


In the first presentation, Shant Hovsepian, CTO @ArcadiaData, spoke about how to most effectively use big data to answer business questions. He summarized his points in 10 commandments for BI on big data:

  1. Thou shalt not move big data – push computation close to the data. YARN and Mesos have made it possible to run a BI server next to the data. Use Mongo to create documents directly. Use native analysis engines.
  2. Thou shalt not steal or violate corporate security policies – use role-based access control (RBAC). Look for unified security models. Make sure there is an audit trail.
  3. Thou shalt not pay for every user or gigabyte – you need to plan for scalability, so be wary of pricing models that penalize you for increased adoption. Your data will grow quicker than you anticipate.
  4. Thou shalt covet thy neighbor’s visualization – publish pdf and png files. Collaboration is needed since no single person understands the entire data set.
  5. Thou shalt analyze thine data in its natural form – read free-form data: FIX format, JSON, tables, etc.
  6. Thou shalt not wait endlessly for thine results – understand and analyze your data quickly. Methods include building an OLAP cube, creating temp tables, and taking samples of the data.
  7. Thou shalt not build reports; instead build apps – users should interact with visual elements, not text boxes or dropdown templates. Components should be reusable. Decouple data from the app (as D3 does on web pages).
  8. Thou shalt use intelligent tools – look for tools with search built in. Automate what you can.
  9. Thou shalt go beyond the basics – make functionality available to users.
  10. Thou shalt use Arcadia Data – a final pitch for his company.
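Commandment 6's sampling suggestion can be sketched in a few lines (the data and sample size here are invented): estimate an aggregate from a random sample instead of scanning every row.

```python
# Estimating an aggregate from a sample rather than a full scan.

import random

random.seed(7)
rows = list(range(1_000_000))  # stand-in for a large fact table

sample = random.sample(rows, 10_000)   # read 1% of the rows
estimate = sum(sample) / len(sample)
exact = sum(rows) / len(rows)

# The sampled mean lands within a few percent of the exact mean
# at a fraction of the I/O.
print(abs(estimate - exact) / exact < 0.05)  # True
```

The same trade-off underlies OLAP cubes and temp tables: pay a little accuracy or freshness to answer interactive queries quickly.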

Next, Stefan Groschupf @Datameer advocated data-driven decision making and talked about how business decisions should drive the decisions made at the bottom of the hardware/software stack.

Stefan emphasized that businesses often do not know the analytics they will need to run, so designing a database schema is not a first priority (he views a database schema as a competitive disadvantage). Instead he advocated starting at the hardware (which is the hardest to change later) and designing up the stack.

He contrasted the processing models for YARN and Mesosphere as alternative ways of looking at processing either by grouping processes into hardware or grouping hardware to handle a single process.

Despite the advances in computing power and software, he views data as getting exponentially more complex, thereby making the analyst’s job even harder. Interactive tools are the short-term answer, with deep learning eventually becoming more prominent.

The internet of things will offer even more challenges with more data and data sources and a wider variety of inputs.

Stefan recommended the movie Personal Gold as a demonstration of the power of analytics – Olympic cyclists used better analytics to overcome a lack of funding relative to their better-funded rivals.

In the third presentation, M.C. Srivas @MapR talked about the challenges and opportunities of handling large data sets from a hardware perspective.

MapR was born from the realization that companies such as Facebook, LinkedIn, etc. were all using Hadoop and building extremely large databases. These databases created hardware challenges, and the hardware needed to be designed with the flexibility to grow as the industry changed.

M.C. gave three examples to highlight the enormous size of big data

  1. Email. One of the largest email providers holds 1.5B accounts, with 200k new accounts created daily. Reliability is targeted at six nines of uptime. Recent emails – those from the past 3 weeks – are kept on flash storage; after that they are moved to hard disk. 95% of files are small, but the other 5% often include megabyte-sized images and documents.
  2. Personal records. 70% of Indians don’t have birth certificates – so these people are off the grid and as a result are open to abuse and financial exploitation. With the encouragement of the Indian government, MapR has built a biometrics-based system in India. They add 1.5 million people per day, so in 2.5 years they expect everyone in India to be on the system.

The system is used for filing tax returns, airport security, access to ATMs, etc. The system replaces visas and reduces corruption.

  3. Streams. Self-driving cars generate 1 terabyte of data per hour per car. Keeping this data requires massive onboard storage and edge processing along with the streamed data. Long-term storage for purposes such as accident reporting will require large amounts of storage.

He concluded by talking about MapR’s views on open source software. They use open source code and will create open source APIs to fill gaps that they see. They see good open source products, but advise considering whether an open source project is primarily controlled by a single company, to avoid dependency on that company.

The presentations were concluded by Nick Mehta @Gainsight talking about providing value to B2B customers, each of whom has slightly different needs.

As a service provider to customers, there are 5 challenges

  1. Garbage in, garbage out – “what is going on before the customer leaves” – schemas change and keys are different for different departments, which part of the company are you selling to, new products and other changes over time.
  2. All customers are different – buy different products, different use cases, different amounts spent, customization
  3. Small sample size – too few data points for what you actually want to study
  4. How do you explain what the model does? vs “what percent of the time is this right?”
  5. Ease of driving change – how do you get them to change their approach? There is a tendency for clients to look for examples where the machine is wrong rather than looking at the bigger picture.

In response to these challenges, he made 5 suggestions

  1. Data quality – focus on recent data
  2. Data variance – analyze by segment
  3. Data points (small data) – use leading metrics. Use intuition on possible drivers: consider when user engagement drops off as a warning sign that the user might exit
  4. Model your understanding – compare proposals versus random actions
  5. Drive change – track and automate actions
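The leading-metric idea in suggestion 3 might be sketched like this (the thresholds and data are invented, not Gainsight's actual model): flag customers whose recent engagement has collapsed relative to their own baseline.

```python
# Leading-metric churn warning: a drop in user engagement relative to
# the customer's own baseline flags the account before it leaves.

def churn_risk(weekly_logins, drop_threshold=0.5):
    """Flag a customer whose recent engagement fell by more than
    drop_threshold relative to their earlier baseline."""
    if len(weekly_logins) < 4:
        return False  # not enough history to form a baseline
    baseline = sum(weekly_logins[:-2]) / (len(weekly_logins) - 2)
    recent = sum(weekly_logins[-2:]) / 2
    return baseline > 0 and recent < baseline * (1 - drop_threshold)

print(churn_risk([10, 12, 11, 9, 3, 2]))    # True: engagement collapsed
print(churn_risk([10, 12, 11, 9, 10, 11]))  # False: steady usage
```

Comparing each customer to their own history sidesteps the "all customers are different" problem: the segment of one is its own baseline.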

Nick concluded by saying that they use a wide variety of tools depending on customer needs. Tools include Postgres, Mongo, Salesforce, Amazon Redshift, etc.


posted in:  Big data, data, data analysis, Data Driven NYC, databases    / leave comments:   No comments yet