Beyond Big: Merging Streaming & #Database Ops into a Next-Gen #BigData Platform
Posted on April 13th, 2017
04/13/2017 @Thoughtworks, 99 Madison Ave, New York, 15th floor
Amir Halfon, VP of Strategic Solutions @iguazio, talked about methods for speeding up analytics linked to a large database. He started by saying that the traditional software stack accessing a db was designed to minimize the time taken to access slow disk storage. This resulted in many layers of software. Amir said that with modern data access and db architecture, processing is accelerated by a unified data engine that eliminates many of these layers. This also allows for generic access to data stored in many different formats, and for a record-by-record security protocol.
To simplify development they only use AWS and only interface with Kafka, Hadoop, and Spark. They do not do virtualization (which eventually reaches a speed limit); they implement the actual store.
Another important method is “predicate pushdown” (as in ‘select … where <predicate>’): usually all data are retrieved and then culled; if instead the predicate is pushed down, only the relevant data are retrieved. This is also known as an “offload engine.”
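The difference can be sketched in a few lines of Python (a toy illustration, not iguazio's actual engine; the row layout and predicate are invented for the example):

```python
# Toy illustration of predicate pushdown. Without pushdown the client
# retrieves every record and filters locally; with pushdown the storage
# layer applies the predicate and ships only the matching rows.

ROWS = [{"id": i, "region": "US" if i % 2 else "EU"} for i in range(100_000)]

def scan_all():
    """Naive path: ship every row to the client, then filter there."""
    return list(ROWS)                           # "network" cost: 100,000 rows

def scan_pushdown(predicate):
    """Pushdown path: the store filters before shipping."""
    return [r for r in ROWS if predicate(r)]    # "network" cost: matches only

shipped = scan_all()
local_result = [r for r in shipped if r["region"] == "US"]
pushed_result = scan_pushdown(lambda r: r["region"] == "US")

assert local_result == pushed_result        # same answer,
print(len(shipped), len(pushed_result))     # far fewer rows shipped
```

Either path returns the same result set; the pushdown path simply moves the `where` clause to where the data lives.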
MapR is a competitor that uses the HDFS file system, as opposed to rebuilding the system from scratch.
#Self-learned relevancy with Apache Solr
Posted on March 31st, 2017
03/30/2017 @ Architizer, 1 Whitehall Street, New York, NY, 10th Floor
Trey Grainger @ Lucidworks covered a wide range of topics involving search.
He first reviewed the concept of an inverted index in which terms are extracted from documents and placed in an index which points back to the documents. This allows for fast searches of single terms or combinations of terms.
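The structure he described can be sketched in a few lines (a minimal toy index; the documents and whitespace tokenization are invented for illustration):

```python
# Minimal inverted index: each term maps to the set of document ids
# that contain it, pointing back from terms to documents.
from collections import defaultdict

docs = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "quick dog tricks",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():       # toy tokenizer: split on whitespace
        index[term].add(doc_id)

# A single-term search is one dictionary lookup; a combined (AND) search
# intersects the posting sets.
assert index["quick"] == {0, 2}
assert index["quick"] & index["dog"] == {2}
```

Real engines like Lucene/Solr store far richer postings (positions, frequencies, norms), but the lookup-and-intersect shape is the same.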
Next Trey covered classic relevancy scores emphasizing
tf-idf = (how well a term describes the document) × (how important the term is overall)
He noted, however, that tf-idf’s value may be limited since it does not make use of domain-specific knowledge.
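The formula above can be made concrete on a toy corpus (a sketch of the classic weighting, not Solr's exact scoring, which adds further normalization; the documents are invented):

```python
# Classic tf-idf: term frequency (how well the term describes this
# document) times inverse document frequency (how rare the term is
# across the whole corpus).
import math

docs = [
    "cat sat on the mat",
    "dog sat on the log",
    "cat chased the dog",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "the" appears in every document, so its idf (and tf-idf) is zero;
# "mat" is rare, so it scores higher in the one document containing it.
print(tf_idf("the", tokenized[0]))      # 0.0
print(tf_idf("mat", tokenized[0]) > 0)  # True
```

This also shows his caveat: the weights come purely from term statistics, with no notion of what the terms mean in the domain.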
Trey then talked about reflected intelligence = self-learning search, which uses
- Collaboration – how have others interacted with the system
- Context – information about the user
He said this method increases relevance by boosting items that are highly requested by others. Since the items boosted are those currently relevant to others, this allows the method to adapt quickly without need for manual curation of items.
Next he talked about semantic search, which uses an understanding of the terms in the domain.
(Solr can connect to an RDF database to leverage an ontology.) For instance, one can run word2vec to extract terms and phrases for a query and then determine a set of keywords/phrases that best match the query to the contents of the db.
Also, querying a semantic knowledge graph can expand the search by traversing to other relevant terms in the db.
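A hedged sketch of the expansion step: here a hand-written related-terms table stands in for what word2vec neighbors or a knowledge-graph traversal would supply (the terms themselves are invented for illustration):

```python
# Query expansion sketch: augment the user's terms with related terms,
# as a word2vec model or knowledge-graph traversal might supply them.
related = {
    "cpu": ["processor", "chip"],
    "laptop": ["notebook"],
}

def expand(query):
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(related.get(t, []))  # traverse to related terms
    return expanded

print(expand("cpu benchmarks"))
# ['cpu', 'benchmarks', 'processor', 'chip']
```

The expanded term list is then matched against the index, so documents mentioning "processor" can satisfy a query for "cpu".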
DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.
Posted on June 18th, 2016
06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)
The four speakers were
- Nitay Joffe, Founder and CTO of ActionIQ (next-generation data platform for marketing and consumer data)
- Adam Kanouse, CTO of Narrative Science (transforms data into meaningful and insightful narratives)
- Neha Narkhede, Founder and CTO of Confluent (real-time data platform built around Apache Kafka)
- Christopher Nguyen, Founder and CEO of Arimo (data intelligence platform)
Adam @NarrativeScience talked about how people with different personalities and jobs may require/prefer different takes on the same data. His firm ingests data and has systems to generate natural language reports customized to the subject area and the reader’s needs.
They currently develop stories with the guidance of experts, but eventually will move to machine learning to automate new subject areas.
Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform which collects data and allows access to these data in real time.
Evolving from #RDBMS to #NoSQL + #SQL
Posted on May 3rd, 2016
05/03/2016 @Thoughtworks, 99 Madison Ave, 15th floor, NY
Jim Scott @MapR spoke about #ApacheDrill, which has a query language that extends ANSI SQL. Drill provides an interface that uses this SQL extension to access data in underlying sources that are SQL, NoSQL, CSV, etc.
The OJAI API has the following advantages
- Gson (in #Java) needs only two lines of code to serialize #JSON into the store, and one line to deserialize
- Idempotent – so one doesn’t need to worry about replaying actions twice if there is an issue
- Drill requires Java, but not Hadoop, so it can run on a desktop
- Schema on the fly – will take different data formats and join them together: e.g. csv + JSON
- Data are accessed directly from the underlying databases without needing to first transform them into a metastore
- Security – plugs into the authentication mechanisms of the underlying dbs. Mechanisms can go through multiple chains of ownership. Security can be applied at the row level and column level.
- Commands extend SQL to allow access to lists in a JSON structure
- Can create views to output to Parquet, CSV, and JSON formats
- FLATTEN – explode an array in a JSON structure to display as multiple rows with all other fields duplicated
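The FLATTEN behavior described above can be mimicked in plain Python (a sketch of the semantics only, not Drill's implementation; the record is invented for illustration):

```python
# Drill's FLATTEN explodes a JSON array into one row per element,
# duplicating the record's other fields on every row.
record = {"name": "alice", "phones": ["555-0100", "555-0199"]}

def flatten(rec, array_field):
    others = {k: v for k, v in rec.items() if k != array_field}
    return [{**others, array_field: item} for item in rec[array_field]]

rows = flatten(record, "phones")
# [{'name': 'alice', 'phones': '555-0100'},
#  {'name': 'alice', 'phones': '555-0199'}]
```

One input record with a two-element array becomes two rows, each carrying the duplicated `name` field.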
#NoSQL Databases & #Docker #Containers: From Development to Deployment
Posted on April 26th, 2016
04/26/2016 @ThoughtWorks 99 Madison Ave., 15th Floor, New York, NY
Alvin Richards, VP of Product, @Aerospike spoke about employing Aerospike in Docker containers.
He started by saying that database performance demands, including caches and data lakes, have made deployment complex and inefficient. Containers were developed to simplify deployment. They are similar to virtual machines, but describe the OS, programs, and environmental dependencies in a standard-format file. Components are
- Docker file with names + directory + processes to run to setup. OCI is the open container standard.
- Docker Compose orchestrates containers
- Docker Swarm orchestrates clustering of machines
- Docker Machine provisions machines.
Containers share root images (such as the Python image file).
Aerospike is a key value store which is built on the bare hardware (does not call the OS) for speed. It also automates data replication across nodes.
When Aerospike is run in Docker containers
- All nodes perform the same function – automated replication.
- The nodes self discover other nodes to balance the load & replication
- The application needs to understand the topology as it changes
In development, the data are often kept in the container since one usually wants to delete the development data when the development server is decommissioned. However, production servers usually don’t hold the data since these servers may be brought up and down, but the data is always retained.
#PostgreSQL conf 2016: #InternetOfThings in industry
Posted on April 19th, 2016
#PGConfus2016 #pgconfus #postgresql
Parag Goradia @GE talked about the transformation of industrial companies into software and analytics companies. He talked about three ingredients of this transformation
- Brilliant machines – reduce downtime, make more efficient = asset performance management
- Industrial big data
- People & work – the field force and service engineers
Unlike the non-industrial internet of things, sensors are already in place within machinery such as jet engines. However, until recently, only summary data were uploaded into the cloud. Industrial users are moving toward storing a much fuller data set. For instance, a single airline flight may yield 500 GB of data.
The advantages of improved performance and maintenance are huge: he estimates over $1 trillion per year in savings. With this data collection goal, GE has created Predix, a cloud store using Postgres. Key features are
- Automated health checks/backups
- 1 touch scalability
- No unscheduled downtime
- Protected data sets
Parag talked about selected applications of this technology
- Intelligent cities – adding sensors to existing infrastructure such as street lights
- Health cloud advanced visualization – putting scans on the cloud for 3-d visualization and analytics
#PostgreSQL conf 2016: #SQL vs #noSQL
Posted on April 18th, 2016
04/18/2016 @ New York Marriott Brooklyn Bridge
The afternoon panel was composed of vendors providing both SQL and noSQL database access. The discussion emphasized that the use of a #SQL vs #noSQL database is primarily driven by
- The level of comfort developers/managers have with SQL or noSQL
- Whether the discipline of SQL rows and fields is useful in creating applications
- Whether applications use JSON structures which are easily saved in a noSQL database
- Linking to existing applications can be done either using a SQL database or using an ORM to a noSQL database.
The availability of ORM (object-relational mapping) software blurs the lines between SQL and noSQL databases. However, one is advised to avoid using an ORM when first using a noSQL database, so one can gain familiarity with the differences between SQL and noSQL.
Both db types need planning to avoid problems, and the right choices depend on the situation. For instance, sharding might best be done late when there is only a single application being developed. However, if one has many applications using the same infrastructure, one should consider specifying a sharding policy early.
People want to avoid complexity, but don’t want to delegate setup to a standard default or procedure.
Controlling access can be done (even if there is no data in the object) by creating views which are accessible only by some users.
Geolocation data are best handled by specialized db’s like CartoDB: the coordinate system is not rectangular, and the data can be handled by sampling and aggregation.
#PostgreSQL conf 2016: #DataSecurity
Posted on April 18th, 2016
04/18/2016 @ New York Marriott Brooklyn Bridge
Several morning presentations concentrated on data security. Secure data sharing has several conflicting goals including
- Encourage data sharing
- Restricting who can see the data and who can pass on the rights to see the data
- Integrity of the original data including verification of the sender/originator
- Making protection transparent to users
Traditional approaches have concentrated on encryption that cannot be broken by outsiders and schemes so users can validate content and ensure confidential data transmission.
The following additional approaches/ideas were discussed
- Create a system where the data are self-protecting
- Enforce encryption at the object or attribute level?
- No security system will be watertight (except no access), so one needs to understand the major threats, the best solutions, and the tradeoffs. Who can you trust?
- Can control be extended beyond the context of the system?
- Hardware Security Modules may offer more security, but the raw key cannot be exported and the module could still be attacked through the OS
- The use of point-to-point security systems so data is only available when needed.
- Point-to-point key management systems offer the possibility of understanding the usage patterns to identify security breaches
The problem is not software, it is the requirements. One wants to trust the user as much as possible and not rely on the service provider. However, in big organizations, one needs a recovery key since there is a premium on recovering lost data.
DataDrivenNYC: #OpenSource business models, #databases, man-machine collaboration, text markup
Posted on March 16th, 2016
03/18/2016 @AXA Equitable Center, 787 7th ave, NY
The two speakers and two interviewees were
- Eliot Horowitz, Founder and CTO of MongoDB (leading NoSQL database)
- Peter Fenton, General Partner at Benchmark (premier Silicon Valley vc firm; #2 on the Midas List)
- Kieran Snyder, Founder and CEO of Textio (AI-powered platform that predicts how text will perform before it’s published)
- Eric Colson, Chief Algorithms Officer at Stitch Fix (curated personal styling e-commerce platform)
Eric Colson @StitchFix (provides fashion to you, but you don’t pick anything) spoke about their hybrid system of selecting fashion for customers. In their business
- You create a profile
- You get five hand-picked items
- You keep what you like, send back the rest
This means that they need to get it right the first time. Computers do preference modeling better than humans, and humans can better understand the idiosyncrasies of other humans, so StitchFix uses both.
They first send the customer’s request for clothes to a recommender system, then give the recommender’s output to a human who checks and customizes the offering.
Next Matt Turck interviewed Eliot Horowitz @MongoDB. Mongo was started in 2007 as the founders struggled with the database needs for their new venture. Eventually they decided that the db was the most interesting part of the platform, so they made it open source, spreading it to friends, friends of friends, etc.
It was initially implemented using a simple storage engine. But as the user base grew, they needed a more efficient storage engine, so in summer 2014 they acquired WiredTiger (developers formerly from Berkeley DB) to create the new storage engine. They released it in March 2015 as version 3.0.
In Dec 2015, they released version 3.2 which added encryption, an improved management deployment tool and a join operator.
In Dec 2015 they also started to sell packages such as a BI connector (e.g. connect to Tableau) and Compass (graphical viewer of the data structure).
In 2016 they will add a graph algorithm in version 3.4.
Eliot then talked about ways to monetize open source
- Consulting and support
- Tools – e.g. the BI connector or Compass – developers are not interested, but businesses are
- Cloud services – manage DB, backups, upgrades, etc.
He also mentioned that the open-source license matters (they release under the AGPL license), so Amazon can’t resell Mongo as a supported service.
He said that their priorities are
- Make current users successful
- Make it easy for people to migrate to Mongo
- Develop products for the cloud, which is the future of databases
Next, Kieran Snyder @textio talked about how their product helps users create better recruiting notices. She noted that the effectiveness of staff searches often depends on the effectiveness of emails and broadcast job descriptions. Their tool highlights words that either help or hurt the description.
They tagged 15 million job listings according to how many people applied, how long it took to fill, what kinds of people applied, demographic mix, etc. They looked for patterns in the text and developed a scoring system within a word processor that evaluates the text and marks it up
- Score 0-100: in this market, with a similar job title, how fast will this role be filled?
- Green = phrases that drive success up
- Red = phrases that drive success down – e.g. an overused term
- Look at structural features – e.g. formatting (e.g. best if 1/3 of the content is bulleted), or the percent of “you” vs “we” language
- Model gender tone – blue vs purple highlighting
- Look for boilerplate
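One of the structural signals above, the balance of "you" vs "we" language, can be sketched as a simple ratio (a hypothetical scorer for illustration, not Textio's model; the word lists and sample text are invented):

```python
# Hypothetical "you vs we" scorer: the fraction of second-person pronouns
# among all first/second-person pronouns in a job listing.
import re

def you_we_ratio(text):
    words = re.findall(r"[a-z']+", text.lower())
    you = sum(w in ("you", "your", "yours") for w in words)
    we = sum(w in ("we", "our", "ours", "us") for w in words)
    total = you + we
    return you / total if total else None   # None: no pronoun signal

ad = "You will own your roadmap. We offer great benefits for our team."
print(round(you_we_ratio(ad), 2))  # 0.5
```

A real system would fold many such features (bullet density, boilerplate, gendered tone) into one 0-100 score; this shows just one feature's shape.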
Textio has also been used to detect gender bias in other texts.
The best feedback is from their current clients.
They retrain the model every week since language evolves over time: phrases like ‘big data’ may be positive initially and eventually become negative as everyone uses the term.
Lastly, Matt Turck interviewed Peter Fenton @Benchmark, a west coast venture firm that has backed many winners.
In a wide ranging interview, Peter made many points including
- He looks for the entrepreneur to have a “dream about the possible world”
- This requires the investor to naively suppress the many reasons why it won’t work
- He looks for certain personality attributes in the entrepreneur
- “Deeply authentic” with “reckless passion”
- Charismatic with an idea that destroys the linear mindset
- In the product he looks for
- Identify attributes that are radically differentiated
- Product fits the market and the founder fits the market
- Is the structure conducive to radical growth
- Open source as an investment has two business models
- Packaging model – e.g. RedHat
- Open core – needs
- Product value
- A big enough market to gain platform status. If you want mass adoption, go as long as possible without considering monetization.
- Business models
- What is the sequence in your business – do you need mass adoption first
- Once mass adoption is achieved you need to protect it.
- The biggest issue is how to deal with Amazon, who can take open source and offer it on their platform
- Need to license in the right way.
- Amazon just needs to be good enough
- To succeed in open source you need a direct relationship with developers to protect your business model
- Azure has some advantages
- New Nvidia GPUs, so they have a performance advantage
- Machine learning + big data = can create a new experience, but
- In the NBA there is no longer an advantage in the analytics – since everyone hires the same people. Data is the advantage.
- If you have a unique pool of information, you can add analytics.
- Capital is harder to get now.
- One of the main issues is that some businesses are so well capitalized that they currently don’t need to consider their burn rate. This mispricing forces all competitors into the same model of hypergrowth at the expense of sustainable growth and infrastructure improvement.
- Unlike 1999-2000, the bubble is driven by institutional money. Therefore it will take longer for the bubble to burst since institutions can hide the losses for longer (they do not mark investments to the equity market prices).
- When valuations eventually drop, many private companies will not go public, but will be acquired by larger companies.
- It takes years for an opportunity to be exploited: it took 5 years to go from a camera phone to Instagram. It took years to go from GPS to Uber. It is unclear whether blockchain is in gestation and what it will give rise to (bitcoin is the best application now)
DataDrivenNYC: #Hadoop, #BigData, #B2B
Posted on December 15th, 2015
12/14/2015 @ Bloomberg, 731 Lexington Ave, NY
Four speakers talked about businesses that provide support for companies exploring large data sets.
M.C. Srivas, Co-Founder and CTO of MapR (Apache Hadoop distribution)
Nick Mehta, CEO of Gainsight (enterprise software for customer success)
Shant Hovsepian, Founder and CTO of Arcadia Data (visual analytics and business intelligence platform for Big Data)
Stefan Groschupf, Founder and CEO of Datameer (Big Data analytics application for Hadoop)
In the first presentation, Shant Hovsepian, CTO @ArcadiaData, spoke about how to most effectively use big data to answer business questions. He summarized his points in 10 commandments for BI on big data
- Thou shalt not move big data – push computation close to data. Yarn, Mesos have made it possible to run a BI server next to the data. Use Mongo to create documents directly. Use native analysis engines.
- Thou shalt not steal or violate corporate security policies – use (RBAC) role based access control. Look for unified security models. Make sure there is an audit trail.
- Thou shalt not pay for every user or gigabyte – you need to plan for scalability, so be wary of pricing models that penalize you for increased adoption. Your data will grow quicker than you anticipate.
- Thou shalt covet thy neighbor’s visualization – publish pdf and png files. Collaboration is needed since no single person understands the entire data set.
- Thou shalt analyze thine data in its natural form – read free-form data: FIX format, JSON, tables, …
- Thou shalt not wait endlessly for thine results – understand and analyze your data quickly. Methods include building an OLAP cube, creating temp tables, and taking samples of the data.
- Thou shalt not build reports, instead build apps – users should interact with visual elements, not text boxes or dropdown templates. Components should be reusable. Decouple data from the app (as D3 does on web pages).
- Thou shalt use intelligent tools – look for tools with search built in. Automate what you can.
- Thou shalt go beyond the basics – make functionality available to users
- Thou shalt use Arcadia Data – a final pitch for his company.
Next, Stefan Groschupf @Datameer advocated data-driven decision making and talked about how business decisions should drive the decisions made at the bottom of the hardware/software stack.
Stefan emphasized that businesses often do not know the analytics they will need to run, so designing a database schema is not a first priority (he views a database schema as a competitive disadvantage). Instead he advocated starting at the hardware (which is the hardest to change later) and designing up the stack.
He contrasted the processing models for YARN and Mesosphere as alternative ways of looking at processing either by grouping processes into hardware or grouping hardware to handle a single process.
Despite the advances in computing power and software, he views data as getting exponentially more complex, thereby making the analyst’s job even harder. Interactive tools are the short-term answer, with deep learning eventually becoming more prominent.
The internet of things will offer even more challenges with more data and data sources and a wider variety of inputs.
Stefan recommended the movie Personal Gold as a demonstration of the power of analytics – Olympic cyclists use better analytics to overcome a lack of funding versus their better funded rivals.
In the third presentation, M.C. Srivas @MapR talked about the challenges and opportunities of handing large data sets from a hardware perspective.
MapR was born from a realization that companies such as Facebook, LinkedIn, etc. were all using Hadoop and building extremely large databases. These databases created hardware challenges, and the hardware needed to be designed with the flexibility to grow as the industry changed.
M.C. gave three examples to highlight the enormous size of big data
- Email. One of the largest email providers holds 1.5B accounts with 200k new accounts created daily. Reliability is targeted at six-nines uptime. Recent emails, those from the past 3 weeks, are put on flash storage; after that they are moved to hard disk. 95% of files are small, but the other 5% often include megabyte-sized images and documents.
- Personal Records. 70% of Indians don’t have birth certificates – so these people are off the grid and as a result are open to abuse and financial exploitation. With the encouragement of the Indian government, MapR has built a system in India based on biometrics. They add 1.5 million people per day, so in 2.5 years they expect everyone in India will be on the system.
The system is used for filing tax returns, airport security, access to ATMs, etc. The system replaces visas and reduces corruption.
- Streams. Self-driving cars generate 1 terabyte of data per hour per car. Keeping this data requires massive onboard storage and edge processing along with streamed data. Long-term storage for purposes of accident reporting etc. will require large amounts of storage.
He concluded by talking about MapR’s views on open source software. They use open source code and will create open source APIs to fill gaps that they see. They see good open source products, but advise considering whether an open source project is primarily controlled by a single company, to avoid dependency on that company.
The presentations were concluded by Nick Mehta @Gainsight talking about providing value to B2B customers, each of whom has slightly different needs.
As a service provider to customers, there are 5 challenges
- Garbage in, garbage out – “what is going on before the customer leaves” – schemas change and keys are different for different departments, which part of the company are you selling to, new products and other changes over time.
- All customers are different – buy different products, different use cases, different amounts spent, customization
- Small sample size – too few data points for what you actually want to study
- How do you explain what the model does? vs “what percent of the time is this right?”
- Ease of driving change? – How do you get them to change their approach? There is a tendency for clients to look for the examples where the machine is wrong rather than looking at the bigger picture.
In response to these challenges, he made 5 suggestions
- Data quality – focus on recent data
- Data variance – analyze by segment
- Data points (small data) – use leading metrics. Use intuition on possible drivers: consider when user engagement drops off as a warning sign that the user might exit
- Model your understanding – compare proposals against random actions
- Drive change – track and automate actions
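The "leading metrics" suggestion above (user engagement dropping off as a warning sign of churn) can be sketched as a simple rule (a hypothetical heuristic for illustration, not Gainsight's model; the thresholds and data are invented):

```python
# Hypothetical churn-warning rule: flag an account when its recent
# engagement falls well below its own trailing baseline.
def churn_warning(weekly_logins, recent_weeks=2, drop=0.5):
    baseline = weekly_logins[:-recent_weeks]
    recent = weekly_logins[-recent_weeks:]
    avg_base = sum(baseline) / len(baseline)
    avg_recent = sum(recent) / len(recent)
    return avg_recent < drop * avg_base     # True = raise a warning

print(churn_warning([10, 12, 11, 9, 3, 2]))    # True: usage fell sharply
print(churn_warning([10, 12, 11, 9, 10, 11]))  # False: usage is steady
```

The point of a leading metric is exactly this shape: it fires before the customer actually leaves, when there is still time to act.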
Nick concluded by saying that they use a wide variety of tools depending on the customer’s needs. Tools include Postgres, Mongo, Salesforce, Amazon Redshift, etc.