DataDrivenNYC: #BigData and field-specific knowledge
Posted on January 14th, 2015
Five presenters talked about a variety of topics, with an emphasis on how large data sets require #DataScientists using big data techniques to work in conjunction with experts in the specific field of study.
In the first presentation, Shankar Mishra spoke about the steps used by @TheLadders to efficiently match candidates with job openings. Candidate resumes are split into:
- Profile – who I am
- Preferences & Activities – what I want
These characteristics are matched against job descriptions (such as salary and title) to compute the distance between candidates and each job opening. He showed how candidate self-descriptions and job openings change over time. For instance, the number of job descriptions containing the word “subprime” peaked in 2007, while the number of resumes containing the same word peaked in 2008.
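The talk did not cover implementation details, so here is a minimal, hypothetical sketch of one way to compute such a candidate–job distance: a weighted, normalized difference over shared numeric fields (the field names, weights, and normalization are all assumptions, not TheLadders' actual method).

```python
# Hypothetical candidate-job matching sketch; TheLadders' real algorithm
# and feature set were not described in the talk.

def match_distance(candidate, job, weights=None):
    """Weighted distance between a candidate's preferences and a job posting.

    Each numeric field present in both records contributes a normalized
    absolute difference; lower scores mean a closer match.
    """
    weights = weights or {"salary": 1.0, "seniority": 1.0}  # illustrative
    total, norm = 0.0, 0.0
    for field, w in weights.items():
        if field in candidate and field in job:
            # Normalize by the larger magnitude so fields on different
            # scales (dollars vs. years) are comparable.
            scale = max(abs(candidate[field]), abs(job[field]), 1)
            total += w * abs(candidate[field] - job[field]) / scale
            norm += w
    return total / norm if norm else float("inf")

candidate = {"salary": 90_000, "seniority": 3}
jobs = [
    {"title": "Analyst", "salary": 60_000, "seniority": 2},
    {"title": "Senior Analyst", "salary": 95_000, "seniority": 3},
]
# Rank openings by distance; the closest job is the best match.
best = min(jobs, key=lambda j: match_distance(candidate, j))
```

In practice a system like this would also fold in text similarity between resume and job description, which is presumably how the “subprime” keyword trends above were surfaced.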
Michael Karasick @IBMWatson talked about some of the initiatives taken by IBM to commercialize the technology behind its Jeopardy-conquering Watson engine. These include:
- January 2014 – created a separate entity to market Watson
- Created a $100M investment fund – companies can apply to enroll in the Watson Ecosystem
- Made three minority investments in startups
- Sponsored curriculums (and contests) at 10 schools to use the tool
The Watson product builds on IBM’s 40 years of work on Natural Language Processing and machine learning. NLP is emphasized since IBM has experience in machine translation, the Turing test, and open-domain Q&A in real time. Also, most of the world’s data is unstructured.
Watson/Jeopardy has been expanded from lookup of factoids to a general linguistic pipeline for other types of Q&A. Applications so far have been concentrated in a limited number of fields, such as life sciences, homeland security, oncology, finance, and recipe invention. In many of these fields, the Q&A process has been expanded into discovery methods.
Cancer research applications include:
- MD Anderson in Texas – clinical support for cancer patients. Patient records are fed in and matched against the literature and trained cases.
- Sloan Kettering – cases, both synthetic and published, are examined to identify supporting evidence. Here, explanations are returned along with search results, since people want to know why the system drilled down into the particular evidence reported.
Alexander Gillet @HowGoodRatings spoke about his company’s efforts to rate the sustainability of food products on grocery shelves. Ratings are based on 60 factors including environmental and social impacts of creating, processing and delivering particular food brands to the store. At several target supermarkets, ratings of up to three “stars” are placed on the tags below the item shelf.
By examining loyalty card data, they showed that a great rating tends to increase dollar sales. He did not go into whether the increased revenue resulted from a higher volume of product sold or from higher prices charged by retailers for a product that consumers would consider premium.
Hanna Wallach @MsftResearch and UMass Amherst spoke about her research in computational social science, which applies computational methods to the study of social processes. She said that people are concerned about their privacy when big data is analyzed, since it is based on human behavior recorded at a very granular level. Her concerns were true of social research even before the wide availability of big data, but they bear emphasizing as more researchers and companies adopt these approaches.
She split the research process into four parts: data, questions, methods, and findings. Within each part she emphasized the need to have social scientists involved in the analysis alongside experts in computer science and math/statistics. All of these disciplines are needed in good research to avoid an emphasis on narratives and intuitions that may not be backed by previous research.
She also talked about the random nature of the behavior sampled and recommended presenting the uncertainty of the conclusions (such as confidence intervals) as well as the conclusions themselves.
Finally, Chris Wiggins, chief data scientist at NYTimes and @Columbia U. talked about his experience bringing new data science methods into a journalistic environment. He talked about needing to respect the craft of journalism and that data scientists need to be good listeners. He mentioned the challenges of bringing DS methods to the newsroom as highlighted by the Buzzfeed release of an internal evaluation of the NY Times and the friction due to the change in ownership of the New Republic.
He also talked about the many ways that people think about the concept of ‘computer literacy’ and how it boils down to how one develops a critical understanding of the limits and beneficial uses of computers in specific contexts.