#Self-learned relevancy with Apache Solr
Posted on March 31st, 2017
03/30/2017 @ Architizer, 1 Whitehall Street, New York, NY, 10th Floor
Trey Grainger @ Lucidworks covered a wide range of topics involving search.
He first reviewed the concept of an inverted index in which terms are extracted from documents and placed in an index which points back to the documents. This allows for fast searches of single terms or combinations of terms.
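As a rough illustration of the inverted index idea, the sketch below maps each term to the set of documents containing it and answers multi-term queries by intersecting those sets. The function names and sample documents are made up for this example; Solr/Lucene's real index stores much more (positions, frequencies, compressed postings).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "solr is a search server",
    2: "an inverted index makes search fast",
    3: "solr builds an inverted index",
}
index = build_inverted_index(docs)
print(sorted(search(index, "inverted index")))  # [2, 3]
print(sorted(search(index, "solr search")))     # [1]
```

Each lookup touches only the postings for the query terms, which is why single-term and combined-term searches stay fast as the collection grows.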
Next, Trey covered classic relevancy scores, emphasizing
tf-idf = how well a term describes the document * how important the term is overall
He noted, however, that the value of tf-idf may be limited since it does not make use of domain-specific knowledge.
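A minimal sketch of the tf-idf product, assuming the textbook definitions (term frequency within the document, times the log of the inverse document frequency). Lucene/Solr use their own variants of this formula, so treat the numbers as illustrative only; the tiny corpus is invented.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score a term for one document: term frequency times inverse document frequency."""
    # tf: how well the term describes this document
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # idf: how informative the term is across the whole corpus
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


# Hypothetical three-document corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
doc = corpus[0]
print(tf_idf("the", doc, corpus))  # 0.0, since "the" appears in every document
print(tf_idf("mat", doc, corpus))  # positive: "mat" is rare across the corpus
```

Note that nothing in the score knows what the terms mean, which is the domain-knowledge limitation mentioned above.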
Trey then talked about reflected intelligence = self–learning search which uses
- Collaboration – how have others interacted with the system
- Context – information about the user
He said this method increases relevance by boosting items that are highly requested by others. Since the items boosted are those currently relevant to others, this allows the method to adapt quickly without need for manual curation of items.
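One way to picture this kind of boosting is sketched below: the base relevancy score of each document is multiplied by a factor that grows with the share of past clicks it received for the same query. The click log format, the `weight` parameter, and the scoring formula are assumptions for illustration, not the method presented in the talk or a Solr API.

```python
from collections import Counter


def boosted_ranking(base_scores, click_log, query, weight=0.5):
    """Re-rank documents by boosting those that others clicked for this query."""
    clicks = Counter(doc for q, doc in click_log if q == query)
    total = sum(clicks.values())
    boosted = {}
    for doc, score in base_scores.items():
        # Share of all clicks for this query that went to this document.
        share = clicks[doc] / total if total else 0.0
        boosted[doc] = score * (1.0 + weight * share)
    return sorted(boosted, key=boosted.get, reverse=True)


# Hypothetical base scores and (query, clicked document) signals.
base_scores = {"a": 1.0, "b": 0.9, "c": 0.5}
click_log = [("ipad", "b"), ("ipad", "b"), ("ipad", "a"), ("kindle", "c")]

print(boosted_ranking(base_scores, [], "ipad"))         # ['a', 'b', 'c']
print(boosted_ranking(base_scores, click_log, "ipad"))  # ['b', 'a', 'c']
```

Because the boost is recomputed from the current click log, the ranking shifts as user behavior shifts, with no manual curation.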
Next he talked about semantic search, which uses an understanding of terms in the domain.
(Solr can connect to an RDF database to leverage an ontology). For instance, one can run word2vec to extract terms and phrases for a query and then determine a set of keywords/phrases to best match the query to the contents of the db.
Also, querying a semantic knowledge graph can expand the search by traversing to other relevant terms in the db.
NYAI#5: Neural Nets (Jason Yosinski) & #ML For Production (Ken Sanford)
Posted on August 24th, 2016
08/24/2016 @Rise, 43 West 23rd Street, NY, 2nd floor
Jason Yosinski @GeometricTechnology spoke about his work on #NeuralNets to generate pictures. He started by talking about machine learning with feedback to train a robot to move more quickly and using feedback to computer-generate pictures that are appealing to humans.
Jason next talked about AlexNet, based on work by Krizhevsky et al. 2012, which classifies images using a neural net with 5 convolutional layers (interleaved with max pooling and contrast layers) plus 3 fully connected layers at the end. The net, with 60 million parameters, was trained on ImageNet, which contains over 1 million images. His image classification code is available on http://Yosinski.com.
Jason talked about how the classifier thinks about categories when it is not being trained to identify that category. For instance, the network may learn about faces even though there is no human category since it helps the system detect things such as hats (above a face) to give it context. It also identifies text to give it context on other shapes it is trying to identify.
He next talked about generating images by inputting random noise and randomly changing pixels. Some changes will cause the confidence in the goal category (such as ‘lion’) to increase. Over many random moves, the confidence level for the goal increases. Jason showed many random images that elicited high levels of confidence, but the images often looked like purple-green slime. This is probably because the network, while learning, immediately discards the overall color of the image and is therefore insensitive to aberrations from normal colors. (See Erhan et al 2009)
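The random-search procedure can be sketched as simple hill climbing: start from noise, change one pixel at random, and keep the change only if the confidence rises. In the sketch below the `confidence` function is a toy stand-in (similarity to a fixed binary pattern), not a trained network, and the 64-pixel template is hypothetical.

```python
import random


def confidence(image, template):
    """Toy stand-in for a classifier score: fraction of pixels matching a template."""
    matches = sum(1 for a, b in zip(image, template) if a == b)
    return matches / len(template)


def hill_climb(template, steps=2000, seed=0):
    """Start from random noise; keep a random pixel change only if confidence rises."""
    rng = random.Random(seed)
    image = [rng.randint(0, 1) for _ in template]
    best = confidence(image, template)
    for _ in range(steps):
        i = rng.randrange(len(image))
        old = image[i]
        image[i] = 1 - old  # flip one randomly chosen pixel
        score = confidence(image, template)
        if score > best:
            best = score    # keep the change
        else:
            image[i] = old  # revert changes that do not help
    return image, best


template = [1, 0] * 32  # hypothetical 64-pixel target pattern
image, score = hill_climb(template)
print(score)
```

With a real network the score only measures what the classifier responds to, so the search is free to drift toward images that score highly without looking natural, as in the slime-like examples.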
[This also raises the question of how computer vision is different from human vision. If presented with a blue colored lion, the first reaction of a human might be to note how the color mismatches objects in the ‘lion’ category. One experiment would be to present the computer model with the picture of a blue lion and see how it is classified. Unlike computers, humans encode information beyond their list of items they have learned and this encoding includes extraneous information such as color or location. Maybe the difference is that humans incorporate a semantic layer that considers not only the category of the items, but other characteristics that define ‘lion-ness’. Color may be more central to human image processing as it has been conjectured that we have color vision so we can distinguish between ripe and rotten fruits. Our vision also taps into our expectation to see certain objects within the world and we are primed to see those objects in specific contexts, so we have contextual information beyond what is available to the computer when classifying images.]
To improve the generated pictures of ‘lions’, he next used a generator to create pictures and change them until it yields a picture which has high confidence of being a ‘lion’. The generator is designed to create identifiable images. The generator can even produce pictures of objects that it has not been trained to paint. (Need to apply regularization to get better pictures for the target.)
Slides at http://s.yosinski.com/nyai.pdf
In the second talk, Ken Sanford @Ekenomics and H2O.ai talked about the H2O open source project. H2O is a machine learning engine that can run in R, Python, Java, etc.
Ken emphasized how H2O (a multilayer feed-forward neural network) provides a platform that uses the Java Score Code engine. This eases the transition from the model developed in training to the model used to score inputs in a production environment.
He also talked about the Deep Water project which aims to allow other open source tools, such as MXNET, Caffe, Tensorflow,… (CNN, RNN, … models) to run in the H2O environment.
Automatically scalable #Python & #Neuroscience as it relates to #MachineLearning
Posted on June 28th, 2016
06/28/2016 @Rise, 43 West 23rd Street, NY, 2nd floor
Braxton McKee (@braxtonmckee ) @Ufora first spoke about the challenges of creating a version of Python (#Pyfora) that naturally scales to take advantage of the hardware to handle parallelism as the problem grows.
Braxton presented an example in which we compute the minimum distance from each target point to a larger universe of points, based on their Cartesian coordinates. This is easily written for small problems, but the computation needs to be optimized when computing this value across many CPUs.
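For a small problem, the computation might look like the plain Python below. The chunked variant only hints at how the work could be divided among workers, since each chunk of targets is independent; it is an illustrative assumption, not Pyfora's scheduling logic, and the sample points are invented.

```python
import math


def min_distances(targets, universe):
    """For each target point, return the distance to its nearest point in the universe."""
    return [
        min(math.dist(t, p) for p in universe)
        for t in targets
    ]


def min_distances_chunked(targets, universe, chunk_size):
    """Same result, computed chunk by chunk; each chunk could go to a separate worker."""
    results = []
    for start in range(0, len(targets), chunk_size):
        chunk = targets[start:start + chunk_size]
        results.extend(min_distances(chunk, universe))
    return results


# Hypothetical 2-D points.
targets = [(0, 0), (5, 5)]
universe = [(3, 4), (6, 5), (0, 1)]
print(min_distances(targets, universe))             # [1.0, 1.0]
print(min_distances_chunked(targets, universe, 1))  # [1.0, 1.0]
```

Whether it is better to split the targets or the universe depends on their relative sizes, which is the allocation question raised next.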
However, the allocation across cpu’s depends on the number of targets relative to the size of the point universe. Instead of trying to solve this analytically, they use a #Dynamicrebalancing strategy that splits the task and adds resources to the subtasks creating bottlenecks.
This approach solves many resource allocation problems, but still faces challenges:
- nested parallelism. They look for parallelism within the code and look for bottlenecks at the top level of parallelism and split the task into subtasks at that level, …
- the data do not fit in memory. They break tasks into smaller tasks. They also have each task know which other caches hold data, so they can be accessed directly without going to slower main memory
- different types of architectures (such as gpu’s) require different types of optimization
- the optimizer cannot look inside python packages, so cannot optimize a bottleneck within a package.
Pyfora
- is a just-in-time compiler that moves stack frames from machine to machine and senses how to take advantage of parallelism
- tracks what data a thread is using
- dynamically schedules threads and data
- takes advantage of immutability, which allows the compiler to assume that functions do not change over time, so the compiler can look inside the function when optimizing execution
- is written on top of another language which allows for the possibility of porting the method to other languages
In the second presentation, Jeremy Freeman @Janelia.org spoke about the relationship between neuroscience research and machine learning models. He first talked about early work on understanding the function of the visual cortex.
Findings by Hubel & Wiesel in 1959 have set the foundation for visual processing models for the past 40 years. They found that individual neurons in the V1 area of the visual cortex responded to the orientation of lines in the visual field. These inputs fed neurons that detect more complex features, such as edges, moving lines, etc.
Others also considered systems which have higher level recognition and how to train a system. These include
- Perceptrons by Rosenblatt, 1957
- Neocognitrons by Fukushima, 1980
- Hierarchical learning machines by LeCun, 1985
- Back propagation by Rumelhart, 1986
Jeremy’s doctoral research looked at the activity of neurons in the V2 area. They found they could generate high-order patterns that some neurons discriminate among.
But in 2012, there was a jump in the performance of neural nets (U. of Toronto).
By 2014, some of the neural network algorithms performed better than humans and primates, especially in the area of image processing. This has led to many advances such as Google DeepDream, which combines images and texture to create an artistic hybrid image.
Recent scientific research allows one to look at thousands of neurons simultaneously. He also talked about some of his current research which uses “tactile virtual reality” to examine the neural activity as a mouse explores a maze (the mouse walks on a ball that senses its steps as it learns the maze).
Jeremy also spoke about model-free episodic control for complex sequential tasks requiring memory and learning. ML research has created models such as LSTMs and Neural Turing Machines which retain state representations. Graham Taylor has looked at neural feedback modulation using gates.
He also noted that there are similar functionalities among the V1 area in the visual cortex, the A1 auditory area, and the S1 tactile area.
To find out more, he suggested visiting his GitHub site, Freeman-lab, and looking at the website neurofinder.codeneuro.org.
DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.
Posted on June 18th, 2016
06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)
The four speakers were
- Nitay Joffe, Founder and CTO of ActionIQ (next-generation data platform for marketing and consumer data)
- Adam Kanouse, CTO of Narrative Science (transforms data into meaningful and insightful narratives)
- Neha Narkhede, Founder and CTO of Confluent (real-time data platform built around Apache Kafka)
- Christopher Nguyen, Founder and CEO of Arimo (data intelligence platform)
Adam @NarrativeScience talked about how people with different personalities and jobs may require/prefer different takes on the same data. His firm ingests data and has systems to generate natural language reports customized to the subject area and the reader’s needs.
They currently develop stories with the guidance of experts, but eventually will move to machine learning to automate new subject areas.
Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform which collects data and allows access to these data in real time.
#TensorFlow and Cloud Machine Learning
Posted on June 7th, 2016
06/06/2016 @Audible Inc, 1 Washington Place, Newark, NJ 15th floor
Joshua Gordon @Google talked about #MachineLearning and the TensorFlow package. TensorFlow is an open source library of machine learning programs. Using the library you can manipulate tensors by defining graphs that are functions operating on the multivariate structures.
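The define-a-graph-then-run-it idea can be illustrated without TensorFlow itself. The sketch below is a toy dataflow graph in plain Python; `Node`, `placeholder`, `constant`, `add`, and `scale` are names invented for this example and are not TensorFlow's API.

```python
class Node:
    """A node in a tiny dataflow graph: an operation applied to input nodes."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = inputs

    def run(self, feed):
        """Evaluate this node, taking values for fed nodes from `feed`."""
        if self in feed:
            return feed[self]
        args = [node.run(feed) for node in self.inputs]
        return self.op(*args)


def placeholder():
    """A node whose value must be supplied when the graph is run."""
    def missing():
        raise ValueError("placeholder has no value in feed")
    return Node(missing)


def constant(value):
    return Node(lambda: value)


def add(a, b):
    """Element-wise sum of two vectors."""
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], (a, b))


def scale(a, factor):
    """Multiply every element of a vector by a scalar."""
    return Node(lambda x: [factor * p for p in x], (a,))


# Define the graph first; nothing is computed yet.
x = placeholder()
bias = constant([1.0, 1.0, 1.0])
y = add(scale(x, 2.0), bias)

# Run the graph later with concrete input.
print(y.run({x: [1.0, 2.0, 3.0]}))  # [3.0, 5.0, 7.0]
```

Separating graph definition from execution is what lets a library decide where and how each operation runs.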
The library runs on Linux and OS X, and on Windows using Docker. Support for Android is on the way. Joshua showed several applications including one that repaints an image in van Gogh’s style by merging layers from the network identifying colors from the original image with layers from a second network trained on the painter’s style.
Next, Yufeng Guo @Google talked about out-of-the-box machine learning APIs to classify images. Google has a cloud vision API and will shortly release a speech API.
The vision API imports a JPG file and outputs a description in JSON format including items identified and the confidence that the items are correctly identified. It also gives the coordinates of items identified and links to the full description of the items in Google’s database. The face detection routine also outputs information such as the rollAngle, joyLikelihood, etc. The service is free for up to 1000 requests per month.
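To give a feel for working with such output, the sketch below parses a hand-written JSON sample shaped like a Cloud Vision response and keeps only the high-confidence labels. The sample values (including the `mid` strings) are invented, and the `summarize` helper and `min_score` threshold are assumptions for illustration; no request is actually sent to the service.

```python
import json

# Hand-written sample shaped like a Vision API response; all values are made up.
sample_response = """
{
  "responses": [
    {
      "labelAnnotations": [
        {"mid": "/m/example1", "description": "dog", "score": 0.97},
        {"mid": "/m/example2", "description": "grass", "score": 0.81}
      ],
      "faceAnnotations": [
        {"rollAngle": -4.2, "joyLikelihood": "VERY_LIKELY"}
      ]
    }
  ]
}
"""


def summarize(response_text, min_score=0.9):
    """Return (labels above min_score, face attributes) from a response string."""
    result = json.loads(response_text)["responses"][0]
    labels = [
        (item["description"], item["score"])
        for item in result.get("labelAnnotations", [])
        if item["score"] >= min_score
    ]
    faces = [
        (face["rollAngle"], face["joyLikelihood"])
        for face in result.get("faceAnnotations", [])
    ]
    return labels, faces


labels, faces = summarize(sample_response)
print(labels)  # [('dog', 0.97)]
print(faces)   # [(-4.2, 'VERY_LIKELY')]
```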
#Blockstack: an introduction
Posted on May 4th, 2016
05/04/2016 @ AWS popup loft, 350 West Broadway, NY
Blockstack offers secure identification based on blockchain encryption and confirmation. Six speakers described the underlying machinery and applications.
- Muneeb Ali – An Overview of Blockstack
- Jude Nelson – The Blockstack Server and CLI
- Josh Jeffryes – OpenBazaar and Blockstack Identity
- Arkadiy Kukarkin – MediaChain
- Ryan Shea – Blockstack Desktop
- Juan Benet – Blockstack Naming with IPFS
As in Bitcoin, Blockstack promises secure identification and transactions without using a central verifying agent.
The BlockStack application stack from the bottom up contains:
- Blockchain – want to use the most secure chain, which is currently bitcoin.
- Naming – intermediate pseudonym
- Identity – establish who you are
- Authentication – use electronic signature
- Storage – put pointers in the block chain, so you need storage for the actual information
- Apps built on top of the stack
In terms of layers:
1. Cryptocurrency blockchain
2. Virtual blockchain – gives flexibility to migrate to another cryptocurrency.
3. Routing – pointers to the data location. Initially used a DHT (distributed hash table).
4. Data on cloud servers. Could be Dropbox, S3, …
Layers 1 & 2 are the control plane; above that is the data plane.
The current implementation uses a bitcoin wallet for identity and requires 10 blockchain confirmations to set up a person.
- OpenBazaar (a place to buy and sell without an intermediary) has a long identification string for each buy/sell. Blockstack provides a secure mapping of these ids to a simpler human-readable id
- Mediachain is a data network for information on creative works in which contributed information is validated by the contributor’s identity. All objects are IPFS + IPLD objects with information saved to Merkle trees. They are working on the challenge of private key management: high volumes of registrations and the need to register on behalf of 3rd
- IPFS (InterPlanetary File System) proposes
  - A DNS based on the package content, which will allow copies to be stored at several locations in the network
  - Greater individual control over DNS names, independent of any centralized naming body
  - Three levels of naming:
    - Content address. The content defines the address through a hash. But if the content (e.g. a blog) changes, the address changes.
    - Key name. A mutable layer of names that is stable even as the content changes
    - DNS name. A human-readable name that is paired with the key name
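The three naming levels can be mimicked with a few dictionaries, as in the sketch below. It uses a plain SHA-256 hex digest as the content address and in-memory dicts as the stores; real IPFS addresses and name records are encoded differently, and every name here (`publish`, `resolve`, `key-abc`, `blog.example`) is hypothetical.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive an address from the content itself (here: a SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


store = {}      # content address -> content
key_names = {}  # stable key name -> current content address
dns_names = {}  # human-readable name -> key name


def publish(key_name, data):
    """Store content under its hash and point the key name at the new address."""
    address = content_address(data)
    store[address] = data
    key_names[key_name] = address
    return address


def resolve(dns_name):
    """Follow DNS name -> key name -> content address -> content."""
    return store[key_names[dns_names[dns_name]]]


dns_names["blog.example"] = "key-abc"
first = publish("key-abc", b"first post")
second = publish("key-abc", b"first post, edited")

print(first != second)          # True: editing the content changes its address
print(resolve("blog.example"))  # b'first post, edited'
```

The readable name and the key name stay fixed while the content address underneath them changes with every edit.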
Evolving from #RDBMS to #NoSQL + #SQL
Posted on May 3rd, 2016
05/03/2016 @Thoughtworks, 99 Madison Ave, 15th floor, NY
Jim Scott @MAPR spoke about #ApacheDrill which has a query language that extends ANSI SQL. Drill provides an interface that uses this SQL-extension to access data in underlying db’s that are SQL, noSQL, csv, etc.
The Ojai API has the following advantages
- Gson (in #Java) uses two lines of code to serialize #JSON to place into the data. One line to deserialize
- Idempotent – so don’t need to worry about replaying actions twice if there is an issue.
- Drill requires Java, but not Hadoop, so it can run on a desktop
- Schema on the fly – will take different data formats and join them together: e.g. csv + JSON
- Data is directly accessed from the underlying databases without needing to first transform them to a metastore
- Security – plugs into authentication mechanism of the underlying dbs. Mechanisms can go through multiple chains of ownership. Security can be done on row level and column level.
- Commands extend SQL to allow access to lists in a JSON structure
- Can create views to output to parquet, csv, json formats
- FLATTEN – explode an array in a JSON structure to display as multiple rows with all other fields duplicated
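The effect of FLATTEN described in the last item can be imitated in a few lines of Python. This is only a sketch of the row-expansion behavior on in-memory dictionaries, not Drill's implementation; the `orders` records are invented.

```python
def flatten(rows, field):
    """Emit one output row per element of row[field], copying the other fields."""
    out = []
    for row in rows:
        for element in row[field]:
            new_row = dict(row)       # duplicate all other fields
            new_row[field] = element  # replace the array with a single element
            out.append(new_row)
    return out


# Hypothetical JSON-like records with an array field.
orders = [
    {"id": 1, "customer": "ann", "items": ["pen", "ink"]},
    {"id": 2, "customer": "bob", "items": ["paper"]},
]
for row in flatten(orders, "items"):
    print(row)
```

The two input records become three output rows, with `id` and `customer` repeated for each element of `items`.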
#NoSQL Databases & #Docker #Containers: From Development to Deployment
Posted on April 26th, 2016
04/26/2016 @ThoughtWorks 99 Madison Ave., 15th Floor, New York, NY
Alvin Richards, VP of Product, @Aerospike spoke about employing Aerospike in Docker containers.
He started by saying that database performance demands, including caches and data lakes, have made deployment complex and inefficient. Containers were developed to simplify deployment. They are similar to virtual machines, but describe the OS, programs and environmental dependencies in a standard format file. Components are
- Dockerfile with names + directory + processes to run for setup. OCI is the open container standard.
- Docker Compose orchestrates containers
- Docker Swarm orchestrates clustering of machines
- Docker Machine provisions machines.
Containers share root images (such as the Python image file).
Aerospike is a key value store which is built on the bare hardware (does not call the OS) for speed. It also automates data replication across nodes.
When Aerospike is run in Docker containers
- All nodes perform the same function – automated replication.
- The nodes self discover other nodes to balance the load & replication
- Application needs to understand the topology as it changes
In development, the data are often kept in the container since one usually wants to delete the development data when the development server is decommissioned. However, production servers usually don’t hold the data since these servers may be brought up and down, but the data is always retained.
#PostGresSQL conf 2016: #InternetOfThings in industry
Posted on April 19th, 2016
#PGConfus2016 #pgconfus #postgresql
Parag Goradia @GE talked about the transformation of industrial companies into software and analytics companies. He talked about three ingredients of this transformation
- Brilliant machines – reduce downtime, make more efficient = asset performance management
- Industrial big data
- People & work – the field force and service engineers
Unlike the non-industrial internet of things, sensors are already in place within machinery such as jet engines. However, until recently, only summary data were uploaded into the cloud. Industrial users are moving toward storing a much fuller data set. For instance, a single airline flight may yield 500 GB of data.
The advantages of improved performance and maintenance are huge. He estimates over $1 trillion per year in savings. With this data collection goal, GE has created Predix, a cloud store using PostgreSQL. Key features are
- Automated health checks/backups
- 1 touch scalability
- No unscheduled downtime
- Protected data sets
Parag talked about selected applications of this technology
- Intelligent cities – adding sensors to existing infrastructure such as street lights
- Health cloud advanced visualization – putting scans on the cloud for 3-d visualization and analytics
#PostGresSQL conf 2016: #SQL vs #noSQL
Posted on April 18th, 2016
04/18/2016 @ New York Marriott Brooklyn Bridge
The afternoon panel was composed of vendors providing both SQL and noSQL database access. The discussion emphasized that the use of a #SQL vs #noSQL database is primarily driven by
- The level of comfort developers/managers have with SQL or noSQL
- Whether the discipline of SQL rows and fields is useful in creating applications
- Whether applications use JSON structures which are easily saved in a noSQL database
- Linking to existing applications can be done either using a SQL database or using an ORM to a noSQL database.
The availability of ORM (object-relational mapping) software blurs the lines between SQL and noSQL databases. However, one is advised to avoid using an ORM when first using a noSQL database, so one can gain familiarity with the differences between SQL and noSQL.
Both db types need planning to avoid problems, and the right plan depends on the situation. For instance, sharding might be best done late when there is only a single application being developed. However, if one has many applications using the same infrastructure, one should consider specifying sharding policy early.
People want to avoid complexity, but don’t want to delegate setup to a standard default or procedure.
Controlling access can be done (even if there is no data in the object) by creating views which are accessible only by some users.
Geolocation data is best handled by specialized db’s like CartoDB: the coordinate system is not rectangular and data can be handled by sampling and aggregation.