New York Tech Journal
Tech news from the Big Apple

NYAI#5: Neural Nets (Jason Yosinski) & #ML For Production (Ken Sanford)

Posted on August 24th, 2016

#NYAI, New York #ArtificialIntelligence

08/24/2016 @Rise, 43 West 23rd Street, NY, 2nd floor

Jason Yosinski @GeometricTechnology spoke about his work on #NeuralNets to generate pictures. He started with two examples of learning from feedback: training a robot to move more quickly, and computer-generating pictures that humans find appealing.

Jason next talked about AlexNet, based on work by Krizhevsky et al. (2012), which classifies images using a neural net with 5 convolutional layers (interleaved with max pooling and contrast-normalization layers) plus 3 fully connected layers at the end. The net, with 60 million parameters, was trained on ImageNet, which contains over 1 million images. His image classification code is available at http://Yosinski.com.
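
As a rough illustration (not the speaker's code; a hedged sketch assuming torchvision's AlexNet implementation matches the architecture described in the talk), the network and its parameter count can be inspected in a few lines of Python:

    from torchvision import models

    # AlexNet-style network: 5 convolutional layers (with max pooling and, in the
    # original paper, local response normalization) followed by 3 fully connected layers.
    net = models.alexnet(pretrained=True)
    n_params = sum(p.numel() for p in net.parameters())
    print(net)                                    # prints the layer structure
    print(f"{n_params / 1e6:.1f}M parameters")    # roughly 61M, close to the ~60M quoted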

Jason then discussed how the classifier learns about categories it was never explicitly trained to identify. For instance, the network may learn about faces even though there is no 'human' category, since faces help the system detect things such as hats (which appear above a face) by giving it context. It also learns to identify text, which gives it context for other shapes it is trying to classify.

He next talked about generating images by inputting random noise and randomly changing pixels. Some changes increase the network's confidence in the target class (such as 'lion'), and over many random moves the confidence level rises. Jason showed many random images that elicited high confidence, but the images often looked like purple-green slime. This is probably because the network, while learning, immediately discards the overall color of the image and is therefore insensitive to aberrations from normal colors. (See Erhan et al. 2009.)
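
A minimal sketch of this random hill-climbing idea, assuming the torchvision AlexNet above; the class index 291 for 'lion' is an assumption to check against your ImageNet label file:

    import torch
    from torchvision import models

    net = models.alexnet(pretrained=True).eval()
    target = 291                          # assumed ImageNet index for 'lion'
    img = torch.rand(1, 3, 224, 224)      # start from random noise
    with torch.no_grad():
        best = net(img).softmax(1)[0, target]
        for _ in range(1000):
            candidate = (img + 0.05 * torch.randn_like(img)).clamp(0, 1)  # perturb pixels
            score = net(candidate).softmax(1)[0, target]
            if score > best:              # keep only changes that raise the 'lion' confidence
                img, best = candidate, score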

[This also raises the question of how computer vision differs from human vision. If presented with a blue-colored lion, the first reaction of a human might be to note how the color mismatches objects in the 'lion' category. One experiment would be to present the computer model with the picture of a blue lion and see how it is classified. Unlike computers, humans encode information beyond the list of items they have learned, and this encoding includes extraneous information such as color or location. Maybe the difference is that humans incorporate a semantic layer that considers not only the category of the items but also other characteristics that define 'lion-ness'. Color may be more central to human image processing, as it has been conjectured that we have color vision so we can distinguish between ripe and rotten fruits. Our vision also taps into our expectation of seeing certain objects in the world, and we are primed to see those objects in specific contexts, so we have contextual information beyond what is available to the computer when classifying images.]

To improve the generated pictures of 'lions', he next used a generator to create pictures and change them until the classifier assigns high confidence to 'lion'. The generator is designed to create identifiable images and can even produce pictures of objects it has not been trained to paint. (Regularization needs to be applied to get better pictures of the target.)
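
A hedged sketch of the generator idea: instead of perturbing pixels, optimize a latent code through a generator and regularize the code. The tiny generator below is only a stand-in for illustration; the actual work used a trained deep generator network.

    import torch
    import torch.nn as nn
    from torchvision import models

    net = models.alexnet(pretrained=True).eval()
    target = 291                                     # assumed 'lion' index, as above

    code_dim = 100
    G = nn.Sequential(                               # toy stand-in for a trained generator
        nn.Linear(code_dim, 3 * 224 * 224), nn.Sigmoid(),
        nn.Unflatten(1, (3, 224, 224)))

    z = torch.randn(1, code_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=0.05)
    for _ in range(200):
        score = net(G(z)).softmax(1)[0, target]
        loss = -score + 1e-3 * z.pow(2).sum()        # regularizing the code keeps images plausible
        opt.zero_grad()
        loss.backward()
        opt.step()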

Slides at http://s.yosinski.com/nyai.pdf

In the second talk, Ken Sanford (@Ekenomics) of H2O.ai talked about the H2O open source project. H2O is a machine learning engine that can run in R, Python, Java, etc.

Ken emphasized how H2O (whose deep learning model is a multilayer feed-forward neural network) provides a platform built around its Java score code engine, which exports the trained model as Java scoring code. This eases the transition from the model developed in training to the model used to score inputs in a production environment.
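
For example, a minimal sketch (the file name and column layout are assumptions) of training an H2O deep learning model from Python and exporting the Java scoring code for production:

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()
    frame = h2o.import_file("train.csv")                 # hypothetical training file
    x, y = frame.columns[:-1], frame.columns[-1]         # assume the last column is the label
    model = H2ODeepLearningEstimator(hidden=[64, 64], epochs=10)
    model.train(x=x, y=y, training_frame=frame)
    # Export the POJO scoring class plus h2o-genmodel.jar for the production scorer
    h2o.download_pojo(model, path=".", get_jar=True)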

He also talked about the Deep Water project, which aims to allow other open source tools, such as MXNet, Caffe, and TensorFlow (for CNN, RNN, … models), to run in the H2O environment.

posted in:  AI, Big data, Data science, NewYorkAI, Open source    / leave comments:   No comments yet

Automatically scalable #Python & #Neuroscience as it relates to #MachineLearning

Posted on June 28th, 2016

#NYAI: New York Artificial Intelligence

06/28/2016 @Rise, 43 West 23rd Street, NY, 2nd floor

Braxton McKee (@braxtonmckee) of @Ufora first spoke about the challenges of creating a version of Python (#Pyfora) that naturally scales to take advantage of the hardware and handle parallelism as the problem grows.

Braxton presented an example in which we compute the minimum distance from each of a set of target points to a larger universe of points, based on their Cartesian coordinates. This is easily written for small problems, but the computation needs to be optimized when computing this value across many CPUs.
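
A single-process version of that example might look like the sketch below (my own illustration, not Braxton's code); Pyfora's job is to run the outer loop in parallel across cores and machines without the programmer rewriting it:

    import math

    def min_distances(targets, universe):
        """For each target point, return its minimum Euclidean distance
        to any point in the (much larger) universe of points."""
        return [min(math.hypot(tx - ux, ty - uy) for ux, uy in universe)
                for tx, ty in targets]

    targets = [(0.0, 0.0), (1.0, 2.0)]
    universe = [(3.0, 4.0), (1.0, 1.0), (-2.0, 0.5)]
    print(min_distances(targets, universe))   # [1.4142..., 1.0]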

However, the allocation across CPUs depends on the number of targets relative to the size of the point universe. Instead of trying to solve this analytically, they use a #Dynamicrebalancing strategy that splits the task and adds resources to the subtasks that create bottlenecks.

This approach solves many resource allocation problems, but still faces challenges:

  1. Nested parallelism: they look for parallelism within the code, find bottlenecks at the top level of parallelism, and split the task into subtasks at that level, …
  2. Data that do not fit in memory: they break tasks into smaller tasks. Each task also knows which other caches hold data, so those data can be accessed directly without going to slower main memory.
  3. Different types of architectures (such as GPUs) require different types of optimization.
  4. The optimizer cannot look inside Python packages, so it cannot optimize a bottleneck within a package.

Pyfora

  1. is a just-in-time compiler that moves stack frames from machine to machine and senses how to take advantage of parallelism
  2. tracks what data a thread is using
  3. dynamically schedules threads and data
  4. takes advantage of immutability, which allows the compiler to assume that functions do not change over time, so it can look inside a function when optimizing execution
  5. is written on top of another language which allows for the possibility of porting the method to other languages

In the second presentation, Jeremy Freeman of @Janelia.org spoke about the relationship between neuroscience research and machine learning models. He first talked about early work on understanding the function of the visual cortex.

Findings by Hubel & Wiesel in 1959 set the foundation for visual processing models for the past 40 years. They found that individual neurons in the V1 area of the visual cortex responded to the orientation of lines in the visual field. These inputs fed neurons that detect more complex features, such as edges, moving lines, etc.

Others also considered systems with higher-level recognition and how to train such systems. These include:

Perceptrons (Rosenblatt, 1957)

Neocognitron (Fukushima, 1980)

Hierarchical learning machines (LeCun, 1985)

Backpropagation (Rumelhart, 1986)

His doctoral research looked at the activity of neurons in the V2 area. They found they could generate higher-order patterns that some neurons discriminate among.

But in 2012 there was a jump in the performance of neural nets (the work from the U. of Toronto).

By 2014, some neural network algorithms performed better than humans and primates, especially in the area of image processing. This has led to many advances, such as Google DeepDream, which combines images and textures to create artistic hybrid images.

Recent scientific research allows one to look at thousands of neurons simultaneously. He also talked about some of his current research, which uses “tactile virtual reality” to examine neural activity as a mouse explores a maze (the mouse walks on a ball that senses its steps as it learns the maze).

Jeremy also spoke about model-free episodic control for complex sequential tasks requiring memory and learning. ML research has created models such as LSTMs and Neural Turing Machines, which retain state representations. Graham Taylor has looked at neural feedback modulation using gates.

He also noted that there are similar functionalities across the V1 visual area, the A1 auditory area, and the S1 tactile area.

To find out more, he suggested visiting his GitHub site, Freeman-lab, and looking at the web site neurofinder.codeneuro.org.

posted in:  AI, data analysis, Data science, NewYorkAI, Open source, Python    / leave comments:   No comments yet

DataDrivenNYC: bringing the power of #DataAnalysis to ordinary users, #marketers, #analysts.

Posted on June 18th, 2016

#DataDrivenNYC

06/13/2016 @AXA Equitable Center (787 7th Avenue, New York, NY 10019)

The four speakers were

Adam @NarrativeScience talked about how people with different personalities and jobs may require/prefer different takes on the same data. His firm ingests data and has systems to generate natural language reports customized to the subject area and the reader’s needs.

They currently develop stories with the guidance of experts, but will eventually move to machine learning to automate new subject areas.

Next, Neha @Confluent talked about how they created Apache Kafka: a streaming platform that collects data and allows access to those data in real time.
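
As a rough illustration (a hedged sketch using the third-party kafka-python client and a hypothetical "events" topic; Confluent also ships its own clients), publishing and reading a record looks like:

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b'{"user": "u1", "action": "click"}')   # append to the log
    producer.flush()

    consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for msg in consumer:
        print(msg.value)      # consumers read the same log in real time
        break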

Read more…

posted in:  data, data analysis, Data Driven NYC, Data science, databases, Open source    / leave comments:   No comments yet

#TensorFlow and Cloud Machine Learning

Posted on June 7th, 2016

#GDGNewark

06/06/2016 @Audible Inc, 1 Washington Place, Newark, NJ 15th floor

Joshua Gordon @Google talked about #MachineLearning and the TensorFlow package. TensorFlow is an open source library of machine learning programs. Using the library, you can manipulate tensors by defining graphs of functions that operate on these multidimensional structures.
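
A minimal sketch in the graph-and-session style of the TensorFlow 1.x API that was current at the time: define a graph as a function of tensor inputs, then run it in a session.

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None, 3])   # input tensor
    W = tf.Variable(tf.random_normal([3, 1]))         # trainable parameters
    y = tf.matmul(x, W)                               # the graph: a function of x and W

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))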

The library runs on Linux and OS X, and it runs on Windows using Docker. Support for Android is on the way. Joshua showed several applications, including one that repaints an image in van Gogh's style by merging layers from a network that identifies the colors of the original image with layers from a second network trained on the painter's style.

Next, Yufeng Guo @Google talked about out-of-the-box machine learning APIs to classify images. Google has a cloud vision API and will shortly release a speech API.

The Vision API imports a JPEG file and outputs a description in JSON format, including the items identified and the confidence that the items are correctly identified. It also gives the coordinates of the items identified and links to their full descriptions in Google's database. The face detection routine also outputs information such as rollAngle, joyLikelihood, etc. The service is free for up to 1000 requests per month.
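
A hedged sketch of such a request against the REST endpoint using the requests library; the API key and image path are placeholders:

    import base64, json, requests

    API_KEY = "YOUR_API_KEY"                         # placeholder
    with open("photo.jpg", "rb") as f:               # hypothetical image
        content = base64.b64encode(f.read()).decode()

    body = {"requests": [{
        "image": {"content": content},
        "features": [{"type": "LABEL_DETECTION"}, {"type": "FACE_DETECTION"}],
    }]}
    resp = requests.post(
        "https://vision.googleapis.com/v1/images:annotate?key=" + API_KEY,
        data=json.dumps(body))
    print(resp.json())   # JSON with labels, confidence scores, bounding boxes,
                         # and face attributes such as rollAngle and joyLikelihood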

posted in:  AI, data analysis, GDG, Open source    / leave comments:   No comments yet

#Blockstack: an introduction

Posted on May 4th, 2016

#BlockstackNYC

05/04/2016 @ AWS popup loft, 350 West Broadway, NY

Blockstack offers secure identification based on blockchain encryption and confirmation. Six speakers described the underlying machinery and applications.

As in Bitcoin, Blockstack promises secure identification and transactions without using a central verifying agent.

The BlockStack application stack from the bottom up contains:

  1. Blockchain – want to use the most secure chain, which is currently bitcoin.
  2. Naming – intermediate pseudonym
  3. Identity – establish who you are
  4. Authentication – use electronic signature
  5. Storage – put pointers in the block chain, so you need storage for the actual information
  6. Apps built on top of the stack

Layers

  1. Cryptocurrency blockchain
  2. Virtual blockchain – gives flexibility to migrate to another cryptocurrency.
  3. Routing – pointers to data locations. Initially implemented with a DHT (distributed hash table).
  4. Data on cloud servers. Could be Dropbox, S3, …

Layers 1 & 2 form the control plane; everything above is the data plane.

The current implementation uses a bitcoin wallet for identity and requires 10 blockchain confirmations to set up a person.

Applications presented

  1. OpenBazaar (a place to buy and sell without an intermediary) has a long identification string for each buy/sell. Blockstack provides a secure mapping of these ids to a simpler human-readable id
  2. Mediachain is a data network for information on creative works in which contributed information is validated by the contributor's identity. All objects are IPFS + IPLD objects with information saved to Merkle trees. They are working on the challenge of private key management: high volumes of registrations and the need to register on behalf of 3rd parties.
  3. IPFS (the InterPlanetary File System) proposes to
    1. Create a DNS based on the content of the package, which allows copies to be located at several locations in the network
    2. Give greater individual control over DNS names, independent of any centralized naming body
    3. There are three levels of naming
      1. The content defines the address through a hash. But if the content (e.g., a blog) changes, the address changes.
      2. Key name: a mutable layer of names that is stable even as the content changes

 

posted in:  applications, Blockstack, Open source, Personal Data, security, startup    / leave comments:   No comments yet

Evolving from #RDBMS to #NoSQL + #SQL

Posted on May 3rd, 2016

#SQLNYC

05/03/2016 @Thoughtworks, 99 Madison Ave, 15th floor, NY

Jim Scott @MAPR spoke about #ApacheDrill, which has a query language that extends ANSI SQL. Drill provides an interface that uses this SQL extension to access data in underlying stores that are SQL, NoSQL, CSV files, etc.

The OJAI API has the following advantages:

  1. Gson (in #Java) needs only two lines of code to serialize #JSON into the data store, and one line to deserialize.
  2. Idempotent – no need to worry about replaying actions twice if there is an issue.
  3. Drill requires Java, but not Hadoop, so it can run on a desktop.
  4. Schema on the fly – it will take different data formats and join them together: e.g., CSV + JSON.
  5. Data are accessed directly from the underlying databases without first being transformed into a metastore.
  6. Security – plugs into the authentication mechanisms of the underlying databases. Mechanisms can go through multiple chains of ownership. Security can be applied at the row level and the column level.
  7. Commands extend SQL to allow access to lists in a JSON structure (see the sketch after this list)
    1. CONTAINS
    2. SUM
    3. Can create views to output to Parquet, CSV, and JSON formats
    4. FLATTEN – explodes an array in a JSON structure into multiple rows, with all other fields duplicated
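
For instance, a hedged sketch of a FLATTEN query submitted over Drill's REST interface (default port 8047); the file path and the JSON shape (a customers file with an "orders" array) are assumptions:

    import requests

    query = """
    SELECT t.name, FLATTEN(t.orders) AS order_item
    FROM dfs.`/data/customers.json` AS t
    """
    resp = requests.post("http://localhost:8047/query.json",
                         json={"queryType": "SQL", "query": query})
    print(resp.json()["rows"])   # one row per array element, other fields duplicated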

posted in:  data, data analysis, databases, Open source, Programming    / leave comments:   No comments yet

#NoSQL Databases & #Docker #Containers: From Development to Deployment

Posted on April 26th, 2016

#SQLNYC

04/26/2016 @ThoughtWorks 99 Madison Ave., 15th Floor, New York, NY

Alvin Richards, VP of Product, @Aerospike spoke about employing Aerospike in Docker containers.

He started by saying that database performance demands, including caches and data lakes, have made deployment complex and inefficient. Containers were developed to simplify deployment. They are similar to virtual machines, but describe the OS, programs, and environmental dependencies in a standard-format file. The components are:

  1. A Dockerfile with names + directories + processes to run for setup. OCI is the open container standard.
  2. Docker Compose orchestrates containers
  3. Docker Swarm orchestrates clustering of machines
  4. Docker Machine provisions machines.

Containers share base images (such as the Python image).

Aerospike is a key-value store built to run on the bare hardware (bypassing the OS where possible) for speed. It also automates data replication across nodes.
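
A hedged sketch using the Aerospike Python client against a node (containerized or not); the namespace and set names are assumptions:

    import aerospike

    client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
    key = ("test", "demo", "user1")                  # (namespace, set, user key)
    client.put(key, {"name": "Ada", "visits": 1})    # key-value write
    _, _, record = client.get(key)                   # low-latency read
    print(record)                                    # {'name': 'Ada', 'visits': 1}
    client.close()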

When Aerospike is run in Docker containers:

  1. All nodes perform the same function – replication is automated.
  2. The nodes discover other nodes on their own to balance the load and replication.
  3. The application needs to understand the topology as it changes.

In development, the data are often kept in the container, since one usually wants to delete the development data when the development server is decommissioned. However, production servers usually don't hold the data, since these servers may be brought up and down while the data must always be retained.

posted in:  databases, Open source, Programming    / leave comments:   No comments yet

#PostGresSQL conf 2016: #InternetOfThings in industry

Posted on April 19th, 2016

#PGConfus2016  #pgconfus  #postgresql

04/19/2016 @ New York Marriott Brooklyn Bridge

Parag Goradia @GE talked about the transformation of industrial companies into software and analytics companies. He described three ingredients of this transformation:

  1. Brilliant machines – reduce downtime, make more efficient = asset performance management
  2. Industrial big data
  3. People & work – the field force and service engineers

Unlike the non-industrial internet of things, sensors are already in place within machinery such as jet engines. However, until recently only summary data were uploaded into the cloud. Industrial users are moving toward storing a much fuller data set. For instance, a single airline flight may yield 500 GB of data.

The advantages of improved performance and maintenance are huge: he estimates over $1 trillion per year in savings. With this data collection goal, GE has created Predix, a cloud store using Postgres. Key features are:

  1. Automated health checks/backups
  2. One-touch scalability
  3. No unscheduled downtime
  4. Protected data sets

Parag talked about selected applications of this technology

  1. Intelligent cities – adding sensors to existing infrastructure such as street lights
  2. Health cloud advanced visualization – putting scans in the cloud for 3-D visualization and analytics

posted in:  Big data, databases, Internet of Things, Open source    / leave comments:   No comments yet

#PostGresSQL conf 2016: #SQL vs #noSQL

Posted on April 18th, 2016

#PGConf2016

04/18/2016 @ New York Marriott Brooklyn Bridge

The afternoon panel was composed of vendors providing both SQL and noSQL database access. The discussion emphasized that the use of a #SQL vs #noSQL database is primarily driven by

  1. The level of comfort developers/managers have with SQL or noSQL
  2. Whether the discipline of SQL rows and fields is useful in creating applications
  3. Whether applications use JSON structures, which are easily saved in a noSQL database
  4. Whether linking to existing applications is done using a SQL database or an ORM over a noSQL database

The availability of ORM (object-relational mapping) software blurs the lines between SQL and noSQL databases. However, one is advised to avoid using an ORM when first using a noSQL database, so one can gain familiarity with the differences between SQL and noSQL.

Both database types need planning to avoid problems, and the right approach depends on the situation. For instance, sharding might best be done late when only a single application is being developed. However, if many applications use the same infrastructure, one should consider specifying a sharding policy early.

People want to avoid complexity, but don’t want to delegate setup to a standard default or procedure.

Controlling access can be done (even if there is no data in the object) by creating views that are accessible only to some users.

Geolocation data is best handled by specialized databases like CartoDB: the coordinate system is not rectangular, and the data can be handled by sampling and aggregation.

posted in:  databases, Geolocation, Open source    / leave comments:   No comments yet

#PostGresSQL conf 2016: #DataSecurity

Posted on April 18th, 2016

#PGConf2016

04/18/2016 @ New York Marriott Brooklyn Bridge

Several morning presentations concentrated on data security. Secure data sharing has several conflicting goals, including:

  1. Encourage data sharing
  2. Restricting who can see the data and who can pass on the rights to see the data
  3. Integrity of the original data including verification of the sender/originator
  4. Making protection transparent to users

Traditional approaches have concentrated on encryption that cannot be broken by outsiders and on schemes that let users validate content and ensure confidential data transmission.

The following additional approaches/ideas were discussed:

  1. Create a system in which the data are self-protecting.
  2. Should encryption be enforced at the object or the attribute level?
  3. No security system will be watertight (except no access at all), so one needs to understand the major threats, the best solutions, and the tradeoffs. Whom can you trust?
  4. Can control be extended beyond the context of the system?
  5. Hardware Security Modules may offer more security, but the raw key cannot be exported and could be attacked via the OS.
  6. Use point-to-point security systems so data are only available when needed.
  7. Point-to-point key management systems offer the possibility of understanding usage patterns to identify security breaches.

The problem is not software; it is the requirements. One wants to trust the user as much as possible and not rely on the service provider. However, in a big organization one needs a recovery key, since there is a premium on recovering lost data.

posted in:  databases, Open source, security    / leave comments:   No comments yet