#Unsupervised Learning (Soumith Chintala) & #Music Through #ML (Brian McFee)
Posted on July 26th, 2016
07/25/2016 @Rise, 28 West 24rd Street, NY, 2nd floor
Two speakers spoke about machine learning
In the first presentation, Brian McFee @NYU spoke about using ML to understanding the patterns of beats in music. He graphs beats identified by Mel-frequency cepstral coefficients (#MFCCs)
Random walk theory combines two representations of points in the graph.
- Local: In the graph, each point is a beat, edge connect adjacent beats. Weight edges by MFCC .
- Repetition: Link k-nearest neighbor by repetition = same sound – look for beats. Weight by similarity (k is set to the square root of the number of beats)
- Combination: A = mu * local + (1-mu)*repetition; optimize mu for a balanced random walk , so probability of a local move – probability of a repetition move over all vertices. Use a least squares optimization to find mu so the two parts of the equation make equal contributions across all points to the value of A.
The points are then partitioned by spectral clustering: normalize Laplacian – take bottom eigenvectors which encode component membership for each beat; cluster the eigenvectors Y of L to reveal the structure. Gives hierchical decomposition of the time series. m=1, the entire song. m=2 gets the two components of the song. As you add more eigenvectors, the number of segments within the song increases.
Brain then showed how this segmentation can create compelling visualizations of the structure of a song.
The Python code used for this analysis is available in the msaf library.
He has worked on convolutional neural nets, but find them to be better at handing individual notes within the song (by contrast, rhythm is over a longer time period)
In the second presentation, Soumith Chintala talked about #GenerativeAdversarialNetworks (GAN).
Generative networks consist of a #NeuralNet “generator” that produces an image. It takes as input a high dimensional matrix (100 dimensions) of random noise. In a Generative Adversarial Networks a generator creates an image which is optimized over a loss function which evaluates “does it look real”. The decision of whether the image looks real is determined by a second neural net “discriminator” that tries to pick the fake image from a set of other real images plus the output of the generator.
Both the generator and discriminator NN’s are trained by gradient descent to optimize their individual performance: Generator = max game; discriminator = min game. The process optimizes Jensen-Shannon divergence.
Soumith then talked about extensions to GAN. These include
Class-conditional GANS – take noise + class of samples as input to the generator.
Video prediction GANS –predict what happens next given the previous 2 or 3 frames. Added a MSE loss (in addition to the discriminator classification loss) which compares what happened to what is predicted
Deep Convolution GAN – try to make the learning more stable by using a CNN.
Text-conditional GAN – input =noise + text. Use LSTM model on the text input. Generate images
Disentangling representations – InfoGAN – input random noise + categorical variables.
GAN is still unstable especially for larger images, so work to improve it includes
- Feature matching – take groups of features instead of just the whole image.
- Minibatch learning
No one has successfully used GAN for text-in to text-out
The meeting was concluded by a teaser for Watchroom – crowd funded movie on AI and VR.