As machine learning practitioners explore new ways to create art, introducing these innovations to the art community becomes essential. Our project aims to do just that: make it easy for choreographers and dancers to use machine learning models that create dance from music.
DanceMuse is the union of choreography and machine learning

New artificial intelligence (AI) research is finding its way into an unexpected part of society: dance

It comes as no surprise that computers are deeply involved in art. Photographers have been using computers to edit their work for ages, digital designers focus on creating art intended to be enjoyed on computers, and singers use computers to modify and enhance their voices. Up until now, however, choreographers have been using computers solely for gaining inspiration, sharing their work, and possibly for arranging music.

Machine learning (ML) breaks us out of this box. Where before choreographers and dancers could only watch videos and physically try out how different moves would look, ML now enables us to do much, much more.

It opens us to a new world of possibilities and allows us to ask questions we never dared ask before:

  • How would Michael Jackson dance to today’s music?
  • What would a J-pop dance look like if choreographed to the sounds of the NY subway?
  • What genre of dance lies between ballet and hip-hop? What does it look like?
  • What would a ballet dance look like if it were choreographed to a song by The Weeknd?
  • Can we make a dance based on cartoons?

Project Overview

Dataset Collection

Compile unique datasets embodying various dance styles, genres of music, and actual dancers.

Backend Computing

Adapt and deploy a state-of-the-art dance generation model to create long and expressive dance sequences.

User Experience

Streamline user interaction with the model so that non-experts can easily train the model on their own datasets and test it on new music.

Music to Dance - An Overview

Like any other machine learning model, a music-to-dance model needs an input dataset. The preprocessed dataset here has two parts. The first is pose extraction, where tools such as OpenPose extract human body movements from video as sequences of key joints. The second is audio feature extraction, where packages like librosa and essentia compute music features such as beat positions and mel-frequency cepstral coefficients (MFCCs). These two parts are then fed to a deep learning model that learns the relationship between the audio and the poses, which can broadly be viewed as treating the music as features and the poses as labels. Once the model is trained, given a new piece of music, it can generate a smooth, natural-looking, style-consistent, and beat-matching dance. Our challenge then becomes building a system that incorporates this AND allows for interaction with an artist.
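
To give a flavor of what the audio side of this preprocessing might look like, here is a minimal sketch using librosa to extract beats and MFCCs from a song. The file name and parameters are placeholders, and our actual preprocessing pipeline uses its own feature set.

```python
# Minimal sketch of audio feature extraction with librosa (illustrative only;
# the actual DanceMuse/DanceRevolution preprocessing uses its own feature set).
import librosa

audio_path = "song.wav"  # hypothetical input file

# Load the audio at a fixed sample rate
y, sr = librosa.load(audio_path, sr=22050)

# Beat tracking: estimated global tempo and frame indices of detected beats
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

# Mel-frequency cepstral coefficients (MFCCs), one vector per audio frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

print("Estimated tempo:", tempo, "BPM;", len(beat_times), "beats detected")
print("MFCC matrix shape (coefficients x frames):", mfcc.shape)
```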

Example of pose extraction

Deep Learning for Generating Dance

Our work builds on an existing machine learning model named DanceRevolution (paper here) as the basis for our system. DanceRevolution uses an encoder-decoder architecture; at a high level, the encoder learns to 'store' music information in a lower-dimensional space, and the decoder learns a mapping from that encoded representation to dance poses.

Overview of Model Architecture, from DanceRevolution

Encoding

The music is encoded using a transformer with k-nearest-neighbor self-attention. Self-attention means that at every timestep, the model can look at the music features before and after it in time to determine the encoding for the current timestep. K-nearest-neighbor refers to how far into the past or future the model can look. Self-attention suits this type of project because it lets the network learn relationships between music features across time while also considering the effect one timestep has on another (much like in NLP, where past and future words can dictate the meaning of the current word).
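
To illustrate the idea (this is a simplified sketch, not DanceRevolution's actual implementation), here is what self-attention restricted to the k nearest timesteps on either side might look like in PyTorch:

```python
# Simplified sketch of local (k-nearest-neighbor) self-attention over music features.
# Illustrative only; DanceRevolution's encoder has its own implementation details.
import torch
import torch.nn.functional as F

def local_self_attention(x, k):
    """x: (seq_len, dim) music feature sequence; k: neighbors visible on each side."""
    seq_len, dim = x.shape
    # In a real transformer, queries/keys/values come from learned projections;
    # here we use x directly to keep the sketch short.
    scores = x @ x.T / dim ** 0.5                      # (seq_len, seq_len) similarities
    # Mask out positions more than k steps away in either direction.
    idx = torch.arange(seq_len)
    mask = (idx[None, :] - idx[:, None]).abs() > k
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # each timestep attends locally
    return weights @ x                                 # locally contextualized features

features = torch.randn(100, 32)   # e.g. 100 timesteps of 32-dim music features
encoded = local_self_attention(features, k=5)
```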


Decoding

Historically, the issue with autoregressive decoders for generating dance has been error accumulation caused by exposure bias. During training, generative dance models predict each pose from the provided ground-truth poses, but at inference time, the decoder must predict each new pose from the previous pose, which was itself predicted. Essentially, the model is never exposed to its own prediction errors during training, so at inference time the errors in the predicted motion at each timestep accumulate over the dance sequence. Since a 1-minute dance sequence at 30 FPS is composed of 1,800 poses, the error accumulation can be quite severe.


What attracted us to DanceRevolution was its curriculum learning strategy for alleviating this exposure bias problem. During training, this strategy gradually transitions from a fully guided teacher-forcing scheme, in which the decoder predicts from the ground-truth poses, to an autoregressive scheme, in which the decoder predicts from its own prediction at the previous timestep. The right-hand side of the architecture diagram above shows that the input dance sequence the model is trained on alternates between ground-truth sub-sequences and autoregressively predicted sub-sequences, which grow in length over the course of training to gradually minimize exposure bias. This strategy allows longer dance sequences to be generated, compared to older models whose output would freeze after a few seconds.
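
As a rough illustration of the idea, here is a schematic PyTorch-style sketch of a decoding loop that alternates between teacher-forced and autoregressive chunks. The chunk schedule and the decoder call are hypothetical placeholders, not the paper's exact scheme:

```python
# Schematic sketch of the curriculum: within each training sequence, alternate
# between feeding ground-truth poses and the decoder's own predictions back in,
# and let the predicted sub-sequences grow longer as training progresses.
# (Illustrative only; DanceRevolution defines its own schedule and decoder.)
import torch

def decode_with_curriculum(decoder, music_enc, gt_poses, epoch, base_len=10):
    """music_enc: (T, d_music) encoded music; gt_poses: (T, d_pose) ground-truth poses."""
    T = gt_poses.shape[0]
    # Hypothetical schedule: predicted chunks grow as training progresses.
    chunk = min(base_len + epoch // 100, T)
    outputs = []
    prev_pose = gt_poses[0]
    for t in range(1, T):
        pred = decoder(prev_pose, music_enc[t])   # predict the pose at timestep t
        outputs.append(pred)
        # Alternate between teacher-forced chunks (feed back the ground truth)
        # and autoregressive chunks (feed back the model's own prediction).
        in_autoregressive_chunk = (t // chunk) % 2 == 1
        prev_pose = pred if in_autoregressive_chunk else gt_poses[t]
    return torch.stack(outputs)                    # (T - 1, d_pose) predicted poses
```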


Below are the outputs of a model trained on a ballet dataset for differing numbers of epochs, tested on new music. The leftmost video is produced by a model trained for 2,000 epochs; as evident in the jitteriness and the lack of correspondence between the beat and the moves, this model has not been trained long enough to generate realistic dance for the test music. The center video is produced by a model trained for 6,000 epochs. The movements of the key joints are noticeably smoother and more expressive, and the choreography better matches the music. The rightmost video is produced by a model trained for 10,000 epochs. It resembles the under-trained model in its jitteriness, and in addition it repeats the same spinning move. The regression of the model output after the additional 4,000 epochs likely indicates overfitting: training has caused the model to fit too closely to the training dataset, limiting its capacity to generalize to new music.

User Experience

Our current work has been to create a simple command-line tool to train and test models. The end goal, however, has always been a web-based interface where people can upload audio clips and have the model generate dances for them. We plan to soon add a demo feature to this website that does just that, incorporating Gradio to interface with the ML model seamlessly.
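
As a rough sketch of what that demo could look like with Gradio (the generate_dance_video function below is a hypothetical placeholder for our pipeline, not our final interface):

```python
# Rough sketch of a Gradio demo wrapping a trained model (placeholder names;
# generate_dance_video is a hypothetical stand-in for the real pipeline).
import gradio as gr

def generate_dance_video(audio_path):
    # Placeholder: run preprocessing on the uploaded audio, feed the features
    # to the trained model, and render the predicted pose sequence to a video.
    output_video_path = "generated_dance.mp4"
    return output_video_path

demo = gr.Interface(
    fn=generate_dance_video,
    inputs=gr.Audio(type="filepath", label="Upload a song"),
    outputs=gr.Video(label="Generated dance"),
    title="DanceMuse demo",
)

if __name__ == "__main__":
    demo.launch()
```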

Our Contributions

Our work makes music-to-dance generation models accessible to a wider audience than ever before by easing their use and creating templates for some of their possible applications.

In a few words, our work so far has been focused on:

  • Debugging model preprocessing
  • Streamlining preprocessing and testing processes
  • Implementing model fine-tuning to reduce training time
  • Training a set of pre-trained models to supplement the fine-tuning feature and reduce the need to fully train models
  • Creating a simple command-line tool to perform all aspects of the dance-generating process (preprocessing, training, and testing)

Trained models

In the works... Come back anytime to check out new outputs :)

About Us

We are a group of creative engineers!

Jason Kurian

I am a senior electrical engineering student in the computer engineering track at the Cooper Union. I am deeply interested in machine learning, data science, and software development. As the captain of Cooper's dance team, the premise of this project immediately piqued my interest, and I cannot wait for it to be actively used by the team. You can check out more of my work on GitHub and connect with me via LinkedIn.

Yuval Epstain Ofek

I am a senior electrical engineering student at the Cooper Union, concurrently working towards a master's degree. I love signal processing, mathematics, and machine learning, and I take pleasure in applying theories from these fields to my projects. Check out some of my projects on my GitHub here, connect via LinkedIn here, or send me an email! There is nothing better than chatting and sharing my work with others :)

Crystal Yuecen Wang

I am a senior electrical engineering student at the Cooper Union in the computer engineering focus track. I love web development, software engineering, artificial intelligence, and machine learning! Feel free to check out some of my projects on my GitHub. In my spare time, I love music, cooking, Netflix, and traveling :)