Stem Splitter AI

Introduction

Stem splitting is the process of separating a song into its individual parts, typically the vocal, percussive, and instrumental parts. This is useful for music creation, sampling, and remixing because it lets musicians take a piece in directions the original author did not intend. Our goal is to separate the vocals from the rest of a piece. Below is an example of a vocal stem split alongside the original piece:

Original Song

Vocals

The Applications of Machine Learning in Stem Splitting

One of the main challenges of stem splitting is isolating the correct frequencies. Each instrument or voice has a set of undertones and overtones that gives it its unique sound, otherwise known as timbre. To isolate a part from the rest of the song, we must correctly identify the overtones and undertones associated with that part.

In addition, the frequencies of these parts often overlap, so separation is not as simple as recognizing that an instrument is playing and eliminating every frequency it plays. This combination of problems makes stem splitting well suited to a neural network, which can learn to identify and isolate the frequencies associated with vocals.

Designing a machine learning model which separates these parts is difficult, but not impossible. Many groups, such as SigSep and Google’s Magenta Research group, have successfully used LSTM neural networks to recognize patterns in vocals and instrumentation. Once the algorithm successfully isolates these frequencies, they can be filtered out and separated accordingly.

Our General Method

Below is an overview of the general steps we took to create our final models. We developed our architecture with guidance from SigSep's OpenUnmix, an existing source-separation model built in PyTorch. Our general method is to use a neural network to generate a mask that filters out every part of the audio that is not vocals.

Our methodology can be split into three parts: converting audio to frequency-domain tensors, building our dataset and applying our model, and converting our results back to audio.

Data Preprocessing

A useful way to represent audio is to analyze its frequencies over time, creating what is called a spectrogram. This representation lets us attenuate each frequency by a different amount simply by applying a filter “mask” to the spectrogram. To preprocess our stem files, we use the built-in Short-Time Fourier Transform (STFT) method in PyTorch, which creates a spectrogram that can be passed into the neural network.
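As a rough illustration of this preprocessing step, the sketch below calls `torch.stft` on a synthetic tone. The FFT size, hop length, Hann window, and the helper name `to_spectrogram` are assumptions made for the example, not the settings used in this project.

```python
import math
import torch

def to_spectrogram(waveform: torch.Tensor, n_fft: int = 1024, hop_length: int = 256) -> torch.Tensor:
    """Return the complex STFT of a mono waveform shaped (num_samples,)."""
    window = torch.hann_window(n_fft)  # assumed window choice
    return torch.stft(
        waveform,
        n_fft=n_fft,
        hop_length=hop_length,
        window=window,
        return_complex=True,
    )

# One second of a 440 Hz sine wave at 16 kHz stands in for a real track.
sample_rate = 16000
t = torch.arange(sample_rate) / sample_rate
waveform = torch.sin(2 * math.pi * 440.0 * t)

spec = to_spectrogram(waveform)  # complex tensor shaped (freq_bins, frames)
magnitude = spec.abs()           # magnitude spectrogram, one column per time frame
```

Each column of `magnitude` describes the frequency content of one short window of audio, which is the form a mask can be multiplied against.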

Building the Dataset

To train a neural network with our architecture, we must also create a dataset that fits the model. We used the MUSDB dataset of music and its stems as a basis. The input to the network is the STFT of the music, and the target is an ideal filter: a mask of values that, when multiplied by the spectrogram of the song, outputs only the vocals.
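One common way to define such a target is a ratio mask: the vocal magnitude divided by the mixture magnitude, clamped to the range 0 to 1. The sketch below assumes that definition; the exact formula used in this project is not stated, and the function name and tiny tensors are purely illustrative.

```python
import torch

def ideal_ratio_mask(mix_mag: torch.Tensor, vocal_mag: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mask that, multiplied by the mixture magnitude, approximates the vocal magnitude."""
    mask = vocal_mag / (mix_mag + eps)  # eps avoids division by zero in silent bins
    return mask.clamp(0.0, 1.0)         # keep the mask a pure attenuation

# Toy 2x2 "spectrograms" (frequency bins x frames) in place of real MUSDB stems.
mix_mag = torch.tensor([[1.0, 2.0], [0.0, 4.0]])
vocal_mag = torch.tensor([[0.5, 2.0], [0.0, 1.0]])

mask = ideal_ratio_mask(mix_mag, vocal_mag)
recovered = mask * mix_mag  # should match vocal_mag
```

The clamp matters for real stems, where overlapping sources can make the vocal magnitude exceed the mixture magnitude in some bins.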

Converting Results to Audio

Once the neural network has created a mask for the audio, we can apply the mask to the original spectrogram and turn the spectrogram back into audio. If the neural network performs well, the output will be only the vocals from the original music track.
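A minimal sketch of this last step, assuming a real-valued mask with the same shape as the spectrogram, is shown below. The FFT settings and the helper name `apply_mask_to_audio` are assumptions for the example; the mask here is a hand-made placeholder rather than a network output.

```python
import math
import torch

N_FFT, HOP = 1024, 256  # assumed STFT settings; they must match the forward transform

def apply_mask_to_audio(waveform: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mask the spectrogram of a mono waveform and convert the result back to audio."""
    window = torch.hann_window(N_FFT)
    spec = torch.stft(waveform, n_fft=N_FFT, hop_length=HOP, window=window, return_complex=True)
    masked = spec * mask  # a real mask scales magnitudes and keeps the mixture's phase
    return torch.istft(masked, n_fft=N_FFT, hop_length=HOP, window=window, length=waveform.shape[-1])

# One second of a 440 Hz tone at 16 kHz stands in for a song.
t = torch.arange(16000) / 16000
waveform = torch.sin(2 * math.pi * 440.0 * t)

all_pass = torch.ones(N_FFT // 2 + 1, 1 + 16000 // HOP)  # mask that keeps everything
restored = apply_mask_to_audio(waveform, all_pass)
silence = apply_mask_to_audio(waveform, torch.zeros_like(all_pass))  # mask that removes everything
```

An all-ones mask returns the original audio and an all-zeros mask returns silence; a trained network's mask falls somewhere in between for each frequency bin and frame.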

The Multilayer Perceptron Model

As a baseline, we created a simple three-layer Multilayer Perceptron (MLP). This model consists of three dense layers, each followed by a Rectified Linear Unit (ReLU) activation function.
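A baseline of this shape could look roughly like the following in PyTorch. The layer widths, the class name `MaskMLP`, and the choice to process one spectrogram frame per row are assumptions for illustration; only the three dense layers with ReLU after each come from the description above.

```python
import torch
from torch import nn

class MaskMLP(nn.Module):
    """Three dense layers, each followed by ReLU, mapping a frame to a mask."""

    def __init__(self, n_bins: int = 513, hidden: int = 512):  # sizes are assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins), nn.ReLU(),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, n_bins); every frame is processed independently
        return self.net(frames)

model = MaskMLP()
frames = torch.rand(63, 513)  # random stand-in for a magnitude spectrogram
mask = model(frames)          # same shape as the input, non-negative values
```

Because each frame is handled on its own, this model has no notion of what came before or after it in time.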

We wanted to create this model as a way to better understand the impact of adding an LSTM layer, which we explore in the next iteration of our model. This was also a good starting point for us because of the complexity of the data we were working with.

An Introduction to LSTM

Long short-term memory (LSTM) is a type of recurrent neural network used to process data in sequences, such as audio or video. We will walk through how these networks work, and how they are applicable to audio processing.

An LSTM works by connecting a series of inputs to a chain of LSTM cells that pass memory along the series. Each cell determines what memory to pass on to the next cell based on its piece of data, the previous cell's output, and the memory that was passed to it. If these inputs are pieces of data across time, then the LSTM will determine the output of the current state in part from the previous states.

The LSTM cells accomplish this behavior with “gates”, which use the data and trained weights to decide whether to keep previous memory (known as the “cell state”) and whether to store new data in the cell state. The cell then passes the updated cell state on to the next cell.

Source: Understanding LSTMs, Colah's Blog

As pictured in the diagram, each gate is simply a sigmoid expression multiplied by a term. The sigmoid squashes values to between zero and one, so multiplying a term by a value near one lets the data pass through the gate, while multiplying by a value near zero blocks it.
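For reference, the gating described above is usually written as the following set of equations. The notation follows the standard LSTM formulation used in the Understanding LSTMs post: $x_t$ is the input at step $t$, $h_{t-1}$ the previous output, $C_t$ the cell state, $\sigma$ the sigmoid, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{(forget gate: keep previous memory?)} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(input gate: store new data?)} \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) && \text{(candidate memory)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(updated cell state)} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(C_t\right) && \text{(output passed to the next cell)}
\end{aligned}
```

Each sigmoid term lies between zero and one, so it acts as a soft switch on whatever it multiplies.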

LSTM Applications for Audio Processing

LSTM networks are especially useful for inputs that change over time because, as their name suggests, they keep both long-term and short-term memory of a sequence, picking up on both immediate and long-term context.

LSTM networks have become a common method for neural network audio processing because audio relies on events that happen over time. In fact, music is much more about the change in frequencies over time than the specific frequencies themselves. The LSTM network should therefore introduce fewer audio artifacts: unlike a Multilayer Perceptron, it has temporal stability (each output should make sense relative to the previous one).

For our LSTM neural network, we simply added a single unidirectional LSTM layer to three fully connected layers. This architecture is very similar to our Multilayer Perceptron model, but with the significant added complexity that the data being handled by the network now has the extra dimension of time.
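The sketch below shows one way such a model could be laid out in PyTorch. Placing the LSTM before the fully connected layers, the layer widths, and the class name `MaskLSTM` are all assumptions for illustration; the description above only fixes a single unidirectional LSTM layer and three fully connected layers.

```python
import torch
from torch import nn

class MaskLSTM(nn.Module):
    """One unidirectional LSTM layer followed by three fully connected layers."""

    def __init__(self, n_bins: int = 513, hidden: int = 512):  # sizes are assumed
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_bins,
            hidden_size=hidden,
            num_layers=1,
            batch_first=True,
            bidirectional=False,
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins), nn.ReLU(),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, n_bins); the LSTM reads frames in time order
        out, _ = self.lstm(spec)
        return self.fc(out)

model = MaskLSTM()
spec = torch.rand(2, 63, 513)  # random stand-in for two magnitude spectrograms
mask = model(spec)             # one mask value per batch item, frame, and bin
```

Compared with the MLP, the input keeps its time axis, so each frame's mask can depend on the frames that came before it.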

Results

Rock Music (Original)

Rock Music with MLP

Rock Music with LSTM

As pictured in the spectrograms, both the Multilayer Perceptron and the LSTM networks significantly filtered the original audio. We can hear what these spectrograms sound like in the next section!

Folk Music

Original:

MLP Model:

LSTM Model:

Actual Vocals:

Rock Music

Original:

MLP Model:

LSTM Model:

Actual Vocals:

Percussive Music

Original:

MLP Model:

LSTM Model:

Actual Vocals:

Interpretation

Admittedly, the audio that was passed through both of the networks came out very distorted. However, qualitatively the result from the LSTM network does have less distortion and does seem to reduce percussion and bass significantly. This better performance can likely be attributed to LSTM’s temporal stability.

A significant portion of our underwhelming results can be attributed to the models not being trained to completion. Because we were working with a lot of data, heavy processing, and computationally intensive neural networks, we were only able to train our networks for a small number of epochs. We likely would have seen better results with longer training and faster hardware.

Another challenge was handling exploding gradients. We addressed this by decreasing our learning rate, but doing so may have sacrificed some of our model's training capability. We also found that data normalization helped control exploding gradients. If we had more time, we could have debugged this issue further by examining the weights going into and coming out of each layer.
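A hypothetical version of that debugging step is sketched below: the input is normalized, one training step is run on a toy model, and the gradient norm of every parameter is recorded. The helper names `normalize` and `grad_norms`, the toy model, and the learning rate are all assumptions for illustration, not the code used in this project.

```python
import torch
from torch import nn

def normalize(spec_mag: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale the input to zero mean and unit standard deviation."""
    return (spec_mag - spec_mag.mean()) / (spec_mag.std() + eps)

def grad_norms(model: nn.Module) -> dict:
    """Return the gradient norm of every parameter that has a gradient."""
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }

# Toy model and data in place of the real networks and spectrograms.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 8))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # small, assumed learning rate

x = normalize(torch.rand(16, 8) * 100.0)  # large raw values, normalized before training
target = torch.rand(16, 8)

loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
norms = grad_norms(model)  # inspect these per layer before stepping
optimizer.step()
```

Logging `norms` at each step would show which layer's gradients grow first, which is one way to narrow down where the instability starts.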

We know that models such as those in OpenUnmix’s model library and Magenta’s music generation model define multiple LSTM layers. Our basic LSTM model had only one LSTM layer, so a natural next step would be to study the effect of adding more layers.

Resources

Understanding LSTMs, Colah's Blog

SigSep Tutorials, Current Trends in Audio Separation

SigSep MUSDB Dataset

OpenUnmix PyTorch Library