Diff-MST - Differentiable Mixing Style Transfer

Soumya Sai Vanka, Christian Steinmetz, Jean-Baptiste Rolland, Joshua D. Reiss, György Fazekas




Abstract

Mixing style transfer automates the generation of a multitrack mix for a given set of tracks by inferring production attributes from a reference song. However, existing systems for mixing style transfer are limited in that they often operate only on a fixed number of tracks, introduce artifacts, and produce mixes in an end-to-end fashion, without grounding in traditional audio effects, prohibiting interpretability and controllability. To overcome these challenges, we introduce Diff-MST, a framework comprising a differentiable mixing console, a transformer controller, and an audio production style loss function. By inputting raw tracks and a reference song, our model estimates control parameters for audio effects within a differentiable mixing console, producing high-quality mixes and enabling post-hoc adjustments. Moreover, our architecture supports an arbitrary number of input tracks without source labelling, enabling real-world applications. We evaluate our model's performance against robust baselines and showcase the effectiveness of our approach, architectural design, tailored audio production style loss, and innovative training methodology for the given task. We provide code, pre-trained models, and listening examples online.

Architecture

Diff-MST is a differentiable mixing style transfer framework featuring a differentiable multitrack mixing console, a transformer-based controller that estimates control parameters for this console, and an audio production style loss function that measures the similarity between the estimated mix and the reference mix.


Differentiable Mixing Style Transfer

We propose a differentiable mixing style transfer system (Diff-MST) that takes raw tracks and a reference mix as input and predicts mixing console parameters and a mix as output. Our system employs two encoders, one to capture a representation of the input tracks and another to capture elements of the mixing style from the reference. A transformer-based controller network analyses representations from both encoders to predict the differentiable mixing console (DMC) parameters. The DMC generates a mix for the input tracks using the predicted parameters in the style of the given reference song. Given that our system oversees the operations of the DMC rather than directly predicting the mixed audio, we circumvent potential artefacts that may arise from neural audio generation techniques. This also creates an opportunity for further fine-tuning and control by the user.
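As a rough illustration of this control flow, the sketch below shows how a transformer controller might map track and reference embeddings to console parameters. It is a minimal sketch only: the class name, embedding size, parameter counts, and output heads are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MixStyleController(nn.Module):
    """Illustrative controller: track/reference embeddings in, DMC parameters out."""

    def __init__(self, embed_dim=256, num_params_per_track=26, num_master_params=14):
        super().__init__()  # parameter counts above are placeholders
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.track_head = nn.Linear(embed_dim, num_params_per_track)
        self.master_head = nn.Linear(embed_dim, num_master_params)

    def forward(self, track_emb, ref_emb):
        # track_emb: (batch, num_tracks, embed_dim) from the track encoder
        # ref_emb:   (batch, 1, embed_dim) from the reference encoder
        tokens = torch.cat([track_emb, ref_emb], dim=1)
        tokens = self.transformer(tokens)
        track_params = torch.sigmoid(self.track_head(tokens[:, :-1, :]))   # per-track DMC params in [0, 1]
        master_params = torch.sigmoid(self.master_head(tokens[:, -1, :]))  # master-bus params in [0, 1]
        return track_params, master_params
```

The predicted parameters are then passed to the DMC, so the network never generates audio directly.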


Differentiable Mixing Console (DMC)

Multitrack mixing involves applying a series of audio effects, termed a channel strip, to each channel of a mixing console. These effects are utilized by audio engineers to address various issues such as masking, source balance, and noise. In our approach, we propose a differentiable mixing console (DMC) that integrates prior knowledge of signal processing. The DMC applies a sequence of audio effects including gain, parametric equalizer (EQ), dynamic range compressor (DRC), and panning to individual tracks, resulting in wet tracks. These wet tracks are then combined on a master bus where stereo EQ and DRC are applied to produce a mastered mix. Incorporating a master bus in the console facilitates workflow optimization, as mastered songs commonly serve as references. To enable gradient descent and training within a deep learning framework, the mixing console must be differentiable. We achieve this by utilizing differentiable effects from the dasp-pytorch library.

Pipeline of Differentiable Mixing Console
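To make the idea of a differentiable console concrete, here is a minimal pure-PyTorch sketch of only the gain and panning stages. The actual DMC additionally applies parametric EQ and DRC per track, plus stereo EQ and DRC on the master bus, using differentiable effects from dasp-pytorch; those stages are omitted here to keep the example short.

```python
import torch

def simple_differentiable_console(tracks, gain_db, pan):
    """Minimal differentiable gain + pan console (EQ/DRC stages omitted).

    tracks:  (batch, num_tracks, samples) mono input tracks
    gain_db: (batch, num_tracks) per-track gain in dB
    pan:     (batch, num_tracks) pan position in [0, 1], 0 = left, 1 = right
    """
    gain_lin = 10.0 ** (gain_db / 20.0)               # dB -> linear gain
    theta = pan * (torch.pi / 2)                      # constant-power pan law
    left = torch.cos(theta) * gain_lin                # (batch, num_tracks)
    right = torch.sin(theta) * gain_lin
    wet_l = (tracks * left.unsqueeze(-1)).sum(dim=1)  # sum wet tracks to the stereo bus
    wet_r = (tracks * right.unsqueeze(-1)).sum(dim=1)
    return torch.stack([wet_l, wet_r], dim=1)         # (batch, 2, samples) mix

# Every operation is a standard differentiable tensor op, so gradients of a loss
# on the mix flow back to the console parameters and to the network predicting them.
```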


Audio Production Style Loss

The style of a mix can be broadly captured using features that describe its dynamics, spatialization, and spectral attributes. Two different losses are proposed to train and optimize the models.

Audio Feature (AF) loss: This loss is built from traditional MIR audio feature transforms: root mean square (RMS) and crest factor (CF) for dynamics, stereo width (SW) and stereo imbalance (SI) for spatialization, and the Bark spectrum (BS) for spectral attributes. The system is optimized by minimizing a weighted average of the mean squared errors between these features computed on the predicted mix and on the reference song. For more information, refer to the paper.
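The sketch below illustrates the flavour of this loss using only a subset of the features (RMS, crest factor, stereo width) and made-up weights; the full AF loss also includes stereo imbalance and Bark spectrum terms, and the weights used in the paper differ.

```python
import torch

def rms(x, eps=1e-8):
    # x: (..., samples); per-channel root mean square
    return torch.sqrt(torch.mean(x ** 2, dim=-1) + eps)

def crest_factor(x, eps=1e-8):
    # peak-to-RMS ratio in dB, per channel
    peak = torch.max(torch.abs(x), dim=-1).values
    return 20.0 * torch.log10(peak / (rms(x) + eps) + eps)

def stereo_width(x, eps=1e-8):
    # ratio of side-signal energy to mid-signal energy; x: (batch, 2, samples)
    mid = 0.5 * (x[:, 0, :] + x[:, 1, :])
    side = 0.5 * (x[:, 0, :] - x[:, 1, :])
    return torch.mean(side ** 2, dim=-1) / (torch.mean(mid ** 2, dim=-1) + eps)

def af_loss(pred, ref, weights=(0.1, 0.001, 1.0)):
    # pred, ref: (batch, 2, samples) stereo mixes; weights are illustrative only
    feats = [rms, crest_factor, stereo_width]
    return sum(w * torch.mean((f(pred) - f(ref)) ** 2) for w, f in zip(weights, feats))
```

Because these features are aggregate statistics, the predicted mix and the reference do not need to share content.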

MRSTFT loss: The multi-resolution short-time Fourier transform loss (Wang et al., 2019; Steinmetz et al., 2020) is the sum of the \(L_1\) distances between the STFTs of the ground-truth and estimated waveforms, measured in both log and linear magnitude domains at multiple resolutions, with window sizes \(W \in \{512, 2048, 8192\}\) and hop sizes \(H = W/2\). This is a full-reference metric, meaning that the two input signals must contain the same content.
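A minimal version of this loss in plain PyTorch might look as follows; in practice an off-the-shelf implementation such as the multi-resolution STFT loss in auraloss could be used instead. The epsilon and mean reduction here are assumptions.

```python
import torch

def mrstft_loss(pred, target, win_sizes=(512, 2048, 8192), eps=1e-8):
    """Sketch of a multi-resolution STFT loss: L1 on linear and log magnitudes.

    pred, target: (batch, samples) waveforms (channels flattened into the batch).
    """
    loss = 0.0
    for w in win_sizes:
        window = torch.hann_window(w, device=pred.device)
        P = torch.stft(pred, n_fft=w, hop_length=w // 2, window=window, return_complex=True).abs()
        T = torch.stft(target, n_fft=w, hop_length=w // 2, window=window, return_complex=True).abs()
        loss = loss + torch.mean(torch.abs(P - T))                                     # linear magnitude term
        loss = loss + torch.mean(torch.abs(torch.log(P + eps) - torch.log(T + eps)))   # log magnitude term
    return loss
```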

Training

We train our models end-to-end using two different methods.

Method 1

We extend the data generation technique used in [1] to a multitrack scenario. A \(t = 10\) sec segment is randomly sampled from the input tracks, and a random mix of these tracks is generated using random DMC parameters. The randomly mixed audio and the input track segments are each split into two halves of \(t/2\) secs: \(M_{rA}\) and \(M_{rB}\), and \(T_A\) and \(T_B\), respectively. The model takes \(T_B\) as the input tracks and \(M_{rA}\) as the reference song. The predicted mix \(M_p\) is compared against the ground truth \(M_{rB}\) for backpropagation and weight updates. Using different sections of the same song for the input tracks and the reference song encourages the model to focus on the mixing style while remaining content-invariant. Because a ground truth is available, this method allows optimization with the MRSTFT loss. The predicted mix is loudness normalized to -16.0 dBFS before computing the loss. We train models with 8 and 16 tracks using this method, both with the MRSTFT loss alone and with the MRSTFT loss followed by AF loss fine-tuning.

Training strategy for Method 1
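The following sketch outlines this self-supervised example construction; the `dmc` callable, its parameter interface, and the helper name are assumptions for illustration.

```python
import torch

def make_method1_example(tracks, dmc, rand_params, segment_len):
    """Build one Method 1 training example (illustrative sketch).

    tracks:      (1, num_tracks, 2 * segment_len) randomly sampled 10 s segment
    dmc:         differentiable mixing console callable
    rand_params: random console parameters used to create the random mix
    """
    random_mix = dmc(tracks, *rand_params)               # random mix of the full segment
    T_A, T_B = tracks.split(segment_len, dim=-1)         # first / second half of the raw tracks
    M_rA, M_rB = random_mix.split(segment_len, dim=-1)   # matching halves of the random mix
    # The model receives T_B as input tracks and M_rA as the reference;
    # M_rB serves as the ground truth for the full-reference loss.
    return T_B, M_rA, M_rB

# One training step (sketch):
# T_B, M_rA, M_rB = make_method1_example(tracks, dmc, rand_params, segment_len)
# M_p = model(T_B, M_rA)            # predicted mix
# loss = mrstft_loss(M_p, M_rB)     # compare against the held-out ground-truth half
```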


Method 2

A random number of input tracks (between 4 and 16) for song A is sampled from a multitrack dataset, and a pre-mixed real-world mix of a different song B, drawn from a dataset of full songs, is used as the reference. The model is trained using the AF loss computed between \(M_p\) and \(M_r\). This method allows us to train the model without a ground-truth mix. Unlike Method 1, it exposes the system to training examples closer to real-world scenarios, where the input tracks and the reference song come from different songs. However, due to the random sampling, some input track and reference song combinations may not be realistic. We train a model with up to 16 tracks using this method with AF loss (AF-16).
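A corresponding Method 2 step might look like the sketch below, with dummy tensors standing in for real dataset samples; the shapes and the `model`/`af_loss` calls are assumptions carried over from the earlier sketches.

```python
import torch

# Illustrative Method 2 step with dummy data in place of dataset samples.
num_tracks = int(torch.randint(4, 17, (1,)))        # 4-16 input tracks from song A
tracks = torch.randn(1, num_tracks, 44100 * 10)     # stand-in for the raw song A stems
M_r = torch.randn(1, 2, 44100 * 10)                 # stand-in for the pre-mixed reference (song B)

# M_p = model(tracks, M_r)     # predicted mix of song A in the style of song B
# loss = af_loss(M_p, M_r)     # AF loss only: no ground-truth mix is required
# loss.backward()
```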

Results

| Method | RMS ↓ | CF ↓ | SW ↓ | SI ↓ | BS ↓ | AF Loss ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Equal Loudness | 3.11 | 0.51 | 3.16 | 0.21 | 33.3 | 33.389 |
| MST | 3.15 | 0.45 | 4.64 | 0.13 | 0.09 | 0.185 |
| Diff-MST (ours) | | | | | | |
| MRSTFT-8 | 3.63 | 1.44 | 1.97 | 4.29 | 0.17 | 0.379 |
| MRSTFT-16 | 3.40 | 0.98 | 1.91 | 1.99 | 0.19 | 0.328 |
| MRSTFT+AF-8 | 3.12 | 0.86 | 1.29 | 0.76 | 0.13 | 0.237 |
| MRSTFT+AF-16 | 3.15 | 0.43 | 0.89 | 2.20 | 0.11 | 0.186 |
| AF-16 | 2.39 | 0.07 | 1.60 | 0.97 | 0.13 | 0.168 |
| Human 1 | 3.02 | 0.26 | 2.05 | 0.46 | 0.17 | 0.218 |
| Human 2 | 3.21 | 0.14 | 3.63 | 2.29 | 0.11 | 0.180 |



Audio Examples

Our Best-Performing Model: Diff-MST AF-16

We provide audio examples of the reference songs, the mixes generated by our best-performing model Diff-MST AF-16, and the equal-loudness mix (a normalized sum of the tracks without any mixing transformation). The audio examples are normalized to -16.0 dBFS. The reference songs may be from a different style or genre than the tracks to be mixed. This is intentional, to clearly demonstrate the mixing style transfer capability of our model.


reference
AF-16
Equal Loudness


Comparison with baselines and other models

We compare the performance of our model against two human mixes, the equal-loudness mix, and the mixing style transfer (MST) model from [2]. We provide audio examples of the reference songs and the mixes generated by our model and the baselines. The audio examples are normalized to -16.0 dBFS.


Song 1

reference
Equal Loudness
MST
Human 1
Human 2


Diff-MST (Ours)


MRSTFT-8
MRSTFT-16
MRSTFT+AF-8
MRSTFT+AF-16
AF-16



Song 2

reference
Equal Loudness
MST
Human 1
Human 2


Diff-MST (Ours)


MRSTFT-8
MRSTFT-16
MRSTFT+AF-8
MRSTFT+AF-16
AF-16



Song 3

reference
Equal Loudness
MST
Human 1
Human 2


Diff-MST (Ours)


MRSTFT-8
MRSTFT-16
MRSTFT+AF-8
MRSTFT+AF-16
AF-16



References

  1. Steinmetz, Christian J., Nicholas J. Bryan, and Joshua D. Reiss. “Style transfer of audio effects with differentiable signal processing.” arXiv preprint arXiv:2207.08759 (2022). 

  2. Koo, Junghyun, et al. “Music mixing style transfer: A contrastive learning approach to disentangle audio effects.” In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.