A field guide to the latest music AI tidal wave

Bit Rate is our member vertical on music and AI. In each issue, we break down a timely music AI development into accessible, actionable language that artists, developers, and rights holders can apply to their careers, backed by our own original research.

This issue was originally sent out under our collab research vertical known as The DAOnload.


When we first embarked on our creative AI research sprint in October 2022, we were mesmerized by the quality and speed of AI models in the visual and text worlds, especially the likes of GPT-3, Stable Diffusion, and Midjourney. In contrast, tools for AI music generation seemed to be falling behind when it came to access to high-quality, large-scale training datasets, leading to outputs that felt overly lossy or generic.

That gap is closing in real time: 10 new AI models for music and audio generation have been unveiled in the last month alone. Some of these are home-grown models from anonymous contributors (e.g. Noise2Music and Moûsai); others are part of master’s theses; still others have come from AI research groups at big-tech juggernauts like Google and ByteDance.

The biggest improvement across the board? Fidelity. The audio quality of these latest models has improved by leaps and bounds compared to the examples we were featuring just a few weeks ago from text-to-audio models like Riffusion.

That said, it’s important to ground this latest development wave in some historical context. AI-driven music creation is nothing new — in fact, there are several AI music tools out there whose audio quality is still “better” than these latest models (e.g. AIVA, Soundful). Corporations like Apple, ByteDance, and Shutterstock have been buying out music AI companies for years, understanding the long-term commercial opportunities in making music creation easier for everyone.

So why the excitement around this latest wave?

To understand the difference, we have to look under the hood. Several music AI tools to date have worked with predetermined sounds built in: either stems and samples, or presets of virtual instruments and synthesizers crafted by musicians and producers. These models manipulate and pair the stems to create novel arrangements, or they generate MIDI to trigger pre-made virtual instruments.

This approach leads to higher sound quality because the models either work with prerecorded material or trigger pre-made virtual sounds. But it also inherently confines users’ creative possibilities to the preexisting audio. (One way around this is using MIDI to trigger virtual instruments, but the underlying model is still confined to the sound parameters of the different instrument presets.)
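
To make the distinction concrete, here is a minimal sketch in Python of the MIDI-plus-presets approach, using the pretty_midi library. The hard-coded pitches are a stand-in for whatever a symbolic model would actually predict; the point is that the file only makes sound when rendered through a pre-made instrument preset.

```python
# Minimal sketch of the "MIDI + preset instruments" approach described above.
# The note choices below are a placeholder for a symbolic model's predictions;
# the actual sound comes from whichever synth renders the resulting MIDI file.
import pretty_midi

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano preset

# Pretend these pitches came from a MIDI-generating model.
predicted_pitches = [60, 64, 67, 72]  # a C major arpeggio
for i, pitch in enumerate(predicted_pitches):
    note = pretty_midi.Note(velocity=90, pitch=pitch, start=i * 0.5, end=(i + 1) * 0.5)
    piano.notes.append(note)

pm.instruments.append(piano)
pm.write("arpeggio.mid")  # no audio is generated here, only instructions for a preset
```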

In contrast, most of the newer music AI models being revealed today, especially those built around a text-to-music flow, generate novel audio “from scratch.” How do the models know what to generate? They learn from many, many hours of training data, in the form of human-made music paired with text descriptions. The audio quality of these models’ outputs still lags behind, but it is improving rapidly. And we have already seen, in other creative domains like visual art and text, the incredible breadth and variability of output that similar models can unlock.
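
For comparison, here is a hypothetical sketch of the “from scratch” paradigm. The generate() function is a placeholder, not a real library (it just returns silence so the sketch runs); the key point is that the model’s output is a raw waveform array rather than MIDI or rearranged stems. Only the soundfile call used to write the WAV is a real API.

```python
# Hypothetical sketch: a text-conditioned model whose output is raw audio samples.
import numpy as np
import soundfile as sf

SAMPLE_RATE = 16_000  # many current research models generate at 16 kHz or lower

def generate(prompt: str, seconds: float = 10.0) -> np.ndarray:
    """Placeholder for a text-to-music model.

    A real model would map `prompt` to roughly SAMPLE_RATE * seconds audio samples,
    having learned from hours of (music, text description) training pairs.
    Here we return silence so the example runs end to end.
    """
    return np.zeros(int(SAMPLE_RATE * seconds), dtype=np.float32)

audio = generate("a relaxed lo-fi hip-hop beat with warm electric piano")
sf.write("generated.wav", audio, SAMPLE_RATE)  # every sample is model output, not a preset
```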

With that, let’s take a look at the most recent music model releases. I listened to hundreds of examples across these latest models and will highlight a few here, along with relevant model information!


MusicLM: Generating Music From Text

This model from Google Research is the highest-fidelity music generation model I’ve encountered so far, judging by how little noise it produces and how accurately its outputs represent the text prompts. The model also understands the musician experience: the differences, for instance, among a beginner, intermediate, and professional piano player or guitarist. And it can generate music tied to specific geographic locations with decent accuracy.

Sample outputs:


SingSong: Generating Music Accompaniments from Singing

More from Google Research — and using the same core model, AudioLM, as MusicLM (the model featured just above).

This model generates full tracks to accompany a cappella recordings, recognizing the key of the original and creating chords underneath that follow the melody, tempo, groove, and genre. I will highlight a few here, but I recommend visiting the page and clicking through the “30 second samples” section.


Moûsai: Text-to-Audio with Long-Context Latent Diffusion

This model, released by anonymous researchers, still produces a fair amount of noise, which tends to be an artifact of diffusion models at large. Nonetheless, the breadth of its frequency spectrum and the cohesion of its arrangements over time are impressive.


Noise2Music

The quality of these generations feels comparable to early recordings one might hear from the first half of the 20th century. That said, the complexity and coherence of the compositions and arrangements are impressive, especially in how closely they stay true to the details of the text prompts. Click through to listen:


EVEN MORE RESOURCES

Didn’t have time to drop into our Discord server this week? No worries. Stay up to date right here in your inbox with the best creative AI links and resources that our community members are sharing each week.

Shout-out to @cheriehu, @aflores, @s a r a h, @maartenwalraven, @moises.tech and @yung spielburg for curating this week’s featured links:

Music-industry case studies

AI tools, models, and datasets

Other articles and resources

Follow more updates in Discord

If you’re not already in our Discord server, please authorize the Memberful Discord bot, which should automatically give you access. Make sure to hop in #intros and say hi once you’re in!