A field guide to the latest music AI tidal wave
Bit Rate is our member vertical on music and AI. In each issue, we break down a timely music AI development into accessible, actionable language that artists, developers, and rights holders can apply to their careers, backed by our own original research.
This issue was originally sent out under our collaborative research vertical, The DAOnload.
When we first embarked on our creative AI research sprint in October 2022, we were mesmerized by the quality and speed of AI models in the visual and text worlds, especially the likes of GPT-3, Stable Diffusion, and Midjourney. In contrast, AI music generation tools seemed to be falling behind on access to high-quality, large-scale training datasets, leading to outputs that felt overly lossy or generic.
That gap is closing in real time: 10 new AI models for music and audio generation have been unveiled in the last month alone. Some of these are home-grown models from anonymous contributors (e.g. Noise2Music and Moûsai); others are part of master’s theses; still others have come from AI research groups at big-tech juggernauts like Google and ByteDance.
The biggest improvement across the board? Fidelity. The audio quality of these latest models, compared to the examples we were featuring just a few weeks ago from text-to-audio models like Riffusion, has improved by leaps and bounds.
That said, it’s important to ground this latest development wave in some historical context. AI-driven music creation is nothing new — in fact, there are several AI music tools out there whose audio quality is still “better” than these latest models (e.g. AIVA, Soundful). Corporations like Apple, ByteDance, and Shutterstock have been buying out music AI companies for years, understanding the long-term commercial opportunities in making music creation easier for everyone.
So why the excitement around this latest wave?
To understand the difference, we have to look under the hood. Several music AI tools to date have worked with predetermined sounds built in — either stems and samples, or presets of virtual instruments and synthesizers crafted by musicians and producers. These models manipulate and pair the stems to create novel arrangements, or they generate MIDI to trigger pre-made virtual instruments.
This approach yields higher sound quality because the tools work either with prerecorded material or with pre-made virtual sounds. It also inherently confines users’ creative possibilities to that preexisting audio. (Generating MIDI to trigger virtual instruments is one way around this, but the underlying model is still confined to the sound parameters of the different instrument presets.)
In contrast, most of the newer music AI models being revealed today, especially those building a text-to-music flow, generate novel audio “from scratch.” How do the models know what to generate? They learn from many, many hours of training data, in the form of human-made music paired with text descriptions. The audio quality of these models’ outputs still lags behind, but it is improving rapidly. And we have seen the incredible breadth and variability of output that similar models can unlock in other creative domains, like visual art and text.
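The distinction between the two paradigms can be made concrete with a toy sketch using only the Python standard library. The note events, sample rate, and sine-wave “renderer” below are illustrative stand-ins, not any model’s actual output: a symbolic model stops at the note events and hands them to an instrument preset, while a raw-audio model is responsible for every sample in the waveform itself.

```python
import math
import struct
import wave

# 1) Symbolic approach: the "model" outputs note events (like MIDI);
#    the actual sound comes from whatever instrument preset renders them.
note_events = [(60, 0.25), (64, 0.25), (67, 0.5)]  # (MIDI pitch, seconds)

# 2) Raw-audio approach: the model outputs the waveform itself, sample
#    by sample. Here a plain sine wave stands in for generated samples.
SR = 16000  # a common sample rate in audio-generation research

def midi_to_hz(pitch):
    """Convert a MIDI pitch number to frequency (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((pitch - 69) / 12)

samples = []
for pitch, dur in note_events:
    hz = midi_to_hz(pitch)
    samples += [0.5 * math.sin(2 * math.pi * hz * n / SR)
                for n in range(int(SR * dur))]

# Write the raw samples out as a 16-bit mono WAV file.
with wave.open("toy.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 16-bit
    f.setframerate(SR)
    f.writeframes(b"".join(struct.pack("<h", int(s * 32767))
                           for s in samples))
```

A symbolic model only ever has to get the note events right; a raw-audio model has to get all 16,000 samples per second right, which is exactly why fidelity has been the hard part.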
With that, let’s take a look at highlights from the most recent music model releases. I listened to hundreds of examples across these latest models, and will highlight a few here with relevant model information!
MusicLM: Generating Music From Text
This model from Google Research is the highest-fidelity music generation model I’ve encountered so far, judging by its low noise floor and how accurately it renders text prompts. The model understands the musician experience: the differences, for instance, among a beginner, an intermediate, and a professional piano player or guitarist. It can also render music tied to specific geographic locations with decent accuracy.
Sample outputs:
- A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable.
- A meditative song, calming and soothing, with flutes and guitars. The music is slow, with a focus on creating a sense of peace and tranquility.
- A jazz and saxophone version of a whistling melody.
- An a cappella chorus version of an acoustic guitar fingerpicking melody.
SingSong: Generating Music Accompaniments From Singing
More from Google Research — and using the same core model, AudioLM, as MusicLM (the model featured just above).
This model generates full tracks to accompany a cappella recordings — recognizing the key of the original, and creating chords underneath that follow the melody, tempo, groove, and genre. I will highlight a few here, but I recommend visiting the page and clicking through the “30 second samples” section.
Moûsai: Text-to-Audio with Long-Context Latent Diffusion
This model, released by anonymous researchers, still produces a decent amount of noise, which tends to be an artifact of diffusion models at large. Nonetheless, the frequency spectrum and cohesive arrangements over time are impressive.
- Prompt: Electro Swing Remix 2030 (High Quality) (Deluxe Edition) 3 of 4 — proper groove at 00:03, in this author’s opinion
- Prompt: Guitar Bass Solo Hard Rock (High Quality) 2 of 3 — the AI created a call-and-response structure between vocal and backing band. An excerpt of this in the middle of a song would be great!
- Prompt: Hip Hop, Rap Battle, 2018 (High Quality) (Deluxe Edition) 3 of 4 — really coherent vocal rhythm and backing track.
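The residual noise mentioned above is easier to understand with a toy sketch of how diffusion sampling works. Everything here is a simplification for illustration: a one-dimensional sine “signal” stands in for audio, and an oracle that already knows the clean signal stands in for a trained denoising network.

```python
import math
import random

random.seed(0)

# Toy 1-D "audio" signal: one cycle of a sine wave, 64 samples.
x0 = [math.sin(2 * math.pi * n / 64) for n in range(64)]

T = 50  # number of diffusion steps
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1 - b for b in betas]
alpha_bars, prod = [], 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

def noise_like(x):
    return [random.gauss(0, 1) for _ in x]

def forward(x0, t):
    """Corrupt x0 to step t: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps."""
    ab = alpha_bars[t]
    eps = noise_like(x0)
    return [math.sqrt(ab) * x + math.sqrt(1 - ab) * e
            for x, e in zip(x0, eps)]

def reverse_step(xt, t, x0_hat):
    """One denoising step given an estimate of x0 (here: an oracle)."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    b = betas[t]
    # Posterior mean of q(x_{t-1} | x_t, x0).
    c0 = math.sqrt(ab_prev) * b / (1 - ab_t)
    ct = math.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab_t)
    mean = [c0 * xi + ct * xti for xi, xti in zip(x0_hat, xt)]
    if t == 0:
        return mean
    # Fresh noise is injected at every step except the last -- this is
    # one source of the audible "hiss" in diffusion-generated audio.
    sigma = math.sqrt((1 - ab_prev) / (1 - ab_t) * b)
    return [m + sigma * e for m, e in zip(mean, noise_like(mean))]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

xt = forward(x0, T - 1)  # heavily noised version of the signal
noisy_mse = mse(xt, x0)
for t in range(T - 1, -1, -1):
    xt = reverse_step(xt, t, x0)  # a real model would *predict* x0
print(mse(xt, x0) < noisy_mse)  # → True: denoising recovers the signal
```

A real model’s x0 prediction is imperfect at every step, so some of the injected noise never fully cancels out — hence the hiss these samples still carry.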
Noise2Music
The quality of these generations feels comparable to early recordings one might hear in the first half of the 20th century. That said, the complexity and coherence of the compositions and arrangements are impressive, especially their faithfulness to the details of the text prompts. Click through to listen:
- “It is captivating, intense, mellifluous, engaging, and fervent. This music is an enthralling Sitar instrumental.”
- “The singer sings in a way that is calm and mellow, despite the message of the song suggesting that she is pleading for something. The song is a calm soulful R&B song, which has neo soul elements. The song has a slow jam style to it, and is emotional and romantic.”
- “The drums feature a light accompaniment, the piano has small interventions here and there. The jazz organ plays in low volume somewhere in the background. The atmosphere is like a dim light in a bar late at night before closing hours when everybody has left home.”
- “This is a groovy reggae song with a good vibe for dancing. The electric guitar stabs are on the off-beats and help create a bounce to the track. The vocalist is relaxed and there is an echo effect applied to her vocal.”
- “A female vocalist sings this upbeat Latin pop. The song has an upbeat rhythm with a dance groove. The drumming is lively, the percussion instruments add layers and density to the music, the bass line is simple and steady, the keyboard accompaniment adds a nice melody.”
EVEN MORE RESOURCES
Didn’t have time to drop into our Discord server this week? No worries. Stay up to date right here in your inbox with the best creative AI links and resources that our community members are sharing each week.
Shout-out to @cheriehu, @aflores, @s a r a h, @maartenwalraven, @moises.tech and @yung spielburg for curating this week’s featured links:
Music-industry case studies
- Lil Yachty’s AI-generated album cover for Let’s Start Here
- Jill Miller’s AI-generated NFT collection, in response to Ariel Pink using an image from Miller as an album cover without consent
- UK-based performing rights organization PRS for Music is streaming a debate TODAY (Feb 1) about the implications of AI on creative workers’ IP rights
- Metaphysic partners with CAA for generative AI enhancement of content from their talent roster
- Shopify begins integrating GPT-3 into its workflow automation tools
AI tools, models, and datasets
- Flavio Schneider’s repository of new music AI models — including all of the above plus the likes of RAVE2, Msanii, and VALL-E
- ElevenLabs — DIY voice AI tool
- Flawless — AI video editing
- OpenAI’s AI Text Classifier — allegedly detects ChatGPT-generated content
- AI Writing Check — independent ChatGPT content detector
Other articles and resources
- “Data Dividends” as a means of sharing AI profits with training data providers
- BuzzFeed stock soars nearly 120% on news of its deal with OpenAI to enhance content on the site
Follow more updates in Discord
- Keep an eye on our #ai-news-bulletin — our read-only channel where our research team curates the latest tweets related to creative AI news, tools, and developments, exclusively for members.
- Drop the coolest audio AI tools you find in #audio-ai-tools.
- Join the general community discussion in #ai-avatars.
If you’re not already in our Discord server, please authorize the Memberful Discord bot, which should automatically give you access. Make sure to hop in #intros and say hi once you’re in!