Google, MusicLM, and music AI’s UX problem
Today marks the debut of Bit Rate, our new member newsletter taking the pulse of music AI. In each issue, writer Yung Spielburg will break down a timely music AI development into accessible, actionable language that artists, developers, and rights holders can apply to their careers.
Please send feedback at any time by replying to this email or hopping into the #creative-ai channel in our Discord server.
Throughout our AI research, we’ve been pondering when music will get its “Midjourney moment” — namely, when high-quality song creation will become as easy for the everyday user as clicking a button.
But in playing around with MusicLM — the latest text-to-music generation model from Google — we realized that “Midjourney for music” might not even be what artists want.
Context: Why UX is important for music AI
A critical element missing from the generative music AI landscape today seems to be a smooth user experience (UX) on top of music AI models.
Because music lags behind other domains like visual art in model quality, music AI developers are spending much of their energy on improving the underlying models, and seemingly less on making them usable for people with no prior technical or musical knowledge.
This is fair; if the model quality is low, it doesn’t matter how great the UX on top is. Even the viral voice AI models making mainstream media headlines still require hands-on, manual music production to produce a viable “deepfake song” at best, and plenty of forum-digging and coding-adjacent work in Google Colab at their most involved.
That said, we are just beginning to see smoother UX roll out with producer tools such as BandLab, and AI model developers themselves are starting to release their own incremental UX improvements in real time.
Google’s text-to-music model MusicLM is one such example. The big-tech company has been on a music AI tear recently — rolling out four different generative music models so far this year, including MusicLM, Noise2Music, SingSong, and SoundStorm. MusicLM is the only one of these models that is available for public use, via a tightly managed waitlist that first opened on May 10, 2023. (Google first teased the model in late January via a series of published samples, which we reviewed here.)
Our research team at W&M got access to the MusicLM app and has been playing around with it over the last few weeks with a music producer’s eye. Even though the app is still a proof of concept, it remains one of the highest-quality raw music synthesis experiences on the market today.
The generated audio is not high-fidelity in the way we’d expect from, say, a hit radio single. But it definitely has character, and demonstrates a deep understanding of musical nuances across geographies, genres, emotions, and even performer skill levels.
Listen below to the output of the following prompt, which we wrote ourselves:
> a dancefloor-ready track that fuses soulful jazz-funk piano melodies and ambient electronic textures. Incorporate Latin-infused house beats with rhythmically complex sampling techniques. The track should feel unexpected yet cohesive, blending these elements seamlessly to inspire movement and dance.
All in all, this represents a major step toward practical, industry-facing use cases for this technology, especially for music supervisors, content creators, and music producers at large.
That said, there are still several limitations with MusicLM from the perspective of meeting music creators’ needs:
A. Lack of specificity
When we first started playing around with the MusicLM app, we ran into the exact same UX hurdle that we’ve seen with image generators like Midjourney: It’s easy to generate something generally cool, but difficult, if not impossible, to translate a very specific sound from our heads onto the page.
For instance, it’s easy with MusicLM to generate a cohesive, multi-instrumental output with a prompt like “emotional performance in the style of Son Cubano.” But if you try to generate a specific sound you’d find in a sample library, like “solo growling trumpet,” you’ll fall short. You could try running a stem separator over the multi-instrumental output to pull out a solo sound (we sketch that workaround below), but at that point the resulting artifacts may limit the tool’s usefulness.
The resulting experience feels equivalent to crate-digging with a search bar, instead of being able to dial into a specific creative vision in the context of a larger music project. Many producers in our network have mentioned how useful a text-to-sample tool would be for their day-to-day workflows — like an AI-generated version of Splice Sounds. Google seems not to be serving that use case, for now.
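To make this concrete, here is a minimal sketch of the generate-then-separate workaround mentioned above, assuming you’ve exported a clip from the MusicLM app and installed the open-source Demucs stem separator. File names are placeholders, and this is not an official MusicLM workflow:

```python
# Rough sketch of the generate-then-separate workaround (not an official
# MusicLM workflow). We assume a clip has already been exported from the
# MusicLM app and that the open-source Demucs separator is installed
# (pip install demucs). File and directory names are placeholders.

import subprocess
from pathlib import Path

GENERATED_CLIP = Path("musiclm_output.wav")  # clip exported from the MusicLM app

# Demucs' default model splits audio into four stems: drums, bass, vocals,
# and "other". A solo instrument like a growling trumpet lands in "other",
# mixed together with everything else that isn't drums, bass, or vocals.
subprocess.run(
    ["demucs", "--out", "separated", str(GENERATED_CLIP)],
    check=True,
)

# Stems are written to separated/<model_name>/<track_name>/
stem_dir = next(Path("separated").glob(f"*/{GENERATED_CLIP.stem}"))
for stem in sorted(stem_dir.glob("*.wav")):
    print(stem.name)  # bass.wav, drums.wav, other.wav, vocals.wav
```

Even in the best case, this gets you four coarse stems rather than the precise “solo growling trumpet” you had in mind, which is why a true text-to-sample tool remains on producers’ wish lists.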
B. Lack of direct citations of other artists
MusicLM will not generate anything from a prompt that directly cites existing entertainment IP — be that an existing artist, song, video game, or sports team. We noticed that while lesser-known titles may slip through the cracks, large brand names are consistently met with: “Oops. Can’t generate audio for that.”
This is, of course, important legal protection for Google, especially given MusicLM’s training data foundations. An underrated piece of information from the original MusicLM paper is that the model builds upon MuLan, which spans 44 million recordings totaling 370,000 hours (~42 years) of audio, and the Free Music Archive dataset, which includes 343 days of Creative Commons-licensed audio. Given the sheer scale of that data, we believe MuLan’s training set includes copyrighted material, which would raise a legal red flag if MusicLM were ever made fully public and explorable.
But completely removing the ability to interact with these tools using language that references IP is arguably unnatural, and unrepresentative of how the real creative process unfolds. It’s 100% normal (in fact, encouraged) for someone during a music production session to call out any number of direct cultural touchpoints: “Let’s do something with a Tarantino vibe” // “Skrillex meets Elephant Man” // “looking for the energy of The Chicago Bulls’ opening theme.”
Pop culture is a critical shared reference point for society, and direct callouts to existing IP are really effective ways of communicating across cultural boundaries.
We are not making a case to allow training on data without creators’ consent, nor are we condoning giving users the ability to reference that data in an unattributed way. But if music AI technology is really supposed to be a tool for today’s creators, they will need to be able to communicate with the tool effectively — which will require a lot of legal buttoning-up on the part of developers.
C. Loss in “text-to-music” translation
As part of our Season 3 report on creative AI, we interviewed Bronze co-founder/CEO Lex Dromgoole, who suggested that maybe “text-to-[blank]” is not the right format for music AI. While language is a powerful way to reference and describe music, it’s ultimately a “map” or “symbolic representation” of music that is a “diminished version of the experience itself,” and there may be a limitation in “using the map to recreate the thing you were trying to describe in the first place,” in Dromgoole’s words.
Audio-to-audio or visuals-to-audio tools may point to other creative possibilities that feel less limiting. MusicLM’s first published results featured impressive timbre transfer, where humming was converted into full orchestras or acoustic guitar. Others have experimented with visuals-to-music conversion, such as this image-to-music generator that plugs partly into Mubert’s music generation model.
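For a sense of how these visuals-to-music chains are typically wired, here is a minimal sketch, assuming an off-the-shelf image captioner (BLIP, via Hugging Face’s transformers library) on the front end and treating the actual music generation call as a placeholder; it is not a description of Mubert’s or Google’s internals:

```python
# Illustrative image-to-music pipeline: caption the image, then hand the
# caption to a text-to-music backend. The captioning step below is real
# (BLIP via Hugging Face transformers); generate_music_from_text() is a
# placeholder for whichever text-to-music model or API you plug in.

from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def image_to_music_prompt(image_path: str) -> str:
    """Describe the image in words, then frame the caption as a music prompt."""
    caption = captioner(image_path)[0]["generated_text"]
    return f"an instrumental track that evokes: {caption}"

def generate_music_from_text(prompt: str) -> bytes:
    """Placeholder: call your text-to-music model or API of choice here."""
    raise NotImplementedError(prompt)

if __name__ == "__main__":
    prompt = image_to_music_prompt("album_art.jpg")  # hypothetical input image
    print(prompt)
    # audio = generate_music_from_text(prompt)
```

The design choice worth noting is that the image never touches the music model directly; everything is funneled through text, which reintroduces the same “map versus territory” loss that Dromgoole describes.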
But the published results from these tools are still experimental and rough around the edges. And given the limitations of MusicLM’s tool, perhaps the one-click “Midjourney-for-music” model isn’t what artists are actually looking for, but rather something that allows for more granular exploration, precision, and control across a wider, seemingly infinite palette.
At the same time, within our current paradigm, this approach may not be particularly inventive; it may merely expand on existing tools like Splice. In the same way that creating in a DAW would be totally foreign to a concert pianist from the 18th century, the future of music creation with AI will likely take on a very different format from what we’ve seen so far. 🤖
Alexander Flores and Cherie Hu contributed editing and fact-checking to this article.
What our members are talking about
Didn’t have time to drop into our Discord server this week? No worries. Stay up to date right here in your inbox with the best creative AI links and resources that our researchers and community members are sharing each week.
Thanks to @yung spielburg, @aflores, @Kat, @brodieconley, @deklin.eth, @Mat O, and @Gareth Simpson for curating this week’s links. You can join the community discussion anytime in the #creative-ai channel. (If you’re not already in our Discord server, click here to get access.)
Music- and entertainment-industry case studies
- Universal Music Group x Endel
- Ableton’s guide to AI music-making
- ByteDance’s upcoming music AI creation app
- Segmenting music markets using ChatGPT
- Generated with Nendo, a new label of releases made with the generative music AI tool Nendo
AI tools, models, and datasets
- SoundStorm (parallel audio generation model from Google)
- MusicLM (text-to-music generation model from Google, covered above)
- ai.txt (Spawning’s new tool for websites to set AI training permissions)
- Anthropic’s 100K context windows (allowing for more content and data to be analyzed at a time — relevant for, say, generating text-based materials like legal contracts or marketing copy)
- Opus Clip (turn any video into short-form clips for TikTok, IG Reels, etc.)
Legal developments
- EU’s new AI Act is like “GDPR for AI”
- TikTok working on disclosure to flag videos made with generative AI
- OpenAI leaders propose international regulatory body for AI