Three design tenets for music's "Midjourney moment"

By: Cherie Hu, Yung Spielburg

Published: 2023-09-07

This essay is co-published with the Music X newsletter.

There’s never been a more fruitful time to build in music AI.

At Water & Music, we’ve been researching the impact of AI on the music industry for the last several years, culminating in our large-scale collaborative report on the topic in February 2023. At the time, no large-scale, openly accessible, properly licensed music AI model existed, putting music far behind other creative domains like visual art and text when it came to the maturity and usability of the AI tools available. There were many reasons for this, including but not limited to a lack of publicly available music training data, the technical difficulty of generating a coherent song over a longer period of time (at the expected rate of 44,100 samples per second), and the outsized influence of lawyers in the music business writ large.

We also claimed back then that it was only a matter of time before music reached its “Midjourney moment” — making high-quality music creation as easy for the everyday user as clicking a button.

That moment has now arrived. We’re seeing an influx of stakeholders from every corner — developers, artists, rights holders — race to build large-scale music models, ship improved user experiences on top of those models, and close industry partnership deals, all at an unprecedented commercial scale and technical quality.

Developers: Google has released four different music AI models in 2023 alone (most notably MusicLM), ByteDance is testing their new AI-powered music creation tool Ripple in private beta, and Meta made the surprising decision to make their large-scale audio AI models — MusicGen for music, and AudioGen for sound effects — 100% open-source. This has set a robust foundation for generative AI music hackathons and meetups this year, including Outside LLMs at Outside Lands and Water & Music’s own music AI demo night in NYC with Betaworks.
Artists: Thanks to help from both developers and the wider music industry, artists have unprecedented access to pipelines for creating, distributing, and monetizing their own audio AI creations, without needing to spend thousands of dollars on compute resources. Most of this progress has been in voice generation, as exemplified by platforms like Grimes’ Elf Tech (GrimesAI already has ~150,000 monthly Spotify listeners) and Kits AI (which has registered nearly six million voice conversions to date). As newer models like MusicGen gain traction, we should expect similar levels of activity with music composition in the near future.
Rights holders: Every major label now has a licensing deal or partnership with a generative music AI company. Universal Music Group is the flagship partner for YouTube’s new music AI incubator, and inked a licensing deal with generative soundscape app Endel. Warner Music Group holds equity stakes in several music AI startups, including Boomy, Authentic Artists, and LifeScore. Sony Music continues to collaborate with research group SonyCSL Music on prototyping AI tools for artists, and recently hired an EVP of AI.

We’re at a critical inflection point where the music AI tools being built now will set the stage for consumer behavior and industry dynamics for years to come.

Based on our research and on-the-ground conversations at Water & Music, we’ve identified three clear tenets driving momentum and excitement for today’s music AI builders and creators — each of which has direct implications for music AI’s future tech stacks, user experiences, and business models.

In short: Music’s “Midjourney moment” will be OPEN, EXPRESSIVE, and PARTICIPATORY.

OPEN — i.e. the importance of open-source

In the world of AI, keeping technology open-source is table stakes for driving both innovation and accountability. Music AI will be no different.

Earlier this year, a leaked internal memo from Google argued that the open-source AI ecosystem will soon outpace proprietary, closed models from larger corporations like Google and OpenAI, on both technical quality and commercial implementation. The underlying argument was that keeping models free and open-source allows for faster iteration and freer information flow among researchers and developers, leading to more rapid innovation and progress, especially around areas like personalization, safety, and mobile UX. Google must open-source more of its work, the memo concluded, or otherwise risk losing its competitive advantage in the AI market.

We’re seeing the benefits of transparency — and the costs of a rigid, tightly controlled approach — play out in real time with music AI.

For instance, at the Outside LLMs hackathon, nearly all the developers onsite rallied around Meta’s open-source MusicGen text-to-audio model, building everything from DAW plugins to audioreactive AI art generators. (You can view the full list of projects here.)

In contrast, none of the teams used Google’s closed text-to-audio models (AudioLM or MusicLM) for their work. The unspoken assumption was that Meta was the clear leader in music AI and had better adoption and goodwill with the developer community, in spite of Google having a long, storied history of R&D in the sector (including some earlier open-source music tools under their project Magenta).

Similarly, at Water & Music’s own music AI demo day, nearly all of our presenters made clear that they built their projects in part on open-source tech: Never Before Heard Sounds’ Sounds Studio weaving in open-source AI models like Dance Diffusion, developer Aaron Murray building on open-source voice conversion models like sovits and RVC to build GrimesAI, and artist Ana Roman incorporating open-source tools like Riffusion and Google Magenta into her performance practice.

Of course, there’s an inherent conflict in the concept of “open-source music,” especially once monetization comes into the picture. We’ve seen a similar tension play out in other frontier tech spaces like Web3: While technologies like blockchain and AI might function best when they are open and permissionless, the same arguably doesn’t apply to music rights. In fact, the very notion of copyright is built on established, centralized systems of permissioning and attribution.

That said, with music AI, the open-source advantage will be both technical and ethical. Transparency around how AI systems are built is vital to ensure safety and fair compensation for artists and rights holders; in contrast, closed models could give their parent companies asymmetric power over shaping culture and value flows, exacerbating existing music-industry inequities.

And importantly, there does exist a middle ground between completely open and closed music AI systems in a commercial context. For instance, while Grimes’ Elf Tech platform is built on open-source models — and while technically anyone can distribute songs made with GrimesAI at their own will — the pipeline for official endorsement from (and revenue share with) Grimes on streaming services still relies on tightly controlled, gated verification from her team.

EXPRESSIVE — or, the need for creative nuance, not just automation

The most exciting and widely adopted tools we’re seeing in music AI are not trying to create an instant banger and automate away human creativity. Rather, they’re assisting both professional and casual artists in expressing themselves in unique, nuanced ways that wouldn’t be possible with other technology.

Under this paradigm, music AI tooling is a creative collaborator, providing feedback, inspiration, and raw materials for artists to build on, as if a painter’s brush were also their brainstorming partner. As Jacky Lu, Co-Founder and Creative Director at Kaiber, articulated at Water & Music’s AI demo day: “How do you share your creative vision, and how can we make that dialogue into something a computer can understand?”

In previous research at Water & Music, we’ve argued that automation and flexibility are not necessarily mutually exclusive when building music AI tools, and that providing creative flexibility both pre- and post-generation is critical for ensuring long-term satisfaction and retention.

This interplay is on full display in the upcoming generation of music AI tools. AbleGen, one of the winners of the Outside LLMs hackathon, is structured as a plugin that allows users to generate audio within Ableton, using a combination of Max for Live and the text-to-music capabilities of Meta’s MusicGen.

This setup gives the user granular creative control on multiple levels. It is still up to the user to decide how to incorporate the auto-generated musical fragments into a wider DAW production environment, and writing text prompts that output specific sounds or instruments also takes a lot of creativity and context-setting. For instance, as shared at Outside LLMs, this is the text prompt required to generate a convincing, few-second string loop at 120 BPM using ChatGPT:

Hi, I am a music producer who is currently using Ableton Live to produce an instrumental track. The BPM of the song is 120bpm which means I need 3 seconds of audio to fill 1 bar. Can you please generate 12 seconds of audio using strings? The output could be of a solo string instrument such as a violin, viola, cello, or double bass. Or of a group of string players such as a quartet, or of a large group of players such as a chamber orchestra, orchestra, or philharmonic.
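As a quick sanity check on the timing in this prompt, the bar arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of the AbleGen project: the function names are hypothetical, and the 4/4 default is an assumption, since the prompt does not state a time signature. Under that assumption one bar at 120 BPM lasts 2 seconds, while the prompt's figure of 3 seconds per bar corresponds to six beats per bar.

```python
def seconds_per_bar(bpm: float, beats_per_bar: int = 4) -> float:
    """Length of one bar in seconds; each beat lasts 60 / bpm seconds."""
    return beats_per_bar * 60.0 / bpm


def bars_in_clip(clip_seconds: float, bpm: float, beats_per_bar: int = 4) -> float:
    """How many bars a clip of the given length covers at this tempo."""
    return clip_seconds / seconds_per_bar(bpm, beats_per_bar)


if __name__ == "__main__":
    # Assumed 4/4 time (not stated in the prompt): 4 beats * 0.5 s each.
    print(seconds_per_bar(120))       # 2.0
    # Six beats per bar reproduces the prompt's "3 seconds to fill 1 bar".
    print(seconds_per_bar(120, 6))    # 3.0
    # The requested 12-second clip, measured in bars under each reading.
    print(bars_in_clip(12, 120))      # 6.0
    print(bars_in_clip(12, 120, 6))   # 4.0
```

Under either reading, the requested 12 seconds is a whole number of bars (six bars in 4/4, four bars at six beats per bar), which is what matters for a loop that has to line up with the DAW grid.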

Similarly, while tools like Elf Tech, Kits, and Sounds Studio give users easy access to AI-generated voices, it is ultimately on humans to provide the wider creative vision in these cases, weaving vocal generations into their own holistic compositions and style. Case in point: We released our own GrimesAI song “Eggroll” earlier this summer, where we used AI only for vocal transformation, leaving every other step of the production process — including music composition, arrangement, lyric writing, mixing, and mastering — up to us.

Keeping the artist in the driver’s seat will certainly be a requirement for adoption when it comes to music-industry partnerships, and it matches how artists are naturally using AI in industry dealings anyway. For example, as Billboard recently reported, songwriters are beginning to experiment with using AI to craft demos with AI clones of artists’ voices already built in, in the hopes of improving the pitching process by enabling said artists to envision themselves on the track more clearly.

In these cases, while AI streamlines the creative process, the pitch and business deal are still driven by the ultimate purpose of representing an artist’s or songwriter’s holistic creative vision, and helping multiple human beings collaborate better.

PARTICIPATORY — or, the social extension of creative universes

In today’s music industry, fans are no longer merely passive observers or consumers; they are increasingly the pilots steering the course of culture, and even chart performance.

Generative AI accelerates this trend by serving as a conduit for any fan on the Internet to contribute to and be recognized officially in an artist’s creative universe. As James Pastan, co-founder of Arpeggi Labs (the maker of Kits AI), shared at Water & Music’s demo day: “Fandom is becoming more intimate and collaborative. Fans want to do more than consume content: They want to be involved in the creation of and story behind that content, commercial or otherwise.”

Put another way, the long-term impact of AI will be as social as it is legal or aesthetic — making the fluid remixing of fanfiction and the bottom-up participatory dynamics of platforms like TikTok the norm, rather than the exception, in cultural economies at large.

As music AI models reach a functional level of technical quality, now is the time to study and experiment with emerging, bottom-up forms of fan behavior around AI tools — particularly around using AI to strengthen fan-to-fan and artist-to-fan relationships and amplify, rather than erode, our appreciation of art.

Grimes’ Elf Tech platform has set a strong precedent for how future AI fandoms could look under the hood — not only facilitating a creative dialogue between artists and fans, but also aligning financial incentives around the hundreds of subsequent songs created.

This multi-way, participatory approach doesn’t have to apply just to audio. For instance, at Outside LLMs, there was no shortage of teams working on tools for automating visual content like music videos, lyric videos, and other general promo assets. Aside from streamlining the digital marketing process for artists, these use cases could potentially foster a new paradigm for fan art, opening up new creative avenues for fans to show their dedication to an artist and contribute to their success in multimedia ways.

At the dawn of music’s Midjourney moment, the music industry’s great challenge and privilege will be to wield the onslaught of upcoming AI innovations both imaginatively and responsibly. By embracing these three design tenets of openness, expressivity, and participation, the next generation of music AI builders, creators, and partners can lean into the unique characteristics that already make both the music and AI industries flourish, while also paving paths for uncharted territories in creativity and business.
