Voices "R" Us: The future of voice models in the music business

Bit Rate is our member vertical on music and AI. In each issue, we break down a timely music AI development into accessible, actionable language that artists, developers, and rights holders can apply to their careers, backed by our own original research.

This issue was originally sent out under our collaborative research vertical, The DAOnload.


One major theme in our creative AI research is that music remains far behind visual art and text when it comes to the quality and maturity of AI tooling available. There are many reasons behind this lag, from a severe lack of training data to the outsized influence of copyright lawyers in the music business.

That said, there’s one area of audio AI that is already quite mature, and making a significant mark on the music business: Voice synthesis — or the process of generating artificial voices straight from text.

Broadly speaking, the concept of computer-generated voices guiding our lives is nothing new — especially thanks to the popularity of voice assistants like Siri and Alexa, which launched in 2011 and 2014, respectively. Even music tools that leverage voice AI predate modern voice assistants: Yamaha released its first Vocaloid product in 2004, and the iconic Hatsune Miku made her debut in 2007.

That said, Vocaloid software is… tedious, to say the least. The level of hands-on manipulation needed to get a solid result is akin to meticulously tuning and retuning vocals or toying with parameters on a synth — a far cry from auditioning an idea the usual way, by just singing it into a mic. As W&M member @robcamp has shared in the #ai-avatars channel of our Discord server, the process of working with Vocaloid software remains a bit fragmented, still requiring an experienced music producer to create the Vocaloid tracks and a separate songwriter (or GPT-3) to write the lyrics.
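To make that tedium concrete, here is a rough, hypothetical sketch of the data a Vocaloid-style editor asks a producer to specify: every syllable is its own event, with a pitch, a timing, and expression parameters to dial in by hand. (The NoteEvent structure and parameter names below are illustrative, not Vocaloid's actual format.)

```python
from dataclasses import dataclass

@dataclass
class NoteEvent:
    """One sung syllable in a hypothetical Vocaloid-style editor.

    Every field below is something the producer tunes by hand,
    note by note, versus just singing the idea into a mic.
    """
    phoneme: str          # e.g. "la"
    midi_pitch: int       # 60 = middle C
    start_beat: float     # position within the phrase
    length_beats: float   # duration
    vibrato_depth: float = 0.0   # expression parameters; in practice
    brightness: float = 0.5      # each one is an automation curve

# A single two-word phrase already takes four hand-placed events:
phrase = [
    NoteEvent("he", 64, 0.0, 0.5),
    NoteEvent("llo", 62, 0.5, 0.5, vibrato_depth=0.3),
    NoteEvent("wor", 60, 1.0, 0.75),
    NoteEvent("ld", 59, 1.75, 1.25, brightness=0.7),
]

for note in phrase:
    print(note)
```

Multiply that by every syllable in a song, plus revisions, and the producer-plus-songwriter division of labor that @robcamp describes starts to make sense.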

Still, the software’s most famous virtual avatar representatives, including Hatsune Miku, have millions of YouTube subscribers and have sold out stadiums around the world.

The new generation of voice modeling for artists

The acceleration of AI development in the last six months points to a new, more seamless generation of voice synthesis. Several tools exist today that allow artists and brands alike to generate high-quality, convincing AI voices with just minutes of training data.

In past issues of the DAOnload, we’ve mentioned several case studies of major artists, rights holders, and tech companies investing in or being impacted directly by voice AI. Just a handful of recent examples:

Some of the above examples might sound kitschy — but the ethical and legal questions that they pose to the music industry certainly are not.

In music, we already have robust, scaled infrastructure to facilitate consent, attribution, and compensation for authors of sampled works — not so for modeling someone’s work with AI. It is one thing to sample audio directly, but another thing entirely to create in someone’s likeness. That is a different paradigm, one where consent and authenticity arguably matter even more than attribution and compensation. For this reason, leading artists and developers use different terms from sampling to describe the creative work involved (e.g. “spawning,” in the case of Mat Dryhurst and Holly Herndon).

It’s tempting to look at the history of sampling and map a similar trajectory onto “spawning” — namely, to assume that once infrastructure for consent, authenticity, attribution, and compensation gets put into place, another creative and commercial boom must be around the corner.

There’s a core ethical issue that complicates this comparison: While sampling puts one’s existing work into a new context, spawning opens up the possibility of being portrayed doing something one never did, or saying something one never said. Even with full permission, that prospect is rightfully scary to many in the music industry. However, it is instructive to see artists finding creative ways to lean into this new technology and exercise some control while they still can.

Our Season 3 research on AI business models — and specifically the contributions of one of our core analysts, Kristin Juel (@Juel Concepts) — speaks to how artists would do well to proactively assemble quality data to train and fine-tune their own personal AI models as the technology becomes more accessible and affordable. One immediate use case for this technology would be karaoke bars — imagine not only singing a Beyoncé song, but singing it in Beyoncé’s voice.
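For artists who want to start assembling that data today, the prep work can be as simple as slicing clean solo-vocal recordings into labeled clips. Below is a minimal sketch using the open-source pydub library; the file paths and silence thresholds are placeholders, and any given fine-tuning service will have its own format requirements.

```python
# Minimal sketch: slice a clean vocal recording into clips for a
# future fine-tuning dataset. Requires pydub (pip install pydub)
# and ffmpeg. Paths and threshold values below are placeholders.
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import split_on_silence

SOURCE = Path("recordings/lead_vocal_take.wav")   # hypothetical file
OUT_DIR = Path("training_clips")
OUT_DIR.mkdir(exist_ok=True)

audio = AudioSegment.from_file(SOURCE)

# Split wherever there's at least 700 ms of near-silence.
clips = split_on_silence(
    audio,
    min_silence_len=700,     # ms
    silence_thresh=-40,      # dBFS; tune per recording
    keep_silence=150,        # keep a little breathing room
)

exported = 0
for i, clip in enumerate(clips):
    # Very short fragments are rarely useful as training examples.
    if len(clip) < 2_000:    # ms
        continue
    clip.export(OUT_DIR / f"{SOURCE.stem}_{i:03d}.wav", format="wav")
    exported += 1

print(f"Exported {exported} candidate clips to {OUT_DIR}/")
```

The point is less the specific tool than the habit: clean, well-organized recordings of your own voice are an asset you control, whatever model you eventually train on them.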

One caveat is that the winner-take-all power law that we’re used to seeing across other music-industry revenue streams could very well apply to voice AI monetization, where the high demand for big-brand artists’ models crowds out commercial appetite for lesser-known artists.

Listen: Water & Music’s own voice AI experiment

At Water & Music, our core team has been making use of voice AI as an experiment to scale the reach of our research and writing on a lower budget — with an AI voice model of Cherie Hu narrating our ongoing Starter Pack series. You can listen to our AI recordings to date on our SoundCloud page.

Our AI explorations at large have been pioneered and executed by our tech and strategy lead Alex (@aflores), who outlined the process of building Cherie’s voice clone in this Twitter thread. For voice cloning, our tech stack included tools like Adobe’s Podcast Enhancer for audio upscaling and ElevenLabs’ off-the-shelf tool for text-to-speech generation.
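For the technically curious, below is a minimal sketch of what a call to ElevenLabs' v1 text-to-speech API looked like at the time of writing. This is not the exact script Alex used, and the API key, voice ID, and voice_settings values are placeholders.

```python
# Minimal sketch of ElevenLabs' v1 text-to-speech endpoint.
# Not the exact script our team used; the API key, voice ID, and
# voice_settings values below are placeholders.
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"   # placeholder
VOICE_ID = "YOUR_CLONED_VOICE_ID"     # ID of the cloned voice

url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

response = requests.post(
    url,
    headers={
        "xi-api-key": API_KEY,
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",
    },
    json={
        "text": "Welcome to the latest issue of Bit Rate.",
        "voice_settings": {
            "stability": 0.5,          # lower = more expressive
            "similarity_boost": 0.75,  # higher = closer to the source voice
        },
    },
)
response.raise_for_status()

# The endpoint returns raw MP3 audio.
with open("narration.mp3", "wb") as f:
    f.write(response.content)
```

In practice, most of the effort sits upstream of this call: preparing clean, upscaled training audio (that's where a tool like Adobe's Podcast Enhancer comes in) and chunking long newsletter text into passages the model narrates well.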

Are these tools really more efficient? It depends on what the optimization function is. Does it save Cherie time? Yes. Did it take Alex longer to create a usable recording than it would for Cherie to simply record? Perhaps upfront in terms of preparing initial training data, but certainly not in the long term as the technology improves.

Regardless, initial feedback from the community suggests that these audio versions of our newsletters offer a genuinely valuable, time-saving way to consume our research — not to mention a meaningful way to experience our writing through the lens of Cherie’s voice. This ties back to how voice models, when deployed ethically and consensually, can be meaningful extensions of people’s brands.


EVEN MORE RESOURCES

Didn’t have time to drop into our Discord server this week? No worries. Stay up to date right here in your inbox with the best creative AI links and resources that our community members are sharing each week.

Shout-out to @cheriehu, @aflores, @moises.tech, @BenLondon12, @yung spielburg, and @KatherineOlivia for curating this week’s featured links:

Music-industry case studies

AI tools, models, and datasets

Other resources

Follow more updates in Discord

If you’re not already in our Discord server, please authorize the Memberful Discord bot, which should automatically give you access. Make sure to hop in #intros and say hi once you’re in!