Starter Pack: Music AI deepfakes

By: Cherie Hu

Published: 2023-04-18

W&M STARTER PACKS is a free series unpacking essential foundational concepts for navigating music and tech. In each issue, we ground a timely music-tech topic in evergreen findings from our research projects.

What do Drake, Rihanna, Eminem, Harry Styles, and Ariana Grande have in common?

They’re all massive stars, of course. But there’s something thornier: Unlicensed deepfake songs with AI versions of their voices have generated hundreds of thousands of streams on TikTok, YouTube, and Spotify in the last few weeks.

For the most part, it isn’t evil corporations generating this deepfake music (UMG could only wish). Instead, it’s everyday music fans — who are using AI and homegrown training datasets to concoct the ridiculous songs of their dreams, like playing fan-fiction Mad Libs. Kanye covering Gotye? Ariana Grande covering SZA? Drake rapping about how he doesn’t want beans in his chili? Literally anything is possible.

In February 2023, we released an in-depth report on creative AI’s legal, ethical, and commercial implications for the music industry. Throughout, we provided accessible frameworks for artists and music-industry professionals to navigate forthcoming watershed moments in music AI adoption — like exactly what we’re dealing with right now.

Armed with this research, we’ve put together an intro breakdown below to help you understand how deepfake songs work, how we got here, and what kinds of battles to expect moving forward. If you’re curious, scared, or just plain confused about the current landscape, look no further: we’re here to help!

Is this trend new?

No. In fact, music tools that leverage voice AI predate modern voice assistants: Yamaha released its first Vocaloid product (which allows users to synthesize voices with lyrical and melodic inputs) in 2004, and the iconic Vocaloid mascot Hatsune Miku had her first release in 2007. While Vocaloid software is tedious to use, its most famous virtual avatars have millions of YouTube subscribers and have sold out stadiums around the world.

The acceleration of generative AI development in the last six months points to a new, more seamless generation of raw voice synthesis. Today, several off-the-shelf tools, including Uberduck, ElevenLabs, and Descript, allow artists and brands to generate high-quality, convincing AI voices with just minutes of training data.

How are deepfake songs made?

There are a few different approaches in the market.

A big misconception is that all deepfake songs are completely generated with AI. In many of the highest-profile cases, this isn’t true. Instead, a human producer still writes, records, and arranges the underlying musical elements — including melody, harmony, beats, and sometimes the base vocal itself — and then overlays and adjusts the synthesized celebrity voice on top to fit the overall production. (YouTube creator Roberto Nickson took this approach for his Kanye AI demonstration.)

In the case of AI covers, fans isolate the vocals from the original song with a stem separation tool, use a voice transfer model to convert those vocals into another celebrity’s voice, then re-stitch the new vocal track with the original production. Diff-SVC is an especially popular voice transfer model for this purpose. (If you’ve ever heard someone mention “timbre transfer,” they are talking about converting one sound’s tonal quality to another, like voice-to-flute, voice-to-piano, or cello-to-sax. Voice-to-voice is a specific use of timbre transfer.)
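The three-step cover workflow above can be sketched as a small pipeline. This is a minimal sketch, not how any particular tool works: `CoverPipeline`, `toy_separate`, and `toy_convert` are hypothetical names, audio is simplified to a list of mono samples, and the two toy functions are stand-ins for a real stem separation tool and a real voice transfer model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Audio = List[float]  # simplified: mono samples in the range [-1.0, 1.0]


@dataclass
class CoverPipeline:
    """Hypothetical wrapper for the three steps of an AI cover."""
    separate: Callable[[Audio], Tuple[Audio, Audio]]  # mix -> (vocals, instrumental)
    convert: Callable[[Audio], Audio]                 # vocals -> vocals in target voice

    def run(self, mix: Audio) -> Audio:
        vocals, instrumental = self.separate(mix)      # 1. isolate the vocals
        new_vocals = self.convert(vocals)              # 2. voice-to-voice transfer
        if len(new_vocals) != len(instrumental):
            raise ValueError("converted vocals and instrumental must be the same length")
        # 3. re-stitch: sum the two stems and clip to the valid sample range
        return [max(-1.0, min(1.0, v + i)) for v, i in zip(new_vocals, instrumental)]


def toy_separate(mix: Audio) -> Tuple[Audio, Audio]:
    """Placeholder only: pretends half of each sample is vocal. Not real separation."""
    half = [s * 0.5 for s in mix]
    return half, list(half)


def toy_convert(vocals: Audio) -> Audio:
    """Placeholder only: halves the level. A real model would change the timbre."""
    return [v * 0.5 for v in vocals]


pipeline = CoverPipeline(separate=toy_separate, convert=toy_convert)
cover = pipeline.run([0.0, 0.5, -1.0])
```

Keeping the separation and conversion steps as swappable functions mirrors how fans mix and match tools: the original production is untouched, and only the vocal stem passes through the voice model.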

In other cases, though, the entire production is AI-generated from scratch. The now-shuttered app drayk.it, for example, allowed users to generate a Drake song with just a text prompt. In these cases, developers layer separate models for lyric generation, voice synthesis, and music synthesis behind a one-click user experience.
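The layering in a fully generated song can be pictured as one function that chains three models behind a single call. This is a minimal sketch with stub models: `generate_lyrics`, `synthesize_voice`, `synthesize_music`, and `one_click_song` are hypothetical names, and each stub returns a labeled placeholder string rather than real audio. Nothing here reflects how drayk.it was actually built.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Song:
    lyrics: List[str]
    vocal_track: str
    backing_track: str


def generate_lyrics(prompt: str) -> List[str]:
    """Stub for a text-generation model."""
    return [f"Verse about {prompt}", f"Hook about {prompt}"]


def synthesize_voice(lyrics: List[str], voice: str) -> str:
    """Stub for a voice-synthesis model; returns a description, not audio."""
    return f"{voice} voice performing {len(lyrics)} lines"


def synthesize_music(prompt: str) -> str:
    """Stub for a music-synthesis model."""
    return f"instrumental inspired by {prompt}"


def one_click_song(prompt: str, voice: str) -> Song:
    """The user supplies only a prompt; every other step is automated."""
    lyrics = generate_lyrics(prompt)
    return Song(lyrics, synthesize_voice(lyrics, voice), synthesize_music(prompt))


song = one_click_song("beans in chili", voice="cloned")
```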

With today’s tools, generating a convincing deepfake of an artist’s voice takes just a few minutes of vocal samples. Those samples still need to be high-quality and reflect the style and cadence that the user wants in the final output. And the generated song may still require some post-processing to sound more polished and professional, perhaps with AI-assisted mixing and mastering tools like those from iZotope and LANDR. But because so little training input is needed, seasoned vocalists are likely sitting on more than enough data to build their own AI clone.

What are the legal issues with deepfake songs?

There are a few key starting questions to unpack the web of legal complexities around a given deepfake song:

Is the training data licensed? Most of the notable music AI models of the last few years, like Google’s MusicLM and OpenAI’s Jukebox, are built on millions of recordings’ worth of training data, much of which is copyrighted. If a developer painstakingly gathered Ariana Grande vocal samples at home to train their custom Ari voice model, they likely used part of a copyrighted recording without the original owner’s permission. Training data also sometimes travels across platforms without the original owners knowing; for instance, Apple once trained their own AI voice narration for audiobooks on Spotify-owned audiobook data.

Do artists get compensated for being included in training datasets? As we covered in previous research, consent, attribution, and compensation for AI model training data are less clear-cut than traditional sampling in the music industry. There is no standard for compensating artists for AI training; some platforms buy out vetted samples directly from artists, while others simply ask for forgiveness rather than permission. (Mubert and Infinite Album are two of the only music AI startups we know of that can pay artists royalties on revenue generated from AI outputs using their training data.)

How will streaming services handle AI-generated music? The answer depends in part on how streaming services and music distributors flag AI-generated content in their moderation efforts. As it turns out, distinguishing between fully AI-generated and merely AI-assisted content is challenging, because the output alone doesn’t always provide clear indications.

Regardless, major labels are starting to go after deepfakes and speak out against impersonation use cases of AI. In a statement, UMG shared: “We have a moral and commercial responsibility to our artists to work to prevent the unauthorized use of their music and to stop platforms from ingesting content that violates the rights of artists and other creators. We expect our platform partners will want to prevent their services from being used in ways that harm artists.”

Are audio “deepfakes” all bad?

Not necessarily. Deepfakes are a use of voice-model tech, but not all voice modeling is “deepfaking.” When deployed ethically and consensually, voice AI models can be meaningful extensions of artists’ brands.

Hollywood has used consensual voice AI to preserve celebrities’ legacies — such as Lucasfilm and James Earl Jones partnering with Respeecher to recreate the actor’s iconic villain voice from 45 years ago, or Val Kilmer partnering with Sonantic (now owned by Spotify) to create an AI voice for “Top Gun: Maverick” after his throat cancer treatment.

In the music industry, some artists are exploring new economic models around voice impersonation. For example, electronic artist Holly Herndon worked with Never Before Heard Sounds to build her custom Holly+ voice model, which she licensed to a closed network of collaborators in exchange for a revenue share from subsequent works featuring her voice.

Of course, for every positive case study, there may well be tens or hundreds more examples of deepfake songs that are non-consensual at best, and harmful at worst. Artists and their teams would do well to stay informed and proactive by experimenting with the AI tools available to them — including assembling quality training data for fine-tuning their own personal AI models in the future, and developing a strong, tech-agnostic brand identity elsewhere to best weather the coming storm.

We’ll be discussing music AI deepfakes and other emerging tech trends at our inaugural Wavelengths Summit, taking place on May 6 in Brooklyn, NYC. Grab your ticket today!