This article was contributed by Taesu Kim, CEO of Neosapience.

An AI revolution is underway in content creation. Voice technologies in particular have made tremendous leaps forward in the past few years. While this could lead to a myriad of new content experiences, not to mention dramatically reduced costs for content development and localization, there are ample concerns about what the future holds.

Imagine being known for a distinctive voice and relying on it for your livelihood — actors like James Earl Jones, Christopher Walken, Samuel L. Jackson, Fran Drescher, and Kathleen Turner, or musicians such as Adele, Billie Eilish, Snoop Dogg, or Harry Styles. If a machine were trained to replicate those voices, would the artists lose all artistic control? Would they suddenly be providing voice-overs for a YouTube channel in Russia? And, practically speaking, would they miss out on potential royalties? What about the person who's looking for a break, or simply a way to make some extra cash by licensing their voice or likeness digitally?

A voice is more than a compilation of sounds

There is something tremendously exciting about typing a series of words, clicking a button, and hearing your favorite superstar read them back, sounding like an actual human with natural rises and falls in their speech, changes in pitch, and intonation. This is nothing like the robotic delivery we've come to expect from AI-created characters. Instead, the character you build comes to life with all of its layered dimensions.

This depth is what had been lacking from virtual actors and virtual identities previously; the experience was, quite frankly, underwhelming. But modern AI-based voice technology can reveal the construction of an identity whose intricate characteristics come out through the sound of a voice. The same can be true of AI-based video actors that move, gesture, and use facial expressions identical to those of humans, providing the nuances inherent in a being without which characters fall flat.

As technology improves to the point that it can acquire true knowledge of each characteristic of a person's surface identity — their looks, sounds, mannerisms, tics, and anything else that makes up what you see and hear from another, excluding their thoughts and feelings — that identity becomes an actor that can be deployed not only by major studios in big-budget films or album releases. Anyone can select that virtual actor using a service like Typecast and put it to work. The key here is that it is an actor, and even novice actors get paid.

Understandably, there is some fear about how such likenesses can be co-opted and used without licensing, consent, or payment. I would liken this to the issues we’ve seen as any new medium has come onto the scene. For example, digital music and video content that were once thought to rob artists and studios of revenue have become thriving businesses and new money-makers that are indispensable to today’s bottom line. Solutions were developed that led to the advancement of technology, and the same holds true again.

Preservation of your digital and virtual identity

Each human voice — as well as each face — has its own unique footprint, made up of tens of thousands of characteristics. This makes it extremely difficult to replicate. In a world of deepfakes, misrepresentation, and identity theft, a number of technologies can be put to work to prevent the misuse of AI speech synthesis or video synthesis.

Voice identity or speaker search is one example. Researchers and data scientists can identify and break down the characteristics of a specific speaker’s voice. In doing so, they can determine whose unique voice was used in a video or audio snippet, or whether it was a combination of many voices blended together and converted through text-to-speech technology. Ultimately, such identification capabilities can be applied in a Shazam-like app. With this technology, AI-powered voice and video companies can detect if their text-to-speech technology has been misused. Content can then be flagged and removed. Think of it as a new type of copyright monitoring system. Companies including YouTube and Facebook are already developing such technologies for music and video clips, and it won’t be long until they become the norm.
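At its core, speaker search of this kind typically compares fixed-length "voiceprint" embeddings produced by a trained speaker encoder. The sketch below is a minimal illustration of that matching step only — the embeddings, names, and threshold are made up for the example, and a real system would use a trained encoder producing embeddings with hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(query, enrolled, threshold=0.75):
    """Return the enrolled speaker whose embedding best matches the
    query, or None if no candidate clears the decision threshold."""
    best_name, best_score = None, threshold
    for name, embedding in enrolled.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy 4-dimensional embeddings standing in for real voiceprints.
enrolled = {
    "alice": np.array([0.9, 0.1, 0.0, 0.2]),
    "bob":   np.array([0.1, 0.8, 0.5, 0.0]),
}
query = np.array([0.85, 0.15, 0.05, 0.25])  # close to "alice"
print(identify_speaker(query, enrolled))
```

The same comparison run against a catalog of licensed voices is what would let a Shazam-like service flag a clip as containing — or blending — a particular person's voice.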

Deep fake detection is another area where significant research is being conducted. Technology is being developed to distinguish whether a face in a video is an actual human or one that has been digitally manipulated. For instance, one research team has created a system based on a convolutional neural network (CNN) to pull features at a frame-by-frame level. It can then compare them and train a recurrent neural network (RNN) to classify videos that have been digitally manipulated — and it can do this rapidly and at scale. 
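The shape of that pipeline — per-frame features feeding a recurrent classifier — can be sketched in a few lines. This is a toy numpy illustration of the structure only, not the research system itself: the "CNN" is a bank of random filters, the weights are untrained stand-ins, and the output probability is therefore meaningless until trained on real and manipulated footage:

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_features(frame, filters):
    """Stand-in for a CNN: correlate one frame against a bank of
    2-D filters, yielding one pooled response per filter."""
    return np.array([np.mean(frame * f) for f in filters])

def rnn_classify(feature_seq, Wx, Wh, w_out):
    """Minimal recurrent pass over the per-frame feature sequence,
    ending in a sigmoid 'manipulated' probability."""
    h = np.zeros(Wh.shape[0])
    for x in feature_seq:
        h = np.tanh(Wx @ x + Wh @ h)
    logit = w_out @ h
    return 1.0 / (1.0 + np.exp(-logit))

# Toy "video": 8 frames of 16x16 pixels; all weights are random
# placeholders for what training would actually produce.
video = rng.normal(size=(8, 16, 16))
filters = rng.normal(size=(4, 16, 16))
Wx = rng.normal(size=(6, 4)) * 0.5
Wh = rng.normal(size=(6, 6)) * 0.5
w_out = rng.normal(size=6)

features = np.stack([frame_features(f, filters) for f in video])
p_fake = rnn_classify(features, Wx, Wh, w_out)
```

Because the recurrent step carries state across frames, the classifier can pick up temporal inconsistencies — flicker, warping between frames — that a single-frame check would miss, which is what makes this two-stage design effective at scale.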

These solutions may make some people uneasy, as many are still in development, but those fears can be put to rest. Detection technologies are being created proactively, with an eye toward future need. In the meantime, consider where we stand right now: synthesized audio and video must be extremely sophisticated before it can convincingly clone and deceive.

An AI system designed to produce voice and/or video can only learn from a clean dataset. Today, that effectively means filming or recording done in a studio. It is remarkably difficult to have data recorded in a professional studio without the consent of the data subject; studios are not willing to risk a lawsuit. Data crawled from YouTube or other sites, by contrast, is so noisy that it can only produce low-quality audio or video, which makes illegitimate content simple to spot and remove. This automatically rules out the parties most likely to manipulate and misuse digital and virtual identities. While it will eventually become possible to create high-quality audio and video from noisy datasets, detection technologies will be ready well in advance, providing ample defense.

Virtual AI actors are still part of a nascent space, but one that is accelerating quickly. New revenue streams and content development possibilities will continue to push virtual characters forward. This, in turn, will provide ample motivation to apply sophisticated detection and a new breed of digital rights management tools to govern the use of AI-powered virtual identities.

Taesu Kim is the CEO of Neosapience.
