Meta has released AudioCraft, a new open-source generative AI framework that can produce music and sound effects from simple text prompts.
AudioCraft enables high-quality, realistic audio and music generation from text-based user inputs. It aims to change how music is made by empowering professional musicians to explore new compositions, indie game developers to enhance their virtual worlds with sound effects, and small business owners to add soundtracks to their Instagram posts, all with ease.
AudioCraft is a collection of three models: MusicGen, AudioGen, and EnCodec. MusicGen generates music from text-based user inputs, while AudioGen does the same for ambient sounds and sound effects. MusicGen is trained on Meta-owned and specifically licensed music, and AudioGen on public sound effects. With this release, Meta is also sharing an improved version of its EnCodec decoder, which allows higher-quality music generation with fewer artifacts, along with the pre-trained AudioGen model and all of the AudioCraft model weights and code.
AudioCraft builds on the enormous strides made in recent years by generative AI models, including language models, which can generate images, videos, and text from user descriptions and demonstrate advanced spatial understanding. Audio generation, however, has lagged behind because of its complexity, and this is where AudioCraft makes a significant difference.
AudioCraft’s source code is available on GitHub, so researchers and practitioners can access the models and train them on their own datasets.
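As a rough illustration of what working with the released models looks like, the sketch below loads a pretrained AudioGen checkpoint and generates short sound-effect clips from text prompts. It assumes the Python interface documented in the AudioCraft GitHub repository; the package name, checkpoint identifier, and helper functions may differ between releases and should be treated as assumptions.

```python
# A minimal sketch, assuming the interface documented in the AudioCraft
# GitHub repository (installed with `pip install audiocraft`); the
# checkpoint identifier and prompts below are illustrative.
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# Load a pretrained AudioGen checkpoint for text-to-sound-effect generation.
model = AudioGen.get_pretrained('facebook/audiogen-medium')

# Limit each generated clip to five seconds of audio.
model.set_generation_params(duration=5)

descriptions = ['dog barking in the distance', 'rain falling on a tin roof']
wav = model.generate(descriptions)  # one waveform tensor per description

for i, one_wav in enumerate(wav):
    # Write each clip to disk with loudness normalization.
    audio_write(f'audiogen_sample_{i}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```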
AudioCraft simplifies the generation of high-fidelity audio, which has traditionally required modeling complex signals and patterns at varying scales. Music is especially demanding because it involves both local and long-range patterns, so generative approaches have relied heavily on symbolic representations such as MIDI or piano rolls. These representations, however, cannot capture the expressive nuances and stylistic elements found in music. The AudioCraft models, by comparison, produce high-quality audio with long-term consistency through a more user-friendly interface.
For audio generation, AudioCraft takes a two-pronged approach. First, it uses the EnCodec neural audio codec to learn discrete audio tokens from the raw signal, giving it a fixed “vocabulary” for music samples. It then trains autoregressive language models over these discrete tokens to generate new tokens, and hence new sounds and music. EnCodec, trained to compress any audio and reconstruct the original signal with high fidelity, uses an autoencoder with a residual vector quantization bottleneck to produce several parallel streams of audio tokens drawn from a fixed vocabulary; together, these streams enable high-fidelity reconstruction of the original audio.
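To make the tokenization step concrete, the sketch below compresses a waveform into parallel streams of discrete codes and reconstructs it. It assumes the standalone `encodec` package released by Meta rather than the AudioCraft wrapper, and the input file name and bandwidth setting are placeholders.

```python
# A minimal sketch of EnCodec's tokenize-then-reconstruct round trip,
# assuming the standalone `encodec` package (pip install encodec);
# 'input.wav' is a placeholder for any audio file.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model and pick a target bandwidth.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load an audio file and convert it to the model's sample rate and channels.
wav, sr = torchaudio.load('input.wav')
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    # Encode into parallel streams of discrete tokens from a fixed vocabulary.
    encoded_frames = model.encode(wav)
    codes = torch.cat([codebook for codebook, _ in encoded_frames], dim=-1)
    print(codes.shape)  # (batch, num_codebooks, time_steps)

    # Decode the token streams back into a waveform.
    reconstructed = model.decode(encoded_frames)
```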
MusicGen, one of the underlying models in AudioCraft, is tailored specifically for music generation. Trained on approximately 400,000 recordings with text descriptions and metadata, amounting to 20,000 hours of music, MusicGen excels at generating samples that remain coherent over long-term structure, a crucial requirement for creating novel musical pieces.
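The sketch below shows what prompting MusicGen for longer clips might look like, again assuming the interface documented in the AudioCraft repository; the checkpoint name, duration, and prompts are illustrative rather than prescribed.

```python
# A minimal MusicGen sketch, assuming the interface documented in the
# AudioCraft repository; checkpoint name and prompts are illustrative.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pretrained MusicGen checkpoint.
model = MusicGen.get_pretrained('facebook/musicgen-small')

# A longer duration exercises the model's long-term musical structure.
model.set_generation_params(duration=30)

descriptions = [
    'upbeat electronic track with a driving bassline',
    'calm acoustic guitar melody with soft percussion',
]
wav = model.generate(descriptions)  # one waveform tensor per prompt

for i, one_wav in enumerate(wav):
    # Write each generated piece to disk with loudness normalization.
    audio_write(f'musicgen_sample_{i}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```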
While AudioCraft has made substantial advances, Meta’s research team continues to push further. Future work focuses on improving the models’ speed, efficiency, and controllability. Beyond opening up new scenarios and possibilities, the team plans to explore additional conditioning methods, push the models’ ability to capture even longer-range dependencies, and investigate the limitations and biases of models trained on audio.
Meta says it is committed to responsibility and transparency in its research. Acknowledging the lack of diversity in the datasets used to train the models, the company says it intends to address this issue. By sharing the code for AudioCraft, Meta aims to make it easier for other researchers to test new approaches for mitigating potential bias in, and misuse of, generative models.
Emphasizing the importance of open source, the company is providing open access to its research and models. It has released model cards detailing how AudioGen and MusicGen were built, following its Responsible AI practices, and argues that this open-source foundation will foster innovation and complement how audio and music are produced and listened to in the future.
Meta is not the only company to release music-generation models. In January 2023, Google introduced MusicLM, a foundation model that can generate high-fidelity music from text descriptions. Access to the model is currently available through a waitlist at Google’s AI Test Kitchen.
AudioCraft marks an important step forward in generative AI research, with its ability to generate robust, coherent, and high-quality audio samples. The work could also influence the development of human-computer interaction models that incorporate auditory and multi-modal interfaces. As AudioCraft continues to evolve, it may eventually mature enough to produce background music for movies.