Runway AI Inc., a startup specializing in artificial intelligence, has announced its latest system, Gen-2, which generates video clips from text descriptions. The system can create short videos, a few seconds in length, based on a few words typed by the user.
The user types a description of what they want to see, and the system generates a three-second clip showing exactly that, or something close to it. Users can also attach an image as a reference point. Access to Gen-2 runs through a waitlist: users can sign up for a private Discord channel that the company plans to expand over time.
The announcement marks a significant advance in text-to-video generation outside of a laboratory setting. Last year, both Google and Meta Platforms Inc. demonstrated their own text-to-video efforts, showcasing short clips of a teddy bear washing dishes or a sailboat on a lake. However, neither company has announced plans to move its system beyond the research phase.
Runway has been working on AI tools since 2018 and raised $50 million last year. The startup helped create the original version of Stable Diffusion, a text-to-image AI model that has since been popularized and further developed by Stability AI. In a live demonstration last week, co-founder and CEO Cris Valenzuela put Gen-2 to the test with a request for a “drone shot of a desert landscape.” Within a few minutes, Gen-2 generated a short video that was unmistakably a drone shot of a desert landscape, complete with a blue sky, clouds on the horizon, and the sun in the upper-right corner of the frame.
Several other videos generated by Runway show the system’s strengths and weaknesses. A close-up of an eyeball looks sharp and humanlike, while a clip of a hiker in the jungle reveals lingering trouble with realistic-looking legs and walking motions. The model has not yet figured out how to accurately portray moving objects, Valenzuela said: “You can generate a car chase, but sometimes the cars can fly.”
Although longer prompts can yield more detailed images in a text-to-image model like DALL-E or Stable Diffusion, Valenzuela said that simpler prompts work better with Gen-2.
The tool builds on an existing model called Gen-1, which Runway began testing on Discord in February; Valenzuela said it now has thousands of users. Gen-1 requires users to upload an image as the input source, which is combined with the text prompt to generate a three-second silent video. For example, a user could upload a photo of a cat chasing a toy along with the text “cute crochet style,” and Gen-1 would produce a video of a crocheted cat chasing a toy.
Videos created with Gen-2 are also silent, but Valenzuela said the company is researching audio generation, hoping to eventually create a system that produces both video and sound. Gen-2 has many potential applications, from short marketing videos to visual effects for films and TV shows. While the system may not be perfect yet, Runway is excited about the possibilities and plans to keep improving it.