– Google has developed a new large language model (LLM) called VideoPoet for video generation tasks.
– VideoPoet is trained on 270 million videos and over 1 billion text-and-image pairs from the internet.
– VideoPoet uses an LLM based on the transformer architecture, rather than the diffusion approach common in video generation.
– It can generate longer, higher quality clips with more consistent motion compared to other video generation models.
– VideoPoet offers a seamless, all-in-one solution for video creation by integrating multiple video generation capabilities.
– Human evaluators preferred VideoPoet clips over clips generated by other models.
– VideoPoet is tailored to produce videos in portrait orientation, catering to the mobile video marketplace.
– Google Research plans to expand VideoPoet’s capabilities to support other generation tasks in the future.
– VideoPoet is not currently available for public usage.
Just yesterday, I asked if Google would ever get an AI product release right on the first try. Consider that asked and answered — at least, going by the looks of its latest research.
This week, Google showed off VideoPoet, a new large language model (LLM) designed for a variety of video generation tasks from a team of 31 researchers at Google Research.
The fact that the Google Research team built an LLM for these tasks is notable in and of itself. As they write in their preprint research paper: “Most existing models employ diffusion-based methods that are often considered the current top performers in video generation. These video models typically start with a pretrained image model, such as Stable Diffusion, that produces high-fidelity images for individual frames, and then fine-tune the model to improve temporal consistency across video frames.”
By contrast, instead of building on a diffusion model like the popular and controversial open source image generator Stable Diffusion, the Google Research team chose an LLM, a different type of AI model based on the transformer architecture and typically used for text and code generation, as in ChatGPT, Claude 2, or Llama 2. Rather than training it to produce text and code, the team trained it to generate videos.
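The core idea is that a transformer LLM generates output one token at a time, whether those tokens stand for words or for patches of video. The toy sketch below illustrates that autoregressive loop with a stand-in bigram lookup table; all names and values are illustrative, not VideoPoet's actual model or tokenizer.

```python
# Toy sketch of the autoregressive idea behind an LLM-style video generator.
# The "model" here is a hypothetical bigram table; a real system would use a
# large transformer over learned video tokens.

def generate_tokens(model, prompt, length):
    """Greedily extend a token sequence one token at a time, the same way
    a text LLM extends a sentence -- except these token ids would stand
    for pieces of video frames rather than words."""
    seq = list(prompt)
    for _ in range(length):
        context = seq[-1]                   # toy context: just the last token
        next_token = model.get(context, 0)  # most likely continuation
        seq.append(next_token)
    return seq

# Stand-in "model": maps each token to its most likely successor.
toy_model = {1: 2, 2: 3, 3: 1}

print(generate_tokens(toy_model, prompt=[1], length=5))
# → [1, 2, 3, 1, 2, 3]
```

The same loop that continues a sentence can, in principle, continue a video, which is why motion can stay coherent across many frames instead of being fine-tuned in after the fact.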
Pre-training was key
They did this by heavily “pre-training” the VideoPoet LLM on 270 million videos and more than 1 billion text-and-image pairs from “the public internet and other sources” — specifically, by turning that data into text embeddings, visual tokens, and audio tokens, on which the AI model was “conditioned.”
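One common way to condition a single transformer on several modalities at once is to flatten everything into one token stream, with sentinel tokens marking where each modality begins. The sketch below shows that idea with entirely hypothetical token names; it is not VideoPoet's actual data pipeline.

```python
# Illustrative sketch (hypothetical tokens) of flattening multimodal data
# into one sequence for transformer pre-training.

def build_training_sequence(text_tokens, visual_tokens, audio_tokens):
    """Concatenate each modality behind a sentinel token so the model can
    tell text, video, and audio apart within a single sequence -- one way
    to 'condition' one transformer on all three modalities."""
    BOS, TEXT, VISION, AUDIO = "<bos>", "<text>", "<vision>", "<audio>"
    return ([BOS, TEXT] + text_tokens
            + [VISION] + visual_tokens
            + [AUDIO] + audio_tokens)

seq = build_training_sequence(
    text_tokens=["a", "cat", "runs"],     # from a text tokenizer
    visual_tokens=["v17", "v502", "v3"],  # from a video tokenizer
    audio_tokens=["a9", "a44"],           # from an audio codec
)
print(seq)
```

Because the model only ever sees one stream of tokens, the same next-token training objective used for text can cover video and audio as well.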
Longer, higher quality clips with more consistent motion
Beyond that, the Google Research team notes that its LLM-based approach may allow for longer, higher-quality clips, avoiding some of the constraints of current diffusion-based video-generating AIs, in which the movement of subjects tends to break down or turn glitchy after just a few frames.
“One of the current bottlenecks in video generation is in the ability to produce coherent large motions,” two of the team members, Dan Kondratyuk and David Ross, wrote in a Google Research blog post announcing the work. “In many cases, even the current leading models either generate small motion or, when producing larger motions, exhibit noticeable artifacts.”
By contrast, judging by the examples the researchers posted online, VideoPoet can generate larger and more consistent motion across longer, 16-frame videos. It also offers a wider range of capabilities right from the jump, including simulating different camera motions and different visual and aesthetic styles, and even generating new audio to match a given video clip. It likewise handles a range of inputs, including text, images, and videos, as prompts.
Integrating all these video generation capabilities within a single LLM, VideoPoet eliminates the need for multiple, specialized components, offering a seamless, all-in-one solution for video creation.
In fact, viewers surveyed by the Google Research team preferred it. The researchers showed an unspecified number of “human raters” clips generated by VideoPoet alongside clips from the video generation models Show-1, VideoCrafter, and Phenaki, two at a time side by side. The human evaluators largely rated the VideoPoet clips as superior.
As summarized in the Google Research blog post: “On average people selected 24–35% of examples from VideoPoet as following prompts better than a competing model vs. 8–11% for competing models. Raters also preferred 41–54% of examples from VideoPoet for more interesting motion than 11–21% for other models.” You can see the results displayed in a bar chart format below as well.
Built for vertical video
Google Research has tailored VideoPoet to produce videos in portrait orientation, or “vertical video,” by default, catering to the mobile video marketplace popularized by Snap and TikTok.
Looking ahead, Google Research envisions expanding VideoPoet’s capabilities to support “any-to-any” generation tasks, such as text-to-audio and audio-to-video, further pushing the boundaries of what’s possible in video and audio generation.
There’s only one problem I see with VideoPoet right now: it’s not currently available for public use. We’ve reached out to Google for more information on when it might become available and will update when we hear back. Until then, we’ll have to wait eagerly for its arrival to see how it really compares to other tools on the market.
VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Discover our Briefings.