As LLMs advance, video generation is emerging as the next frontier. OpenAI’s Sora has impressed with its hyper-realistic output. Here, we present some compelling alternatives that you can use and experiment with.
RunwayML Gen-2
RunwayML Gen-2 allows users to create entire worlds, animations, and stories simply by providing text descriptions. Users can also experiment with reference images, utilising various prompting modes and advanced settings to fine-tune their creative process.
The recent addition of the Multi-Motion Brush enhances control over motion within generated videos. Gen-2 is accessible on both the Runway web platform and their mobile app, providing flexibility for creative endeavours on the go.
Users can preview and download generated videos, selecting the one that best aligns with their vision. The main consideration is cost: Gen-2 operates on a credit system, with each second of video generation priced at $0.05.
Pika
Pika Labs is an AI text-to-video tool that enables users to create videos and animations from simple text prompts. Pika can generate videos in various styles, ranging from cartoons and anime to cinematic formats. Not confined solely to text-to-video conversion, Pika can also transform images into videos and perform video-to-video conversions.
Recently, Pika introduced a lip-sync feature, allowing users to add voice to characters, with Pika seamlessly syncing words to their movements. Additional features include ‘modify region’ and ‘expand canvas’.
Lumiere
Google’s Lumiere is the closest competitor to Sora: it, too, creates realistic and coherent videos directly from textual descriptions, with durations of up to five seconds.
In contrast to many text-to-video models that generate videos frame-by-frame, Lumiere employs a Space-Time Diffusion Model. This approach allows Lumiere to generate the entire video’s duration in one go, ensuring better coherence and consistency throughout.
Lumiere stands out with unique features, including image-to-video generation, stylised generation, cinemagraphs, and inpainting, setting it apart from other models in terms of versatility and customisation options.
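For intuition only, here is a toy sketch of the structural difference between frame-by-frame generation and Lumiere’s whole-clip, space-time approach; the ‘denoiser’ below is a trivial placeholder, not Lumiere’s actual model or code.

```python
import numpy as np

def toy_denoise(x: np.ndarray, steps: int = 10) -> np.ndarray:
    """Placeholder 'denoiser' that just nudges values towards their mean.
    A real diffusion model would use a trained neural network instead."""
    for _ in range(steps):
        x = 0.9 * x + 0.1 * x.mean()
    return x

T, H, W = 16, 64, 64                       # frames, height, width
noisy_clip = np.random.randn(T, H, W)

# Frame-by-frame generation: each frame is denoised independently, so
# temporal consistency has to be imposed separately afterwards.
per_frame = np.stack([toy_denoise(noisy_clip[t]) for t in range(T)])

# Space-time generation (Lumiere's idea, conceptually): the whole
# (T, H, W) volume is denoised in one pass, so every step sees all frames.
whole_clip = toy_denoise(noisy_clip)

print(per_frame.shape, whole_clip.shape)   # (16, 64, 64) (16, 64, 64)
```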
Imagen Video
Imagen Video from Google is a text-conditional video generation system based on a cascade of video diffusion models. This model can produce 1280×768 videos at 24 frames per second. Not only does the model create top-notch videos, but it also offers a high level of control and a broad understanding of the world.
It can produce a variety of videos and text animations in different artistic styles, showcasing a solid grasp of 3D objects.
Emu Video
Meta’s Emu Video allows you to create short videos based on text descriptions. It utilises a diffusion-model approach: starting from a noisy image, the model progressively refines it, guided by the text prompt, until the final video emerges.
It employs a two-step process: first, an image is generated from the text prompt; then, conditioned on both that image and the prompt, the model creates a multi-frame video.
This model produces visually striking 512×512, four-second videos at 16 frames per second, outperforming models such as Make-A-Video, Imagen Video, CogVideo, Gen-2 and Pika.
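To make the two-step factorisation concrete, here is a minimal sketch of the flow; Meta has not released Emu Video’s code, so the text_to_image and image_to_video helpers below are hypothetical stand-ins that only mimic the shapes involved (512×512, four seconds at 16 fps = 64 frames).

```python
import numpy as np

# Hypothetical stand-ins for Emu Video's two models; they return
# placeholder tensors of the right shape rather than real content.
def text_to_image(prompt: str, size: int = 512) -> np.ndarray:
    """Step 1: a text-conditioned image model produces a single keyframe."""
    return np.zeros((size, size, 3), dtype=np.float32)

def image_to_video(keyframe: np.ndarray, prompt: str,
                   num_frames: int = 64) -> np.ndarray:
    """Step 2: a video model, conditioned on the keyframe and the original
    prompt, extends it into a clip (4 s at 16 fps = 64 frames)."""
    return np.repeat(keyframe[None, ...], num_frames, axis=0)

prompt = "a corgi surfing a wave at sunset"
keyframe = text_to_image(prompt)          # (512, 512, 3)
clip = image_to_video(keyframe, prompt)   # (64, 512, 512, 3)
print(clip.shape)
```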
CogVideo
A team of researchers from Tsinghua University in Beijing has introduced CogVideo, a large-scale pretrained text-to-video generative model. CogVideo employs a multi-frame-rate hierarchical training strategy and builds upon CogView2, a pretrained text-to-image model.
VideoPoet
VideoPoet is an LLM developed by Google Research specifically for video generation. It can generate two-second videos based on various input formats, including text descriptions, existing images, videos, and audio clips.
VideoPoet offers some level of control over the generation process: you can experiment with different text prompts and reference images, or adjust specific settings to refine the final video output. It also offers features such as zero-shot stylisation and the application of visual effects.
Stable Video Diffusion
Stable Video Diffusion from Stability AI is an open-source tool that transforms text and image inputs into vivid scenes, elevating concepts into live-action cinematic creations. It is released as two image-to-video models that generate 14 and 25 frames respectively, with customisable frame rates from 3 to 30 frames per second.
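Because the weights are openly available, you can run them locally. A minimal sketch using the Hugging Face diffusers pipeline is shown below; the img2vid-xt checkpoint is the 25-frame model, and the input image path is illustrative, so adjust the dtype, device and parameters to your hardware.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the 25-frame image-to-video model; use
# "stabilityai/stable-video-diffusion-img2vid" for the 14-frame variant.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Condition the generation on a single input image (path is illustrative).
image = load_image("input.png").resize((1024, 576))

# decode_chunk_size trades VRAM for speed; fps acts as a conditioning signal.
frames = pipe(image, decode_chunk_size=8, fps=7).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```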
Make-A-Video
Developed by Meta AI, Make-A-Video translates progress in Text-to-Image (T2I) generation to Text-to-Video (T2V) without requiring text-video data. It learns visual and multimodal representations from paired text-image data and motion from unsupervised video footage.
MagicVideo-V2
ByteDance’s MagicVideo-V2 is the successor to MagicVideo, an efficient video-generation framework based on latent diffusion models. MagicVideo-V2 integrates text-to-image, image-to-video, video-to-video, and video frame interpolation modules into a single pipeline, providing a new strategy for generating smooth and highly aesthetic videos.
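ByteDance has not released MagicVideo-V2’s code, so the following is only an illustrative sketch of how the four modules listed above could chain together; every function name and shape here is a hypothetical stand-in.

```python
import numpy as np

# Hypothetical stage functions standing in for MagicVideo-V2's modules.
def text_to_image(prompt):                   # stage 1: keyframe from text
    return np.zeros((1024, 1024, 3), dtype=np.float32)

def image_to_video(keyframe, prompt, n=16):  # stage 2: low-frame-rate clip
    return np.repeat(keyframe[None, ...], n, axis=0)

def refine_video(clip, prompt):              # stage 3: video-to-video refinement
    return clip

def interpolate_frames(clip, factor=2):      # stage 4: frame interpolation
    # Stand-in: simply duplicates frames; a real module would
    # synthesise intermediate frames for smoother motion.
    return np.repeat(clip, factor, axis=0)

prompt = "a paper boat drifting down a rainy street"
keyframe = text_to_image(prompt)             # (1024, 1024, 3)
clip = image_to_video(keyframe, prompt)      # (16, 1024, 1024, 3)
clip = refine_video(clip, prompt)
clip = interpolate_frames(clip)              # (32, 1024, 1024, 3)
print(clip.shape)
```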