Open-Sora 2.0 Explained: Architecture, Training, and Why It Matters

How Open-Sora 2.0 Built Sora-Level Video AI for $200K

Watch the video!

Last week, I was at GTC, Nvidia’s annual event, and I had the chance to check out some incredible new technology. One initiative that really caught my attention is a fully open-source video generator called Open‑Sora. They managed to train an end-to-end video generator, one that takes text and generates a short video from it, for just $200,000. Okay, $200,000 is a lot of money, but it’s quite low compared to what OpenAI’s Sora or other state-of-the-art video generation models cost, like Runway and the others I covered on my channel, which require millions of dollars to train for similar results.

Before diving into how they achieved that, let’s begin by understanding the problem itself. Text-to-video generation isn’t like generating a single image from text; it’s about creating a sequence of images that flow together seamlessly over time. You have to capture not only all the fine spatial details of a scene but also ensure that the motion stays smooth and realistic over time. This added temporal dimension introduces an entirely new layer of complexity and cost, mainly because these AI systems don’t understand time. They only get tokens, which are either our words or pixels. They don’t have the understanding of the laws of physics that humans develop through trial and error as babies. They only have access to our world through tokens, which makes temporal consistency in video extremely difficult to get right.

There are essentially two approaches to tackle this problem. The first is to train a model directly to convert text into video, which means the model has to learn both how to generate high‑quality images and how to stitch them together into coherent motion in one go, without glitches or artifacts. Of course, this is ideal, and it’s what we want to end up with, but it runs into the same challenges I just mentioned. The second approach takes a detour, simplifying the problem into a two‑step process: first, you train a model to generate a high‑quality image from a text prompt, and then you use that model and the generated image as a conditioning signal to generate a video. Open‑Sora 2.0 adopts the second approach because it leverages mature techniques from image generation instead of training a whole end-to-end pipeline from scratch. By starting with a robust, pre‑trained text-to-image model, the system can focus on adding the motion dimension later and optimize each part step by step. This separation makes the training process much easier to manage, reduces the overall complexity, and significantly cuts down the required compute and data, as we will see.
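To make this split concrete, here is a minimal sketch of the two-step idea in Python. The function names (`text_to_image_model`, `image_to_video_model`, `generate_video`) are hypothetical placeholders to show the flow, not Open-Sora’s actual API.

```python
# Hypothetical two-step text-to-video pipeline (illustrative only).
# Step 1: a mature, pre-trained text-to-image model produces the first frame.
# Step 2: an image-to-video model animates that frame, guided by the same prompt.

def generate_video(prompt, text_to_image_model, image_to_video_model, num_frames=49):
    # Step 1: reuse a strong pre-trained text-to-image model (e.g., FLUX).
    first_frame = text_to_image_model(prompt)  # -> a single high-quality image

    # Step 2: condition the video model on both the prompt and the generated frame,
    # so it only has to learn motion, not scene content from scratch.
    video = image_to_video_model(
        prompt=prompt,
        reference_image=first_frame,
        num_frames=num_frames,
    )
    return video  # a sequence of frames that starts from the generated image
```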

Let’s dive into how Open‑Sora 2.0 is built and trained. 

The training pipeline is not just divided into two stages, but into three distinct stages, each carefully optimized to save compute, reduce cost, and deliver state‑of‑the‑art performance.

In the very first stage, the goal is to establish a robust text-to-video model at a low resolution of 256×256 pixels, just enough to produce something promising. Instead of starting from scratch or merely fine-tuning an image model, the team takes a different approach. They begin with the FLUX model, a powerful text-to-image diffusion model with 11 billion parameters that generates high‑quality images from text. FLUX works by encoding text prompts into deep semantic representations, which then condition a transformer-based diffusion process that progressively denoises a latent image until a final image is produced.
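To give a feel for what “conditioning a diffusion process” looks like, here is a toy latent-denoising loop: text embeddings condition a denoiser that is integrated with simple Euler steps from pure noise toward a clean latent. This is a generic sketch under my own assumptions (a velocity-predicting denoiser, Euler integration), not FLUX’s actual sampler, and all names are placeholders.

```python
import torch

# Toy latent-denoising loop (illustrative, not FLUX's actual sampler).
# A denoiser predicts a "velocity" pointing from noise toward a clean latent,
# conditioned on text embeddings, and we integrate it with simple Euler steps.

def denoise_latent(denoiser, text_emb, latent_shape, num_steps=30, device="cpu"):
    x = torch.randn(latent_shape, device=device)         # start from pure noise
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)   # 1 = all noise, 0 = clean
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        v = denoiser(x, t.expand(x.shape[0]).to(device), text_emb)  # predicted velocity
        x = x + (t_next - t) * v                           # Euler update toward the data
    return x                                               # final latent, ready to decode

# A dummy denoiser just to make the sketch runnable end to end.
dummy = lambda x, t, c: -x
latent = denoise_latent(dummy, text_emb=None, latent_shape=(1, 16, 32, 32))
print(latent.shape)  # torch.Size([1, 16, 32, 32])
```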

This initial FLUX model already has deep visual understanding capabilities, making it the perfect foundation. However, generating videos requires not just spatial coherence but temporal consistency — smooth, believable motion. To efficiently add this temporal dimension, Open-Sora uses an advanced architecture inspired by MM-DiT (or Multimodal Diffusion Transformer).

In short, MM-DiT processes information in two main steps. First, it handles text and visual data separately using dedicated transformer streams: two distinct pipelines, one for text and another for images or video frames, each focusing independently on capturing the best possible representation of its data. For text encoding, MM-DiT leverages powerful pre-trained text models: CLIP-L provides strong baseline textual-visual alignment, and T5-XXL contributes deeper semantic understanding, particularly useful for capturing detailed and complex textual contexts. (The original MM-DiT design also uses a third encoder, CLIP-G, shown in the figure, which Open-Sora does not use.) These textual features are combined with the visual input (the Noised Latent here), representing images or frames with added noise, which the model progressively denoises to generate clear and coherent visual content.

MM-DiT architecture. (clearly visualized in the video!)

Next, MM-DiT introduces integrated transformer blocks, acting as bridges connecting these two separate streams. This allows a seamless bidirectional flow of information — text data informs visual generation, while visual features influence textual understanding. Essentially, it’s like two experts exchanging insights to improve their mutual understanding.

An important feature of MM-DiT is its learned modulation, meaning the model dynamically adjusts how much it relies on textual prompts versus intermediate visual cues at each step of generating the output. This dynamic adjustment significantly enhances the alignment between what the user writes (text prompts) and the visual results (generated images or videos), ensuring better quality and consistency.
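Putting the three ingredients together (separate streams, a joint attention bridge, and learned modulation), here is a simplified PyTorch sketch of an MM-DiT-style block. The dimensions, the single shared attention, and the AdaLN-style scale/shift modulation are illustrative assumptions, not Open-Sora’s exact implementation.

```python
import torch
import torch.nn as nn

class MMDiTBlockSketch(nn.Module):
    """Simplified MM-DiT-style block: two streams, joint attention, learned modulation."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Separate pre-norms for the text and visual streams.
        self.norm_txt = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm_vis = nn.LayerNorm(dim, elementwise_affine=False)
        # Learned modulation (AdaLN-style): a conditioning vector (e.g., the diffusion
        # timestep embedding) produces per-stream scale and shift parameters.
        self.mod_txt = nn.Linear(dim, 2 * dim)
        self.mod_vis = nn.Linear(dim, 2 * dim)
        # One attention over the concatenated token sequence, so text tokens can
        # attend to visual tokens and vice versa (the bidirectional "bridge").
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Separate feed-forward networks per stream.
        self.ff_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ff_vis = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, txt, vis, cond):
        # cond: (batch, dim) conditioning vector driving the learned modulation.
        s_t, b_t = self.mod_txt(cond).chunk(2, dim=-1)
        s_v, b_v = self.mod_vis(cond).chunk(2, dim=-1)
        txt_n = self.norm_txt(txt) * (1 + s_t.unsqueeze(1)) + b_t.unsqueeze(1)
        vis_n = self.norm_vis(vis) * (1 + s_v.unsqueeze(1)) + b_v.unsqueeze(1)

        # Joint attention over the concatenated text + visual tokens.
        joint = torch.cat([txt_n, vis_n], dim=1)
        attn_out, _ = self.joint_attn(joint, joint, joint)
        n_txt = txt.shape[1]
        txt = txt + attn_out[:, :n_txt]
        vis = vis + attn_out[:, n_txt:]

        # Per-stream feed-forward with residual connections.
        txt = txt + self.ff_txt(txt)
        vis = vis + self.ff_vis(vis)
        return txt, vis

# Tiny smoke test with random tensors.
block = MMDiTBlockSketch()
out_txt, out_vis = block(torch.randn(2, 16, 256), torch.randn(2, 64, 256), torch.randn(2, 256))
print(out_txt.shape, out_vis.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 64, 256])
```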

For training data in this initial stage, Open-Sora uses a substantial and carefully curated dataset of approximately 70 million short video samples, each at a resolution of 256×256 pixels. This dataset was obtained from various publicly available video sources, then filtered and processed rigorously to ensure high-quality, diverse content suitable for effective training. This first stage is the most expensive of the three, requiring 2,240 GPU days, or a bit over $100K. But now, we have a quite powerful model that takes a prompt and generates a nice, low-resolution video!

Stage two further simplifies the training process by transitioning from text-to-video generation to image-to-video generation, still at 256×256 pixels. In this stage, instead of relying solely on text prompts, the model learns how to extend a single image into a short video. To do this, they modify the conditioning method by encoding the initial image and concatenating it as extra information into the latent video representation. This adjustment lets the model learn explicit motion generation independently from the complexities of scene creation. By focusing purely on motion, since the scene content is already provided by a FLUX-generated image, training becomes faster, less data-intensive, and significantly cheaper. To ensure robustness, the team also introduces a dropout mechanism for the image conditioning: the model is sometimes forced to generate videos without an initial image, keeping its text-to-video capabilities sharp rather than totally dependent on the pre-generated image.
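Here is a hedged sketch of that conditioning scheme: the encoded first frame is broadcast over time, randomly dropped with some probability, and concatenated to the noised video latent along the channel axis. The exact layout (channel concatenation, zeroing the condition) is my assumption for illustration; names and shapes are placeholders.

```python
import torch

def build_model_input(noised_video_latent, image_latent, drop_prob=0.1):
    """Sketch of image-to-video conditioning with conditioning dropout.

    noised_video_latent: (batch, channels, frames, height, width) noised latent video.
    image_latent:        (batch, channels, height, width) latent of the first frame.
    """
    b, c, t, h, w = noised_video_latent.shape
    # Repeat the conditioning image latent across the temporal axis.
    cond = image_latent.unsqueeze(2).expand(b, c, t, h, w)

    # Conditioning dropout: sometimes remove the image so the model also learns
    # to generate videos from text alone (keeps text-to-video capability sharp).
    keep = (torch.rand(b, 1, 1, 1, 1) > drop_prob).float()
    cond = cond * keep

    # The denoiser then sees the noised latent and the (possibly dropped) condition.
    return torch.cat([noised_video_latent, cond], dim=1)

x = build_model_input(torch.randn(2, 16, 8, 32, 32), torch.randn(2, 16, 32, 32))
print(x.shape)  # torch.Size([2, 32, 8, 32, 32])
```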

For this stage, the team leverages a reduced dataset of around 10 million carefully selected, high-quality video samples, still at a resolution of 256×256 pixels, with similar rigorous preprocessing and additional filters applied to ensure consistency and quality. Also, by reusing images generated by the FLUX model, the computational requirements drop significantly, to only around 384 GPU days, or about $18K.

The third and final stage is all about refining this model and scaling it to high-resolution videos, specifically at 768×768 pixels. Directly jumping to high resolution would be prohibitively expensive, so Open-Sora employs a new Video Deep Compression Autoencoder, called Video DC-AE, inspired by the Deep Compression Autoencoder (DC-AE) approach originally built for images (which we covered in our latent diffusion models video) and adapted to videos by using 3D convolutions instead of 2D ones. Initially, the team started with the open-source HunyuanVideo VAE, which achieves a compression ratio of 4×8×8. While effective, this still meant processing around 115,000 tokens per training video, leading to considerable computational demands. Open-Sora therefore increased the spatial compression ratio by 4 times, from 8 to 32, while keeping the temporal compression ratio at 4. This drastically cuts the number of spatial tokens processed, compressing videos into a much smaller latent representation and greatly improving computational efficiency while preserving essential motion features.
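A quick back-of-the-envelope calculation shows why the higher spatial compression matters so much. The clip length and 2×2 patch size below are assumptions, so the absolute token counts won’t exactly match the report’s ~115,000 figure; the point is that 4× more spatial compression per axis means roughly 16× fewer tokens.

```python
# Back-of-the-envelope token count for a video latent (illustrative assumptions:
# 128-frame clip at 768x768, 2x2 patchification; real settings may differ).

def latent_tokens(frames, height, width, t_ratio, s_ratio, patch=2):
    lat_t = frames // t_ratio
    lat_h = height // s_ratio // patch
    lat_w = width // s_ratio // patch
    return lat_t * lat_h * lat_w

frames, height, width = 128, 768, 768

# HunyuanVideo-style VAE: 4x temporal, 8x spatial compression.
baseline = latent_tokens(frames, height, width, t_ratio=4, s_ratio=8)
# Video DC-AE: same 4x temporal, but 32x spatial compression.
dcae = latent_tokens(frames, height, width, t_ratio=4, s_ratio=32)

print(baseline, dcae, baseline / dcae)
# 73728 4608 16.0  -> 4x more spatial compression per axis = ~16x fewer tokens
```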

This compact latent representation serves as the input for the video generation model, which efficiently processes the combined text and image data within this compressed form. After generating the video in this latent representation, a decoder reconstructs the compressed representation back into high-resolution, visually coherent video frames, ensuring the final output maintains both visual clarity and temporal smoothness.

Open-Sora global architecture (clearly visualized in the video!)

To maintain video quality at this high resolution, Open-Sora uses a sophisticated classifier-free guidance strategy, where it separately adjusts how strongly the text and image conditions influence the generation process. Specifically, image guidance typically requires a smaller scale to prevent static outputs, while text guidance benefits from a larger scale for better semantic alignment. To optimize quality further, Open-Sora introduces a dynamic scaling approach for image guidance, which varies according to both the video frame and the denoising step: frames toward the end of the video need stronger image guidance to stay coherent with the conditioning image, whereas later denoising steps, where the video content is mostly formed, require less guidance. This balanced approach, combined with guidance oscillation (alternating guidance scales at different steps), helps maintain visual stability and reduce flickering. Additionally, Open-Sora explicitly models motion intensity through a dedicated motion score parameter. By adjusting this motion score during inference, users can control the video’s dynamism, striking the desired balance between minimal, high-fidelity motion and more dynamic, energetic scenes, which I hadn’t seen before and find quite cool!
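Here is a simplified sketch of guidance with separate text and image scales, plus an illustrative dynamic schedule for the image scale. The guidance decomposition follows the standard two-condition classifier-free guidance form; the specific schedule and oscillation formula are my own stand-ins, not Open-Sora’s exact recipe.

```python
import torch

def guided_prediction(pred_uncond, pred_img, pred_full, text_scale=7.5, image_scale=3.0):
    """Classifier-free guidance with separate text and image scales (sketch).

    pred_uncond: model output with neither text nor image conditioning.
    pred_img:    model output with only the image condition.
    pred_full:   model output with both text and image conditions.
    The image term gets a smaller scale (too much makes the video static),
    the text term a larger one (better semantic alignment).
    """
    return (pred_uncond
            + image_scale * (pred_img - pred_uncond)
            + text_scale * (pred_full - pred_img))

def dynamic_image_scale(frame_idx, num_frames, step_idx, num_steps,
                        base=3.0, oscillation=0.5):
    """Illustrative schedule: stronger image guidance for later frames (to stay
    coherent with the conditioning image), weaker guidance at later denoising
    steps (when content is mostly formed), plus a small oscillation between steps."""
    frame_term = 1.0 + frame_idx / max(num_frames - 1, 1)     # grows toward later frames
    step_term = 1.0 - 0.5 * step_idx / max(num_steps - 1, 1)  # decays over denoising
    wobble = oscillation if step_idx % 2 == 0 else -oscillation
    return base * frame_term * step_term + wobble

# Smoke test with random "model outputs".
u, i, f = torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8)
scale = dynamic_image_scale(frame_idx=7, num_frames=8, step_idx=3, num_steps=30)
print(guided_prediction(u, i, f, image_scale=scale).shape)  # torch.Size([1, 4, 8, 8])
```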

For this final stage, they further reduced the dataset to around 5 million carefully selected, high-quality video samples at 768×768 pixels, again with more filters, now requiring around 1,536 GPU days, or about $73K.

Data processing pipeline (image from the Open-Sora 2.0 report).

So, by splitting the training process, building on existing text-to-image strengths, moving to efficient motion modeling at low resolution, and finally refining at high resolution, Open-Sora 2.0 achieves state-of-the-art video generation quality comparable to much more expensive models, at a fraction of the cost.

Open-Sora 2.0’s talk at GTC was all about cost efficiency, which, I must admit, seemed quite impressive compared to what we are used to hearing online, and made me curious to hear more and make this video. While training comparable models like MovieGen and Step-Video-T2V typically costs millions of dollars, Open-Sora 2.0 achieved very similar performance levels with a training cost of “just” $200,000, renting H200 GPUs at $2 per hour for a total of 4,160 GPU days (just under 100,000 GPU hours). Obviously, it’s still expensive to train and own something this powerful, but they managed to cut the cost roughly tenfold and shared exactly how they did it.
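The headline number also checks out against the per-stage figures quoted above, as a quick sanity check at the $2/hour H200 rate:

```python
# Back-of-the-envelope check of the reported cost, using the per-stage GPU days
# quoted above and the $2/hour H200 rental rate.

gpu_days = {"stage1_t2v_256": 2240, "stage2_i2v_256": 384, "stage3_hires_768": 1536}
rate_per_hour = 2  # USD, rented H200s

total_days = sum(gpu_days.values())        # 4160 GPU days
total_hours = total_days * 24              # 99,840 GPU hours
total_cost = total_hours * rate_per_hour   # ~$199,680

print(total_days, total_hours, total_cost)  # 4160 99840 199680
```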

Best of all, it’s completely open source, giving everyone a chance to be part of this exciting revolution in video generation. I put all the links in the references below if you want to try it out or read more. 

Thank you for reading! 

I hope you found this article interesting, and I’ll see you in the next one!


All references:

Open-Sora 2.0 report: https://arxiv.org/pdf/2503.09642v1

Open-Sora Github: https://github.com/hpcaitech/Open-Sora

Open-Sora 1 paper: https://arxiv.org/pdf/2412.20404

Open-Sora HF model: https://huggingface.co/hpcai-tech/Open-Sora-v2

Flux: https://github.com/black-forest-labs/flux

HunyuanVideo: https://arxiv.org/pdf/2412.03603v1

GTC Open-Sora 2.0 talk: https://register.nvidia.com/flow/nvidia/gtcs25/ap/page/catalog/session/1727765669275001cZgX