mirror of
https://modelscope.cn/models/iic/Image-to-Video
synced 2026-04-02 11:32:53 +08:00
Update README.md
README.md: 20 additions, 20 deletions
@@ -41,13 +41,13 @@ widgets:
 # I2VGen-XL高清图像生成视频大模型
 
-本项目**I2VGen-XL**旨在解决根据输入图像生成高清视频的任务。**I2VGen-XL**是达摩院研发的高清视频生成基础模型之一,其核心部分包含两个阶段,分别解决语义一致性和清晰度的问题,参数量共计约37亿。模型经过大规模视频和图像数据的混合预训练,并在少量精品数据上微调得到;该数据分布广泛、类别多样化,模型对不同的数据均有良好的泛化性。相比于现有视频生成模型,**I2VGen-XL**在清晰度、质感、语义、时序连续性等方面均具有明显的优势。
+本项目**Image-to-Video**旨在解决根据输入图像生成高清视频的任务。**Image-to-Video**是达摩院研发的高清视频生成基础模型之一,其核心部分包含两个阶段,分别解决语义一致性和清晰度的问题,参数量共计约37亿。模型经过大规模视频和图像数据的混合预训练,并在少量精品数据上微调得到;该数据分布广泛、类别多样化,模型对不同的数据均有良好的泛化性。相比于现有视频生成模型,**I2VGen-XL**在清晰度、质感、语义、时序连续性等方面均具有明显的优势。
 
-此外,**I2VGen-XL**的许多设计理念和设计细节(比如核心的UNet部分)都继承于我们已经公开的工作**VideoComposer**,您可以参考[VideoComposer](https://videocomposer.github.io)和本项目的[ModelScope](https://github.com/modelscope/modelscope)代码库了解详细细节。
+此外,**Image-to-Video**的许多设计理念和设计细节(比如核心的UNet部分)都继承于我们已经公开的工作**VideoComposer**,您可以参考[VideoComposer](https://videocomposer.github.io)和本项目的[ModelScope](https://github.com/modelscope/modelscope)代码库了解详细细节。
 
-The **I2VGen-XL** project aims to address the task of HD video generation from input images. **I2VGen-XL** is one of the high-quality video generation base models developed by DAMO Academy. Its core consists of two stages, which address semantic consistency and video quality respectively, with approximately 3.7 billion parameters in total. The model was pre-trained on a large-scale mixture of video and image data and fine-tuned on a small amount of high-quality data; this data is extensive and diverse, and the model generalizes well to different types of data. Compared to existing video generation models, the **I2VGen-XL** project has significant advantages in quality, texture, semantics, and temporal continuity.
+The **Image-to-Video** project aims to address the task of HD video generation from input images. **Image-to-Video** is one of the high-quality video generation base models developed by DAMO Academy. Its core consists of two stages, which address semantic consistency and video quality respectively, with approximately 3.7 billion parameters in total. The model was pre-trained on a large-scale mixture of video and image data and fine-tuned on a small amount of high-quality data; this data is extensive and diverse, and the model generalizes well to different types of data. Compared to existing video generation models, the **I2VGen-XL** project has significant advantages in quality, texture, semantics, and temporal continuity.
 
-Additionally, many design concepts and details of **I2VGen-XL** (such as the core UNet) are inherited from our publicly available work **VideoComposer**. For detailed information, please refer to [VideoComposer](https://videocomposer.github.io) and the GitHub code repository of this [ModelScope](https://github.com/modelscope/modelscope) project.
+Additionally, many design concepts and details of **Image-to-Video** (such as the core UNet) are inherited from our publicly available work **VideoComposer**. For detailed information, please refer to [VideoComposer](https://videocomposer.github.io) and the GitHub code repository of this [ModelScope](https://github.com/modelscope/modelscope) project.
 
 <center>
 <p align="center">
@@ -59,7 +59,7 @@ Additionally, many design concepts and details of **I2VGen-XL** (such as the core UNet) are inherited from our publicly available work **VideoComposer**.
 ## 模型介绍 (Introduction)
 
-如图Fig.2所示,**I2VGen-XL**是一种基于隐空间的视频扩散模型(VLDM),它通过我们专门设计的时空UNet(ST-UNet)在隐空间中进行时空建模,再通过解码器重建出最终视频(具体模型结构可参考[VideoComposer](https://videocomposer.github.io))。为能够生成720P视频,我们将**I2VGen-XL**分为两个阶段:第一阶段在低分辨率条件下保证语义一致性,第二阶段利用新的VLDM进行去噪,以提高视频分辨率并同时提升时间和空间上的一致性。通过在模型、数据和训练上的联合优化,**I2VGen-XL**主要具有以下几个特点:
+如图Fig.2所示,**Image-to-Video**是一种基于隐空间的视频扩散模型(VLDM),它通过我们专门设计的时空UNet(ST-UNet)在隐空间中进行时空建模,再通过解码器重建出最终视频(具体模型结构可参考[VideoComposer](https://videocomposer.github.io))。为能够生成720P视频,我们将**Image-to-Video**分为两个阶段:第一阶段在低分辨率条件下保证语义一致性,第二阶段利用新的VLDM进行去噪,以提高视频分辨率并同时提升时间和空间上的一致性。通过在模型、数据和训练上的联合优化,**Image-to-Video**主要具有以下几个特点:
 
 - 高清&宽屏,可以直接生成720P(1280*720)分辨率的视频;相比于现有的开源项目,不仅分辨率得到有效提高,其生成的宽屏视频也可以适用于更多场景
 - 连续性,通过特定的训练和推理策略,在视频细节生成的稳定性上(时间和空间维度)有明显提高
@@ -68,7 +68,7 @@ Additionally, many design concepts and details of **I2VGen-XL** (such as the core UNet) are inherited from our publicly available work **VideoComposer**.
 以下为生成的部分案例:
 
-As shown in Fig.2, **I2VGen-XL** is a video latent diffusion model (VLDM). It utilizes our purpose-designed spatio-temporal UNet (ST-UNet; for model details, please refer to [VideoComposer](https://videocomposer.github.io)) to perform spatio-temporal modeling in the latent space and reconstructs the generated video through a decoder. To generate 720P videos, we divide I2VGen-XL into two stages: the first stage ensures semantic consistency at low resolution, while the second stage utilizes the new VLDM to denoise, improve video resolution, and enhance temporal and spatial consistency. Through joint optimization of the model, data, and training, **I2VGen-XL** has the following characteristics:
+As shown in Fig.2, **Image-to-Video** is a video latent diffusion model (VLDM). It utilizes our purpose-designed spatio-temporal UNet (ST-UNet; for model details, please refer to [VideoComposer](https://videocomposer.github.io)) to perform spatio-temporal modeling in the latent space and reconstructs the generated video through a decoder. To generate 720P videos, we divide Image-to-Video into two stages: the first stage ensures semantic consistency at low resolution, while the second stage utilizes the new VLDM to denoise, improve video resolution, and enhance temporal and spatial consistency. Through joint optimization of the model, data, and training, **Image-to-Video** has the following characteristics:
 
 - High-definition & widescreen: can directly generate 720P (1280*720) resolution videos; compared to existing open-source projects, not only is the resolution effectively improved, but the widescreen videos it produces are also suitable for more scenarios.
 - Continuity: through specific training and inference strategies, there is a significant improvement in the stability of detail generation in videos (in both the time and space dimensions).
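The two-stage cascade described in the hunk above (a low-resolution stage for semantic consistency, then a refinement stage for resolution and spatio-temporal consistency) can be illustrated with a toy data-flow sketch. This is purely illustrative: the real stages are learned latent diffusion models plus a decoder, and every function below is a hypothetical placeholder, not I2VGen-XL code.

```python
# Toy sketch of a two-stage image-to-video cascade (assumed structure only):
# stage 1 "generates" a low-resolution video tied to the input image, and
# stage 2 "refines" it to a higher resolution. Simple list transforms stand
# in for the learned diffusion stages so the data flow is easy to follow.

def stage1_semantic(image, num_frames=4, low_res=(8, 8)):
    """Stage 1 (placeholder): a low-res video that stays semantically
    tied to the input image -- here, a downsampled copy per frame."""
    h, w = low_res
    src_h, src_w = len(image), len(image[0])
    frame = [[image[i * src_h // h][j * src_w // w] for j in range(w)]
             for i in range(h)]
    return [[row[:] for row in frame] for _ in range(num_frames)]

def stage2_refine(video, scale=2):
    """Stage 2 (placeholder): raise each frame's resolution -- here,
    nearest-neighbour upsampling instead of diffusion denoising."""
    def upsample(frame):
        return [[px for px in row for _ in range(scale)]
                for row in frame for _ in range(scale)]
    return [upsample(f) for f in video]

image = [[(i + j) % 256 for j in range(16)] for i in range(16)]  # fake 16x16 input
video = stage2_refine(stage1_semantic(image))
print(len(video), len(video[0]), len(video[0][0]))  # 4 frames, 16x16 each
```

In the actual model both placeholders would be replaced by diffusion sampling in latent space followed by a decoder, as the surrounding text describes.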
@@ -285,9 +285,9 @@ sudo apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
 ```
 
 
-其次,本**I2VGen-XL**项目适配ModelScope代码库,以下是本项目需要安装的部分依赖项。
+其次,本**Image-to-Video**项目适配ModelScope代码库,以下是本项目需要安装的部分依赖项。
 
-The **I2VGen-XL** project is compatible with the ModelScope codebase, and the following are some of the dependencies that need to be installed for this project.
+The **Image-to-Video** project is compatible with the ModelScope codebase, and the following are some of the dependencies that need to be installed for this project.
 
 
 ```bash
@@ -352,7 +352,7 @@ Please visit <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">
 ### 模型局限 (Limitation)
 
 
-目前,我们发现**I2VGen-XL**方法在处理以下情况时会存在一定的局限性:
+目前,我们发现**Image-to-Video**方法在处理以下情况时会存在一定的局限性:
 - 小目标生成能力有限,在生成较小目标的时候,会存在一定的错误
 - 快速运动目标生成能力有限,当生成快速运动目标时,可能会出现一些假象和不合理的情况
 - 生成速度较慢,生成高清视频会明显导致生成速度减慢
@@ -360,7 +360,7 @@ Please visit <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">
 此外,我们研究也发现,生成视频在空间上的质量和时序上的变化速度在一定程度上存在互斥现象;在本项目中我们选择了折中的模型,兼顾两者间的平衡。
 
 
-Currently, we have found certain limitations of the I2VGen-XL method in handling the following situations:
+Currently, we have found certain limitations of the Image-to-Video method in handling the following situations:
 - Limited ability to generate small objects. There may be some errors when generating smaller objects.
 - Limited ability to generate fast-moving objects. There may be some artifacts when generating fast-moving objects.
 - Slow generation speed. Generating high-definition videos significantly slows down the generation speed.