27 Commits

Author SHA1 Message Date
cd8ed34e75 Update README.md 2023-12-15 03:16:18 +00:00
9efded242d Update README.md 2023-12-15 03:15:33 +00:00
6fa8d55de3 Update README.md 2023-12-15 03:13:26 +00:00
aafa9e654f Update README.md 2023-12-15 03:12:20 +00:00
50b7c35502 Update README.md 2023-09-06 23:45:51 +00:00
148e841911 Change background 2023-09-04 08:40:40 +00:00
796d679b28 Update README.md 2023-09-04 07:41:21 +00:00
e9894cb503 Update README.md 2023-09-04 07:35:30 +00:00
54df7534fa Update README.md 2023-09-04 07:34:10 +00:00
6286c1c897 Update README.md 2023-09-01 12:25:11 +00:00
08683b0c05 Update README.md 2023-09-01 11:11:38 +00:00
23290fd92a Update README.md 2023-08-25 08:05:11 +00:00
658f59a348 Update README.md 2023-08-25 08:02:42 +00:00
d88b970c4f Update README.md 2023-08-25 07:13:54 +00:00
dee731a30f Update README.md 2023-08-25 07:13:16 +00:00
2077c531a7 Update README.md 2023-08-25 07:04:42 +00:00
c7d38c38b0 Update README.md 2023-08-25 06:57:15 +00:00
73db542b43 Update README.md 2023-08-25 06:53:28 +00:00
b15a0a3091 Update README.md 2023-08-25 06:51:51 +00:00
ba65b78a0a Update README.md 2023-08-25 06:27:08 +00:00
6cd7bd2ab4 Update README.md 2023-08-25 06:24:16 +00:00
db1b4b5e7c Update README.md 2023-08-25 06:21:22 +00:00
dad3219222 Update README.md 2023-08-25 06:18:50 +00:00
a12a2fda43 Update README.md 2023-08-25 06:09:03 +00:00
6744cd3091 Update README.md 2023-08-25 05:59:37 +00:00
bc9f30c561 Update README.md 2023-08-25 03:51:49 +00:00
7d935ccca9 Update README.md 2023-08-25 02:28:54 +00:00
2 changed files with 61 additions and 38 deletions


@ -9,8 +9,6 @@ license: CC-BY-NC-ND
metrics:
- realism
- image-video similarity
studios:
- damo/Image-to-Video
tags:
- image2video generation
- diffusion model
@ -41,50 +39,50 @@ widgets:
task: image-to-video
---
# Image-to-Video
# Image-to-Video高清图像生成视频大模型
本项目**MS-Image2Video**旨在解决根据输入图像生成高清视频任务。**MS-Image2Video**由达摩院研发的高清视频生成基础模型其核心部分包含两个阶段分别解决语义一致性和清晰度的问题参数量共计约37亿模型经过在大规模视频和图像数据混合预训练并在少量精品数据上微调得到该数据分布广泛、类别多样化模型对不同的数据均有良好的泛化性。项目于现有视频生成模型,**MS-Image2Video**在清晰度、质感、语义、时序连续性等方面均具有明显的优势。
本项目**Image-to-Video**旨在解决根据输入图像生成高清视频任务。**Image-to-Video**由达摩院研发的高清视频生成基础模型之一其核心部分包含两个阶段分别解决语义一致性和清晰度的问题参数量共计约37亿模型经过在大规模视频和图像数据混合预训练并在少量精品数据上微调得到该数据分布广泛、类别多样化模型对不同的数据均有良好的泛化性。项目相比于现有视频生成模型,**Image-to-Video**在清晰度、质感、语义、时序连续性等方面均具有明显的优势。
此外,**MS-Image2Video**的许多设计理念继承于我们已经公开的工作**VideoComposer**,您可以参考我们的[VideoComposer](https://videocomposer.github.io)和本项目的Github代码库了解详细细节
此外,**Image-to-Video**的许多设计理念和设计细节比如核心的UNet部分继承于我们已经公开的工作**VideoComposer**,您可以参考我们的[VideoComposer](https://videocomposer.github.io)和本项目[ModelScope](https://github.com/modelscope/modelscope)的了解详细细节
The **MS-Image2Video** project aims to address the task of generating high-definition videos based on input images. Developed by Alibaba Cloud, the **MS-Image2Video** is a fundamental model for generating high-definition videos. Its core components consist of two stages that address the issues of semantic consistency and clarity, totaling approximately 3.7 billion parameters. The model is pre-trained on a large-scale mix of video and image data and fine-tuned on a small number of high-quality data sets with a wide range of distributions and diverse categories. The model demonstrates good generalization capabilities for different data types. Compared to existing video generation models, **MS-Image2Video** has significant advantages in terms of clarity, texture, semantics, and temporal continuity.
The **Image-to-Video** project aims to address the task of HD video generation based on input images. **Image-to-Video** is one of the HQ video generation base models developed by DAMO Academy. Its core consists of two stages, which address semantic consistency and video quality, respectively. The total number of parameters is approximately 3.7 billion. The model has been pre-trained on a large-scale mixture of video and image data and fine-tuned on a small amount of high-quality data. This data is widely distributed and diverse in category, and the model generalizes well to different types of data. Compared to existing video generation models, **Image-to-Video** has significant advantages in terms of quality, texture, semantics, and temporal continuity.
Additionally, many of the design concepts for **MS-Image2Video** are inherited from our publicly available work, **VideoComposer**. For detailed information, please refer to our [VideoComposer](https://videocomposer.github.io) and the Github code repository for this project.
Additionally, many design concepts and details of **Image-to-Video** (such as the core UNet) are inherited from our publicly available work, **VideoComposer**. For details, please refer to [VideoComposer](https://videocomposer.github.io) and the GitHub code repository of the [ModelScope](https://github.com/modelscope/modelscope) project.
<center>
<p align="center">
<img src="assets/image/Fig_twostage.png" style="max-width: none;"/>
<br/>
Fig.1 MS-Image2Video
Fig.1 Overall framework of I2VGen-XL.
<p>
</center>
## 模型介绍 (Introduction)
**MS-Image2Video**建立在Stable Diffusion之上如图Fig.2所示,通过专门设计的时空UNet在隐空间中进行时空建模通过解码器重建出最终视频。为能够生成720P视频我们将**MS-Image2Video**分为两个阶段,第一阶段保证语义一致性但低分辨率第二阶段通过DDIM逆运算并在新的VLDM进行去噪以提高视频分辨率以及同时提升时间和空间上的一致性。通过在模型、训练和数据上的联合优化,本项目主要具有以下几个特点:
如图Fig.2所示,**Image-to-Video**是一种基于隐空间的视频扩散模型(VLDM),其通过我们专门设计的时空UNet(ST-UNet)在隐空间中进行时空建模,然后通过解码器重建出最终视频(具体模型结构可以参考[VideoComposer](https://videocomposer.github.io))。为能够生成720P视频,我们将**Image-to-Video**分为两个阶段,第一阶段是在低分辨率条件下保证语义一致性,第二阶段是利用新的VLDM进行去噪以提高视频分辨率以及同时提升时间和空间上的一致性。通过在模型、数据和训练上的联合优化,**Image-to-Video**主要具有以下几个特点:
- 高清&宽屏可以直接生成720P(1280*720)分辨率的视频,且相比于现有的开源项目,不仅分辨率得到有效提高,其生产的宽屏视频可以适合更多的场景
- 无水印,模型通过我们内部大规模无水印视频/图像训练,并在高质量数据微调得到,生成的无水印视频可适用更多视频平台,减少许多限制
- 连续性,通过特定训练和推理策略,在视频的细节生成的稳定性上(时间和空间维度)有明显提高
- 质感好,通过收集特定的风格的视频数据训练,使得生成的模型在质感得到明显提升,可以生成科技感、电影色、卡通风格和素描等类型视频
- 质感好,通过收集特定的风格的视频数据训练,使得生成的视频在质感得到明显提升,可以生成科技感、电影色、卡通风格和素描等类型视频
- 无水印,模型通过我们内部大规模无水印视频/图像训练,并在高质量数据微调得到,生成的无水印视频可适用更多视频平台,减少许多限制
以下为生成的部分案例:
**MS-Image2Video** is built on Stable Diffusion, as shown in Fig.2, and uses a specially designed spatiotemporal UNet to perform spatiotemporal modeling in the latent space, and then reconstructs the final video through the decoder. In order to generate 720P videos, **MS-Image2Video** is divided into two stages. The first stage guarantees semantic consistency but with low resolution, while the second stage uses the DDIM inverse operation and applies denoising on a new VLDM to improve the resolution and spatiotemporal consistency of the video. Through joint optimization of the model, training, and data, this project has the following characteristics:
As shown in Fig.2, **Image-to-Video** is a video latent diffusion model (VLDM). It uses our specially designed spatio-temporal UNet (ST-UNet) to perform spatio-temporal modeling in the latent space and then reconstructs the final video through a decoder (for model details, please refer to [VideoComposer](https://videocomposer.github.io)). In order to generate 720P videos, we divide **Image-to-Video** into two stages. The first stage ensures semantic consistency at low resolution, while the second stage uses a new VLDM to denoise and improve video resolution, as well as enhance temporal and spatial consistency. Through joint optimization of the model, data, and training, **Image-to-Video** has the following characteristics:
- High-definition & widescreen, can directly generate 720P (1280*720) resolution videos, and compared to existing open source projects, not only is the resolution effectively improved, but the widescreen videos it produces can also be suitable for more scenarios.
- No watermark, the model is trained on a large-scale watermark-free video/image dataset internally and fine-tuned on high-quality data, generating watermark-free videos that can be applied to more video platforms and reducing many restrictions.
- Continuity, through specific training and inference strategies, there is a significant improvement in the stability of detail generation in videos (in the time and space dimensions).
- Good texture, by collecting video data in specific styles for training, the generated videos show a significant improvement in texture, and the model can generate videos in technology, cinematic color, cartoon, sketch, and other styles.
- No watermark, the model is trained on a large-scale watermark-free video/image dataset internally and fine-tuned on high-quality data, generating watermark-free videos that can be applied to more video platforms and reducing many restrictions.
Below are some examples generated by the model:
<center>
<p align="center">
<img src="assets/image/fig1_overview.jpg"/>
<img src="assets/image/fig1_overview.jpg" style="max-width: none;"/>
<br/>
Fig.2 VLDM
Fig.2 Architecture of the first stage.
<p>
</center>
@ -273,27 +271,28 @@ Below are some examples generated by the model:
</table>
</center>
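> [<font color="#dd0000">2023.08.25 更新</font>] ModelScope发布1.8.4版本I2VGen-XL模型更新到模型参数文件 v1.1.0;
> [<font color="#dd0000">2023.08.25 update</font>] ModelScope 1.8.4 has been released, and the I2VGen-XL model has been updated to model parameter file v1.1.0.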
### 依赖项 (Dependency)
首先你需要确定你的系统安装了*ffmpeg*命令,如果没有,可以通过以下命令来安装:
首先你需要确定你的系统安装了`ffmpeg`命令,如果没有,可以通过以下命令来安装:
First, you need to ensure that your system has installed the ffmpeg command. If it is not installed, you can install it using the following command:
First, make sure that the `ffmpeg` command is installed on your system. If it is not, you can install it with the following command:
```bash
sudo apt-get update && sudo apt-get install -y ffmpeg libsm6 libxext6
```
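Before running the pipeline, it can help to confirm from Python that `ffmpeg` is actually reachable. The following is a minimal sketch using only the standard library; the helper name `find_ffmpeg` is our own and is not part of ModelScope.

```python
import shutil


def find_ffmpeg(name: str = "ffmpeg"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg was not found on PATH; install it before generating videos.")
    else:
        print(f"ffmpeg found at {path}")
```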
其次,本**MS-Image2Video**项目适配ModelScope代码库以下是本项目需要安装的部分依赖项。
其次,本**Image-to-Video**项目适配ModelScope代码库以下是本项目需要安装的部分依赖项。
The **MS-Image2Video** project is compatible with the ModelScope codebase, and the following are some of the dependencies that need to be installed for this project.
The **Image-to-Video** project is compatible with the ModelScope codebase, and the following are some of the dependencies that need to be installed for this project.
```bash
pip install modelscope==1.4.2
pip install -U xformers
pip install modelscope==1.8.4
pip install xformers==0.0.20
pip install torch==2.0.1
pip install "open_clip_torch>=2.0.2"
pip install opencv-python-headless
@ -304,6 +303,7 @@ pip install fairscale
pip install scipy
pip install imageio
pip install pytorch-lightning
pip install torchsde
```
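Because several packages are pinned to exact versions, a quick check of the active environment may save some debugging. The sketch below compares installed versions against the pins in the install commands above. It assumes that a local build suffix such as `+cu118` should be ignored, and the helper names are hypothetical.

```python
from importlib import metadata

# Exact pins copied from the install commands above.
PINNED = {"modelscope": "1.8.4", "xformers": "0.0.20", "torch": "2.0.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Assumption: ignore a local build suffix such as "+cu118" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found}")
```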
@ -319,36 +319,59 @@ For more experiments, please stay tuned for our upcoming technical report and op
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys
pipe = pipeline("image-to-video", 'damo/Image-to-Video')
pipe = pipeline(task="image-to-video", model='damo/Image-to-Video', model_revision='v1.1.0', device='cuda:0')
# IMG_PATH: your image path (url or local file)
output_video_path = pipe(IMG_PATH, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print(output_video_path)
```
如果想生成超分视频的话, 示例见下:
If you want to generate high-resolution video, please use the following code:
```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys
# With a single GPU, make sure it has more than 50 GB of memory; otherwise use two GPUs and assign one to each pipeline via `device`
pipe1 = pipeline(task='image-to-video', model='damo/Image-to-Video', model_revision='v1.1.0', device='cuda:0')
pipe2 = pipeline(task='video-to-video', model='damo/Video-to-Video', model_revision='v1.1.0', device='cuda:0')
# image to video
output_video_path = pipe1("test.jpg", output_video='./i2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]
# video super-resolution
p_input = {'video_path': output_video_path}
new_output_video_path = pipe2(p_input, output_video='./v2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]
```
更多超分细节, 请访问 <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a>。 我们也提供了用户接口,请移步<a href="https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary">I2VGen-XL-Demo</a>
Please visit <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a> for more details on super-resolution. We also provide a user interface: <a href="https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary">I2VGen-XL-Demo</a>.
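The code comment above suggests either one GPU with more than 50 GB of memory or two GPUs selected through `device`. One way to choose the two `device` strings is sketched below. The helper `pick_devices` is hypothetical, and placing the second pipeline on `cuda:1` is our assumption rather than a documented recommendation.

```python
def pick_devices(gpu_count: int):
    """Return (image_to_video_device, video_to_video_device) for the two pipelines."""
    if gpu_count < 1:
        raise ValueError("at least one CUDA GPU is required")
    if gpu_count == 1:
        # Both stages share one GPU, which then needs more than 50 GB of memory.
        return "cuda:0", "cuda:0"
    # Assumption: with several GPUs, give each stage its own device.
    return "cuda:0", "cuda:1"
```

In practice, `gpu_count` could come from `torch.cuda.device_count()`, and the two returned strings would be passed as the `device` argument of the `image-to-video` and `video-to-video` pipelines, respectively.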
### 模型局限 (Limitation)
本**MS-Image2Video**项目的模型在处理以下情况会存在局限性:
目前,我们发现**Image-to-Video**方法在处理以下情况会存在一定的局限性:
- 小目标生成能力有限,在生成较小目标的时候,会存在一定的错误
- 快速运动目标生成能力有限,当生成快速运动目标时,会存在一定的假象
- 快速运动目标生成能力有限,当生成快速运动目标时,可能会出现一些假象和不合理的情况
- 生成速度较慢,生成高清视频会明显导致生成速度减慢
此外,我们研究也发现,生成的视频空间上的质量和时序上的变化速度在一定程度上存在互斥现象,在本项目我们选择了其折中的模型,兼顾两的平衡。
此外,我们研究也发现,生成的视频空间上的质量和时序上的变化速度在一定程度上存在互斥现象,在本项目我们选择了其折中的模型,兼顾两者间的平衡。
The model of the **MS-Image2Video** project has limitations in the following scenarios:
- Limited ability to generate small objects: There may be some errors when generating smaller objects.
- Limited ability to generate fast-moving objects: There may be some artifacts when generating fast-moving objects.
- Slow generation speed: Generating high-definition videos significantly slows down the generation speed.
Currently, we have found certain limitations of the Image-to-Video method in handling the following situations:
- Limited ability to generate small objects. There may be some errors when generating smaller objects.
- Limited ability to generate fast-moving objects. There may be some artifacts when generating fast-moving objects.
- Slow generation speed. Generating high-definition videos significantly slows down the generation speed.
Additionally, our research has found that there is a trade-off between the spatial quality and temporal variability of the generated videos. In this project, we have chosen a model that strikes a balance between the two.
*如果您正在尝试使用我们的模型,我们建议您首先在使用第一阶段得到满意的符合语义的视频之后,再尝试第二阶段的调整(因为该过程比较耗时),这样可以提高您的使用效率,更容易得到更好的结果。*
**如果您正在尝试使用我们的模型,我们建议您首先在第一阶段得到语义符合预期的视频后(离线运行的时候可以修改`configuration.json`文件中的`Seed`生成不同视频),再尝试第二阶段的视频修正(因为该过程比较耗时),这样可以提高您的使用效率,更容易得到更好的结果。**
**If you are trying to use our model, we suggest that you first obtain semantic-expected videos in the first stage (you can modify the `Seed` in the `configuration.json` file when running offline to generate different videos). Then, you can try video refining in the second stage (as this process takes more time). This will improve your efficiency and make it easier to achieve better results.**
*If you are trying to use our model, we recommend that you first focus on obtaining satisfactory semantic-consistent videos using the first stage before attempting adjustments in the second stage (as this process can be time-consuming). This approach will improve your efficiency and increase the likelihood of achieving better results.*
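The advice above mentions changing `Seed` in `configuration.json` when running offline. The README does not say where in the file `Seed` lives or where the file is stored, so the sketch below makes two assumptions: you pass the path to your local copy yourself, and every key named `Seed` found anywhere in the JSON tree should be updated. The helper names are hypothetical.

```python
import json
from pathlib import Path


def set_key_everywhere(node, key, value):
    """Set every occurrence of `key` in nested dicts/lists; return how many were set."""
    changed = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                changed += 1
            else:
                changed += set_key_everywhere(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            changed += set_key_everywhere(item, key, value)
    return changed


def update_seed(config_path, seed, key="Seed"):
    """Rewrite the JSON file at config_path with every `key` entry set to `seed`."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    changed = set_key_everywhere(config, key, seed)
    if changed == 0:
        raise KeyError(f"{key!r} not found in {path}")
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
    return changed
```

Calling `update_seed` with the path to your local `configuration.json` and a new integer before each first-stage run should then produce a different video per seed, so the slower second stage only needs to run on a result you are satisfied with.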
## 训练数据介绍 (Training Data)
@ -362,15 +385,15 @@ Additionally, our research has found that there is a trade-off between the spati
Our training data mainly comes from various sources and has the following attributes:
- Mixed training: The model is trained with a 7:1 ratio of video to image to ensure the quality of video generation.
- Wide class distribution: The data set covers most real-world categories, including people, animals, locomotives, science fiction, scenes, etc. with a total volume of billions of data points.
- Wide source distribution: The data comes from open-source data, video websites, and other internal sources, with varying resolutions and aspect ratios.
- High-quality data construction: To improve the quality of the model-generated videos, we constructed approximately 200,000 high-quality data pairs for fine-tuning the pre-training model.
- Mixed training. The model is trained with a 7:1 ratio of video to image to ensure the quality of video generation.
- Wide class distribution. The data set covers most real-world categories, including people, animals, locomotives, science fiction, scenes, etc., with a total volume of billions of data points.
- Wide source distribution. The data comes from open-source data, video websites, and other internal sources, with varying resolutions and aspect ratios.
- High-quality data construction. To improve the quality of the model-generated videos, we constructed approximately 200,000 high-quality data pairs for fine-tuning the pre-trained model.
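To make the 7:1 video-to-image mix concrete, the sketch below draws modality labels with that weighting. The README only states the ratio. Whether the actual training applies it per sample, per batch, or in some other way is not described, so the per-sample sampling here is purely an illustrative assumption.

```python
import random

VIDEO_TO_IMAGE_RATIO = (7, 1)  # ratio stated in the list above


def sample_modalities(n, rng):
    """Draw n modality labels, picking "video" 7 times as often as "image"."""
    return rng.choices(["video", "image"], weights=VIDEO_TO_IMAGE_RATIO, k=n)


if __name__ == "__main__":
    labels = sample_modalities(8000, random.Random(0))
    print("video share:", labels.count("video") / len(labels))
```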
相关的技术文档正在撰写中,欢迎及时关注。
更强更灵活的视频生成模型会持续发布,及其背后技术报告正在撰写中,欢迎及时关注。
The relevant technical report is currently being written, and we welcome you to stay tuned for updates.
More powerful and flexible video generation models will continue to be released, and the accompanying technical report is currently being written. Please stay tuned for updates.
## 相关论文以及引用信息 (Reference)

Binary file not shown (size before: 2.5 MiB, after: 1.6 MiB).