Update README.md

FaceZhao
2023-08-25 06:51:51 +00:00
parent ba65b78a0a
commit b15a0a3091

@ -41,15 +41,15 @@ widgets:
task: image-to-video task: image-to-video
--- ---
# Image-to-Video # Image-to-Video (I2VGen-XL)
本项目**MS-Image2Video**旨在解决根据输入图像生成高清视频任务。**MS-Image2Video**由达摩院研发的高清视频生成基础模型其核心部分包含两个阶段分别解决语义一致性和清晰度的问题参数量共计约37亿模型经过在大规模视频和图像数据混合预训练并在少量精品数据上微调得到该数据分布广泛、类别多样化模型对不同的数据均有良好的泛化性。项目于现有的视频生成模型**MS-Image2Video**在清晰度、质感、语义、时序连续性等方面均具有明显的优势。 本项目**I2VGen-XL**旨在解决根据输入图像生成高清视频任务。**I2VGen-XL**由达摩院研发的高清视频生成基础模型其核心部分包含两个阶段分别解决语义一致性和清晰度的问题参数量共计约37亿模型经过在大规模视频和图像数据混合预训练并在少量精品数据上微调得到该数据分布广泛、类别多样化模型对不同的数据均有良好的泛化性。项目于现有的视频生成模型**I2VGen-XL**在清晰度、质感、语义、时序连续性等方面均具有明显的优势。
此外,**MS-Image2Video**的许多设计理念继承于我们已经公开的工作**VideoComposer**,您可以参考我们的[VideoComposer](https://videocomposer.github.io)和本项目的Github代码库了解详细细节 此外,**I2VGen-XL**的许多设计理念继承于我们已经公开的工作**VideoComposer**,您可以参考我们的[VideoComposer](https://videocomposer.github.io)和本项目的Github代码库了解详细细节
The **MS-Image2Video** project aims to address the task of generating high-definition videos based on input images. Developed by Alibaba Cloud, the **MS-Image2Video** is a fundamental model for generating high-definition videos. Its core components consist of two stages that address the issues of semantic consistency and clarity, totaling approximately 3.7 billion parameters. The model is pre-trained on a large-scale mix of video and image data and fine-tuned on a small number of high-quality data sets with a wide range of distributions and diverse categories. The model demonstrates good generalization capabilities for different data types. Compared to existing video generation models, **MS-Image2Video** has significant advantages in terms of clarity, texture, semantics, and temporal continuity. The **I2VGen-XL** project aims to address the task of generating high-definition videos based on input images. Developed by Alibaba Cloud, the **I2VGen-XL** is a fundamental model for generating high-definition videos. Its core components consist of two stages that address the issues of semantic consistency and clarity, totaling approximately 3.7 billion parameters. The model is pre-trained on a large-scale mix of video and image data and fine-tuned on a small number of high-quality data sets with a wide range of distributions and diverse categories. The model demonstrates good generalization capabilities for different data types. Compared to existing video generation models, **I2VGen-XL** has significant advantages in terms of clarity, texture, semantics, and temporal continuity.
Additionally, many of the design concepts for **MS-Image2Video** are inherited from our publicly available work, **VideoComposer**. For detailed information, please refer to our [VideoComposer](https://videocomposer.github.io) and the Github code repository for this project. Additionally, many of the design concepts for **I2VGen-XL** are inherited from our publicly available work, **VideoComposer**. For detailed information, please refer to our [VideoComposer](https://videocomposer.github.io) and the Github code repository for this project.
<center> <center>
<p align="center"> <p align="center">
@ -61,7 +61,7 @@ Additionally, many of the design concepts for **MS-Image2Video** are inherited f
## 模型介绍 (Introduction) ## 模型介绍 (Introduction)
**MS-Image2Video**建立在Stable Diffusion之上如图Fig.2所示通过专门设计的时空UNet在隐空间中进行时空建模并通过解码器重建出最终视频。为能够生成720P视频我们将**MS-Image2Video**分为两个阶段第一阶段保证语义一致性但低分辨率第二阶段通过DDIM逆运算并在新的VLDM上进行去噪以提高视频分辨率以及同时提升时间和空间上的一致性。通过在模型、训练和数据上的联合优化本项目主要具有以下几个特点 **I2VGen-XL**建立在Stable Diffusion之上如图Fig.2所示通过专门设计的时空UNet在隐空间中进行时空建模并通过解码器重建出最终视频。为能够生成720P视频我们将**I2VGen-XL**分为两个阶段第一阶段保证语义一致性但低分辨率第二阶段通过DDIM逆运算并在新的VLDM上进行去噪以提高视频分辨率以及同时提升时间和空间上的一致性。通过在模型、训练和数据上的联合优化本项目主要具有以下几个特点
- 高清&宽屏可以直接生成720P(1280*720)分辨率的视频,且相比于现有的开源项目,不仅分辨率得到有效提高,其生产的宽屏视频可以适合更多的场景 - 高清&宽屏可以直接生成720P(1280*720)分辨率的视频,且相比于现有的开源项目,不仅分辨率得到有效提高,其生产的宽屏视频可以适合更多的场景
- 无水印,模型通过我们内部大规模无水印视频/图像训练,并在高质量数据微调得到,生成的无水印视频可适用更多视频平台,减少许多限制 - 无水印,模型通过我们内部大规模无水印视频/图像训练,并在高质量数据微调得到,生成的无水印视频可适用更多视频平台,减少许多限制
@ -70,7 +70,7 @@ Additionally, many of the design concepts for **MS-Image2Video** are inherited f
以下为生成的部分案例: 以下为生成的部分案例:
**MS-Image2Video** is built on Stable Diffusion, as shown in Fig.2, and uses a specially designed spatiotemporal UNet to perform spatiotemporal modeling in the latent space, and then reconstructs the final video through the decoder. In order to generate 720P videos, **MS-Image2Video** is divided into two stages. The first stage guarantees semantic consistency but with low resolution, while the second stage uses the DDIM inverse operation and applies denoising on a new VLDM to improve the resolution and spatiotemporal consistency of the video. Through joint optimization of the model, training, and data, this project has the following characteristics: **I2VGen-XL** is built on Stable Diffusion, as shown in Fig.2, and uses a specially designed spatiotemporal UNet to perform spatiotemporal modeling in the latent space, and then reconstructs the final video through the decoder. In order to generate 720P videos, **I2VGen-XL** is divided into two stages. The first stage guarantees semantic consistency but with low resolution, while the second stage uses the DDIM inverse operation and applies denoising on a new VLDM to improve the resolution and spatiotemporal consistency of the video. Through joint optimization of the model, training, and data, this project has the following characteristics:
- High-definition & widescreen, can directly generate 720P (1280*720) resolution videos, and compared to existing open source projects, not only is the resolution effectively improved, but the widescreen videos it produces can also be suitable for more scenarios. - High-definition & widescreen, can directly generate 720P (1280*720) resolution videos, and compared to existing open source projects, not only is the resolution effectively improved, but the widescreen videos it produces can also be suitable for more scenarios.
- No watermark, the model is trained on a large-scale watermark-free video/image dataset internally and fine-tuned on high-quality data, generating watermark-free videos that can be applied to more video platforms and reducing many restrictions. - No watermark, the model is trained on a large-scale watermark-free video/image dataset internally and fine-tuned on high-quality data, generating watermark-free videos that can be applied to more video platforms and reducing many restrictions.
@ -287,9 +287,9 @@ sudo apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
``` ```
其次,本**MS-Image2Video**项目适配ModelScope代码库以下是本项目需要安装的部分依赖项。 其次,本**I2VGen-XL**项目适配ModelScope代码库以下是本项目需要安装的部分依赖项。
The **MS-Image2Video** project is compatible with the ModelScope codebase, and the following are some of the dependencies that need to be installed for this project. The **I2VGen-XL** project is compatible with the ModelScope codebase, and the following are some of the dependencies that need to be installed for this project.
```bash ```bash
@ -336,8 +336,9 @@ If you want to generate high-resolution video, please use the following code:
from modelscope.pipelines import pipeline from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys from modelscope.outputs import OutputKeys
pipe1 = pipeline(task='image-to-video', model='damo/Image-to-Video', model_revision='v1.1.0') # With a single GPU, make sure it has more than 50 GB of memory; alternatively, use two GPUs and assign one to each pipeline via `device`
pipe2 = pipeline(task='video-to-video', model='damo/Video-to-Video', model_revision='v1.1.0') pipe1 = pipeline(task='image-to-video', model='damo/Image-to-Video', model_revision='v1.1.0', device='cuda:0')
pipe2 = pipeline(task='video-to-video', model='damo/Video-to-Video', model_revision='v1.1.0', device='cuda:0')
# image to video # image to video
output_video_path = pipe1("test.jpg", output_video='./i2v_output.mp4')[OutputKeys.OUTPUT_VIDEO] output_video_path = pipe1("test.jpg", output_video='./i2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]
@ -346,14 +347,14 @@ output_video_path = pipe1("test.jpg", output_video='./i2v_output.mp4')[OutputKey
p_input = {'video_path': output_video_path} p_input = {'video_path': output_video_path}
new_output_video_path = pipe2(p_input, output_video='./v2v_output.mp4')[OutputKeys.OUTPUT_VIDEO] new_output_video_path = pipe2(p_input, output_video='./v2v_output.mp4')[OutputKeys.OUTPUT_VIDEO]
``` ```
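The comment in the updated snippet says that two GPUs can be used by setting `device`. The sketch below shows one way to pick the two device strings. It assumes the "bigger than 50G" figure from that comment means 50 GB and applies when both pipelines share one card. `choose_devices` is a hypothetical helper, not part of ModelScope, and the README does not say how much memory each card needs in the two-GPU case.

```python
from typing import List, Tuple

# Threshold taken from the README comment ("bigger than 50G"); treated here as GB.
SINGLE_GPU_MIN_GB = 50.0


def choose_devices(gpu_memory_gb: List[float]) -> Tuple[str, str]:
    """Return (device for image-to-video, device for video-to-video).

    Hypothetical helper: gpu_memory_gb[i] is the total memory of GPU i in GB.
    """
    if not gpu_memory_gb:
        raise ValueError("no GPU available")
    # Prefer a single card that is large enough to hold both pipelines.
    for index, memory in enumerate(gpu_memory_gb):
        if memory > SINGLE_GPU_MIN_GB:
            return f"cuda:{index}", f"cuda:{index}"
    # Otherwise place one pipeline on each of the first two cards (assumption:
    # the README gives no per-card requirement for this case).
    if len(gpu_memory_gb) >= 2:
        return "cuda:0", "cuda:1"
    raise ValueError("a single GPU needs more than 50 GB of memory")


if __name__ == "__main__":
    i2v_device, v2v_device = choose_devices([24.0, 24.0])
    print(i2v_device, v2v_device)
    # The strings would then be passed as in the README snippet, e.g.
    # pipe1 = pipeline(task='image-to-video', model='damo/Image-to-Video',
    #                  model_revision='v1.1.0', device=i2v_device)
    # pipe2 = pipeline(task='video-to-video', model='damo/Video-to-Video',
    #                  model_revision='v1.1.0', device=v2v_device)
```

Keeping the device choice in a small pure function makes it possible to check the placement logic without loading either model.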
更多超分细节, 请访问 <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a> 更多超分细节, 请访问 <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a>我们也提供了用户接口,请移步<a href="https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary">I2VGen-XL-Demo</a>
Please visit <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a> for more details. Please visit <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">Video-to-Video</a> for more details. We also provide a user interface: <a href="https://modelscope.cn/studios/damo/I2VGen-XL-Demo/summary">I2VGen-XL-Demo</a>.
### 模型局限 (Limitation) ### 模型局限 (Limitation)
本**MS-Image2Video**项目的模型在处理以下情况会存在局限性: 本**I2VGen-XL**项目的模型在处理以下情况会存在局限性:
- 小目标生成能力有限,在生成较小目标的时候,会存在一定的错误 - 小目标生成能力有限,在生成较小目标的时候,会存在一定的错误
- 快速运动目标生成能力有限,当生成快速运动目标时,会存在一定的假象 - 快速运动目标生成能力有限,当生成快速运动目标时,会存在一定的假象
- 生成速度较慢,生成高清视频会明显导致生成速度减慢 - 生成速度较慢,生成高清视频会明显导致生成速度减慢
@ -361,7 +362,7 @@ Please visit <a href="https://modelscope.cn/models/damo/Video-to-Video/summary">
此外,我们研究也发现,生成的视频空间上的质量和时序上的变化速度在一定程度上存在互斥现象,在本项目我们选择了其折中的模型,兼顾两则的平衡。 此外,我们研究也发现,生成的视频空间上的质量和时序上的变化速度在一定程度上存在互斥现象,在本项目我们选择了其折中的模型,兼顾两则的平衡。
The model of the **MS-Image2Video** project has limitations in the following scenarios: The model of the **I2VGen-XL** project has limitations in the following scenarios:
- Limited ability to generate small objects: There may be some errors when generating smaller objects. - Limited ability to generate small objects: There may be some errors when generating smaller objects.
- Limited ability to generate fast-moving objects: There may be some artifacts when generating fast-moving objects. - Limited ability to generate fast-moving objects: There may be some artifacts when generating fast-moving objects.
- Slow generation speed: Generating high-definition videos significantly slows down the generation speed. - Slow generation speed: Generating high-definition videos significantly slows down the generation speed.