Video to Music Moment Retrieval

Zijie Xin♪ *, Minquan Wang, Ye Ma, Bo Wang, Quan Chen, Peng Jiang, Xirong Li
Renmin University of China         Kuaishou Technology
* Work done during internship at Kuaishou Technology

Proposed Video-to-Music Moment Retrieval (VMMR) task versus the conventional video-to-music retrieval (VMR) task.

Abstract

Adding suitable background music is an important step in preparing a short video to be shared. Towards automating this task, previous research has focused on video-to-music retrieval (VMR), which aims to find, within a music collection, the track that best matches the content of a given video. Since music tracks are typically much longer than short videos, the returned music still has to be cut to a shorter moment, leaving a clear gap between VMR and the practical need. To bridge this gap, we propose video-to-music moment retrieval (VMMR) as a new task. To tackle the new task, we build Ad-Moment, a comprehensive dataset of 50K short videos annotated with music moments, and develop a two-stage approach. Given a test video, the most similar music track is first retrieved from a given collection; a Transformer-based music moment localization is then performed on the retrieved track. We term this approach Retrieval and Localization (ReaL). Extensive experiments on real-world datasets verify the effectiveness of the proposed method for VMMR.
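To make the two-stage formulation concrete, below is a minimal inference sketch, assuming hypothetical callables video_encoder, music_encoder, and moment_localizer as stand-ins for the actual ReaL models; it illustrates the retrieve-then-localize flow only, not the exact implementation.

    import torch

    def retrieve_and_localize(video, music_collection, video_encoder, music_encoder,
                              moment_localizer, top_k=5):
        """Stage I: rank music tracks by embedding similarity.
           Stage II: localize a moment within each retrieved track."""
        v = video_encoder(video)                                        # (d,) video embedding
        m = torch.stack([music_encoder(t) for t in music_collection])   # (N, d) music embeddings
        sims = torch.nn.functional.cosine_similarity(v.unsqueeze(0), m) # (N,) similarities
        ranked = sims.argsort(descending=True)[:top_k]                  # Stage I ranking
        results = []
        for idx in ranked:
            start, end = moment_localizer(video, music_collection[idx]) # Stage II localization
            results.append({"music_idx": int(idx), "moment": (start, end),
                            "score": float(sims[idx])})
        return results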

Dataset: Ad-Moment

We collect and clean approximately 50K short advertising videos together with their corresponding music tracks, and propose an automated, weakly-supervised pipeline for generating music moment timestamp annotations.
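The exact pipeline is multi-modal; as a purely illustrative, hypothetical sketch of one ingredient, the video's own audio track can be aligned against the full music track by normalized cross-correlation to obtain a weak timestamp for the moment it was cut from.

    import numpy as np

    def locate_moment_by_audio_alignment(video_audio, full_music, sr):
        """Return (start_sec, end_sec) of the segment in `full_music` that best
        matches `video_audio`, found via normalized cross-correlation."""
        video_audio = (video_audio - video_audio.mean()) / (video_audio.std() + 1e-8)
        full_music = (full_music - full_music.mean()) / (full_music.std() + 1e-8)
        corr = np.correlate(full_music, video_audio, mode="valid")  # slide the clip over the track
        start = int(np.argmax(corr))                                # best-matching offset in samples
        return start / sr, (start + len(video_audio)) / sr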

Conceptual diagram of the weakly-supervised multi-modal timestamp collection pipeline.

Overview of the Ad-Moment dataset.

Method: ReaL Framework

Stage I: Video-to-Music Retrieval

Illustration of the retrieval model in Stage I.
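As a hedged sketch of how such a dual-encoder retrieval model is commonly trained (the model in the figure may differ), a symmetric InfoNCE objective over paired video and music embeddings could look like this:

    import torch
    import torch.nn.functional as F

    def info_nce_loss(video_emb, music_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired video/music embeddings, each (B, d)."""
        v = F.normalize(video_emb, dim=-1)
        m = F.normalize(music_emb, dim=-1)
        logits = v @ m.t() / temperature                         # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)       # matching pairs lie on the diagonal
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2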

Stage II: Music Moment Localization

Illustration of the proposed video-music moment localization model, Music-DETR, which is composed of music/video temporal modeling, a cross-modal fusion encoder, and a DETR-based decoder. Following DETR, the decoder performs the moment localization task. We use video embeddings to initialize the moment queries, enabling prediction of the span range, moment classification, and a moment embedding. Additionally, we optimize the alignment between the video and moment embeddings with an auxiliary audio objective to further constrain training and improve performance.
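Below is a hypothetical, minimal sketch in the spirit of this description: moment queries initialized from video embeddings attend to the fused video-music features and predict a span, a foreground score, and a moment embedding. Module names and sizes are illustrative assumptions, not the exact Music-DETR configuration.

    import torch
    import torch.nn as nn

    class MomentDecoder(nn.Module):
        def __init__(self, d_model=256, n_heads=8, n_layers=2):
            super().__init__()
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.span_head = nn.Linear(d_model, 2)          # normalized (center, width) of the span
            self.cls_head = nn.Linear(d_model, 1)           # moment (foreground) classification score
            self.embed_head = nn.Linear(d_model, d_model)   # moment embedding for alignment with video

        def forward(self, moment_queries, fused_memory):
            # moment_queries: (B, Q, d) queries initialized from video embeddings
            # fused_memory:   (B, T, d) output of the cross-modal fusion encoder
            h = self.decoder(moment_queries, fused_memory)
            return self.span_head(h).sigmoid(), self.cls_head(h), self.embed_head(h)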

Comparison with Other Methods

Music Moment Localization (MML) results

Video-to-Music Moment Retrieval (VMMR) results

Data Display

Input Video

Video duration: 12.80s

Retrieved Music

 

Ground truth music
Candidate music rank1
Candidate music rank2
Candidate music rank3
Candidate music rank4

Located Moment

 

Ground truth moment (47.0-59.8s)
Moment rank1 (0.01-12.7s)
Moment rank2 (0.08-12.7s)
Moment rank3 (0.07-12.7s)
Moment rank4 (0.17-30.5s)

Video duration: 30.31s

 

Ground truth music
Candidate music rank1
Candidate music rank2
Candidate music rank3

 

Ground truth moment (10.07-40.4s)
Moment rank1 (60.47-90.9s)
Moment rank2 (3.82-34.0s)
Moment rank3 (41.44-72.4s)

Video duration: 15.60s

 

Ground truth music
Candidate music rank1
Candidate music rank2
Candidate music rank3

 

Ground truth moment (0-15.6s)
Moment rank1 (0.02-15.3s)
Moment rank2 (0.07-15.5s)
Moment rank3 (0.06-15.5s)

Qualitative results of Music Moment Localization (MML)