Video to Music Moment Retrieval

Zijie Xin^{♪ *}, Minquan Wang^♫, Ye Ma^♫, Bo Wang^♫, Quan Chen^♫, Peng Jiang^♫, Xirong Li^♪

^♪ Renmin University of China ^♫ Kuaishou Technology
^* Work done during internship at Kuaishou Technology

The proposed Video-to-Music Moment Retrieval (VMMR) task versus the conventional Video-to-Music Retrieval (VMR) task.

Dataset: Ad-Moment

We gather and clean approximately 50k short advertising videos with their corresponding music data, and propose an automated weakly-supervised pipeline for generating music moment timestamp annotations.

Conceptual diagram of the weakly-supervised multi-modal timestamp collection pipeline.

Overview of the Ad-Moment dataset.

Method: ReaL Framework

Stage I: Video-to-Music Retrieval

Illustration of the retrieval model in Stage I.
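As a rough illustration of what a video-to-music retrieval stage typically does, the sketch below ranks candidate music tracks by cosine similarity to a query video embedding. It is not the retrieval model shown in the figure: the embedding vectors, the dictionary of tracks, and the names `cosine` and `rank_tracks` are all hypothetical, and cosine similarity is assumed here as the matching score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_tracks(video_emb, track_embs, top_k=5):
    """Return the top_k (track_id, score) pairs, best match first.

    video_emb  : list of floats, hypothetical embedding of the query video
    track_embs : dict mapping track_id -> list of floats (hypothetical music embeddings)
    """
    scored = [(track_id, cosine(video_emb, emb)) for track_id, emb in track_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    tracks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(rank_tracks([1.0, 0.0], tracks, top_k=2))
```

In a two-stage setup like the one described here, the top-ranked tracks from such a step would be handed to the Stage II localization model.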

Stage II: Music Moment Localization

Illustration of the proposed video-music moment localization model, Music-DETR, which comprises music and video temporal modeling, a cross-modal fusion encoder, and a DETR-based decoder. Following DETR, the decoder performs moment localization. We initialize the moment queries with video embeddings, from which the model predicts the span range, the moment classification, and the moment embedding. Additionally, we optimize the alignment between the video and moment embeddings, with audio as an auxiliary signal, to further constrain training and improve performance.
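To make the "span range" output more concrete, here is a minimal sketch of how a DETR-style span prediction can be turned into timestamps and compared with a reference moment. It assumes the common convention of a normalized (center, width) span, which this page does not confirm for Music-DETR; the function names and the use of temporal IoU as the comparison measure are likewise illustrative assumptions.

```python
def span_cw_to_seconds(center, width, track_duration):
    """Convert a normalized (center, width) span into (start, end) in seconds.

    center, width  : floats in [0, 1], assumed output convention of the decoder
    track_duration : length of the music track in seconds
    The span is clipped so it stays inside the track.
    """
    start = max(0.0, center - width / 2.0) * track_duration
    end = min(1.0, center + width / 2.0) * track_duration
    return start, end


def temporal_iou(a, b):
    """Temporal intersection-over-union of two (start, end) spans in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = span_cw_to_seconds(0.5, 0.5, 100.0)  # middle half of a 100 s track
    reference = (30.0, 80.0)                         # hypothetical annotated moment
    print(predicted, temporal_iou(predicted, reference))
```

A predicted moment that overlaps the annotated one closely yields an IoU near 1, while a disjoint prediction yields 0.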