Illustration of the retrieval model in stage I.

Adding suitable background music helps complete a short video before it is shared. To automate this task, previous research has focused on video-to-music retrieval (VMR), which aims to find, within a music collection, the track that best matches the content of a given video. Because music tracks are typically much longer than short videos, the returned track has to be cut to a shorter moment, which leaves a clear gap between VMR and the practical need. To bridge this gap, we propose video-to-music moment retrieval (VMMR) as a new task. To tackle it, we build a comprehensive dataset, Ad-Moment, which contains 50K short videos annotated with music moments, and develop a two-stage approach. Given a test video, the most similar music is first retrieved from a given collection; a Transformer-based music moment localization is then performed. We term this approach Retrieval and Localization (ReaL). Extensive experiments on real-world datasets verify the effectiveness of the proposed method for VMMR.
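The two-stage flow described above (retrieve the most similar music, then localize a moment inside it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ReaL implementation: cosine similarity over precomputed embeddings stands in for the stage-I retrieval model, the `localize` callback stands in for the Transformer-based localizer, and all names (`cosine`, `retrieve`, `retrieve_and_localize`) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Moment = Tuple[float, float]  # (start_seconds, end_seconds)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(video_emb: Vector, music_embs: Dict[str, Vector], top_k: int = 3) -> List[str]:
    """Stage I (assumed form): rank music tracks by similarity to the video."""
    ranked = sorted(
        music_embs,
        key=lambda music_id: cosine(video_emb, music_embs[music_id]),
        reverse=True,
    )
    return ranked[:top_k]


def retrieve_and_localize(
    video_emb: Vector,
    music_embs: Dict[str, Vector],
    localize: Callable[[str], Moment],
    top_k: int = 3,
) -> List[Tuple[str, Moment]]:
    """Stage II (placeholder): run a localizer on each retrieved track."""
    return [(music_id, localize(music_id)) for music_id in retrieve(video_emb, music_embs, top_k)]


# Toy usage with made-up 2-D embeddings and a dummy localizer that always
# returns the first 12.8 seconds of the track.
music = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
results = retrieve_and_localize([1.0, 0.1], music, lambda music_id: (0.0, 12.8), top_k=2)
```

Keeping the localizer behind a callback makes the retrieval stage testable on its own and allows a learned model to be plugged in later without changing the surrounding code.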
We gather and clean approximately 50k short advertising videos with their corresponding music data, and propose an automated weakly-supervised pipeline for generating music moment timestamp annotations.
Illustration of the proposed video-music moment localization model, Music-DETR, which is composed of music/video temporal modeling, a cross-modal fusion encoder, and a DETR-based decoder. Following DETR, the decoder performs the moment localization task. We use video embeddings to initialize the moment queries, enabling the prediction of the span range, moment classification, and moment embedding. Additionally, we optimize the alignment between the video and moment embeddings, with audio as an auxiliary signal, to further constrain the training process and improve performance.
Music Moment Localization (MML) results
Video-to-Music Moment Retrieval (VMMR) results
Qualitative examples (figure panels: ground truth music, ranked candidate music, ground truth moment, and ranked moments):

Video duration: 12.80s. Ground truth moment: 47.0-59.8s. Moment rank1: 0.01-12.7s; rank2: 0.08-12.7s; rank3: 0.07-12.7s; rank4: 0.17-30.5s.
Video duration: 30.31s. Ground truth moment: 10.07-40.4s. Moment rank1: 60.47-90.9s; rank2: 3.82-34.0s; rank3: 41.44-72.4s.
Video duration: 15.60s. Ground truth moment: 0-15.6s. Moment rank1: 0.02-15.3s; rank2: 0.07-15.5s; rank3: 0.06-15.5s.
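
One way to read the timestamps above is through temporal intersection-over-union (IoU), a common overlap measure for moment localization. The text here does not state which metric is used, so the sketch below is an assumption for illustration only, and `temporal_iou` is a hypothetical helper name. IoU is only meaningful when the predicted moment and the ground truth moment lie in the same music track.

```python
from typing import Tuple

Moment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Overlap of two time spans divided by the length of their union."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Spans taken from the examples listed above.
iou_disjoint = temporal_iou((0.01, 12.7), (47.0, 59.8))   # 12.80s video, rank1
iou_partial = temporal_iou((3.82, 34.0), (10.07, 40.4))   # 30.31s video, rank2
iou_close = temporal_iou((0.02, 15.3), (0.0, 15.6))       # 15.60s video, rank1
```

Under this measure, the first pair does not overlap at all (IoU of 0), the second overlaps partially (about 0.65), and the third almost coincides with the ground truth (about 0.98).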