Music Grounding by Short Video

Zijie Xin Minquan Wang Jingyu Liu Ye Ma Quan Chen Peng Jiang Xirong Li♪,†
Renmin University of China        Kuaishou Technology

ICCV 2025

Music grounding by short video (MGSV) aims to localize, within a music-track collection, the music moment that best serves as background music for the query video. Text in the callout boxes to the right of each music moment is for illustrative purposes only.

TL;DR

This paper introduces Music Grounding by Short Video (MGSV), a new task that aims to localize background music segments suitable for short videos, supported by a large-scale real-world benchmark, and proposes a unified baseline model that jointly performs music matching and moment localization.
BGM Showcase Generated by our Model MaDe

Abstract

Adding proper background music helps complete a short video for sharing. Previous work tackles the task by video-to-music retrieval (V2MR), aiming to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. To bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed Music Grounding by Short Video (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53k short videos associated with 35k different music moments from 4k unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music moment detection within a unified end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging nature of MGSV but also establish MaDe as a strong baseline.

Dataset for MGSV: MGSV-EC

In order to develop and evaluate solutions for the new MGSV task, we build MGSV-EC, a dataset consisting of 53k professionally made E-commerce videos, with their BGMs extracted from a set of 4k unique music tracks.
Dataset Curation
After curation, we preserve 53,194 short videos, paired with 35,393 different music moments from 4,050 unique music tracks. Among the tracks, the ratio of songs to instrumental music is approximately 6:4.
The videos are 23.9 seconds long on average, whilst the music tracks average 138.9 seconds, indicating a noticeable duration gap between the query videos and the music tracks to be retrieved.
Visualization of MGSV-EC. Note that the video tag cloud is merely to demonstrate the richness of the visual content.
For reproducible research, we recommend the following data split: a set of 2k videos is randomly sampled to form a held-out test set, another random set of 2k videos serves as a validation set (for hyper-parameter tuning, model selection, etc.), and the remaining data is used for training.
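For illustration, a minimal sketch of such a random split is given below; the DataFrame and its column layout are hypothetical stand-ins, and the officially released split files should be used for exact reproducibility.

```python
# A minimal sketch of the recommended random split; `meta` is a
# hypothetical per-video metadata table, not the released file format.
import pandas as pd

def split_mgsv_ec(meta: pd.DataFrame, seed: int = 42):
    """Hold out 2k videos for test and 2k for validation; the rest train."""
    shuffled = meta.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    test = shuffled.iloc[:2000]
    val = shuffled.iloc[2000:4000]
    train = shuffled.iloc[4000:]
    return train, val, test
```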
Data Split
Overview of the MGSV-EC dataset. While the query videos do not overlap across train / val. / test, music tracks are partially shared across the splits. As music tracks are meant to be reused, such a setup is practical.
Evaluation Protocol
Our benchmark supports two evaluation modes: single-music, wherein the music track relevant to a given query video is known in advance, and music-set, wherein the relevant music track has to be retrieved from a given set of music tracks. A minimal code sketch of both modes follows the figure below.
Evaluation modes, (sub-)tasks and metrics.
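To make the two modes concrete, here is an illustrative sketch assuming a model that, given a (video, track) pair, returns the relevance score and a moment prediction; `model` follows the MaDe interface described later, but the function itself is not a released API.

```python
# Illustrative only: `model(video, track)` is assumed to return a triple
# (p_s, p_c, p_w) of floats, as produced by MaDe (see below).
def ground(model, video, candidate_tracks, mode="music-set"):
    if mode == "single-music":
        # The relevant track is known in advance; only localize the moment.
        (track,) = candidate_tracks
        _, p_c, p_w = model(video, track)
        return track, (p_c, p_w)
    # music-set: rank all candidates by relevance p_s, then keep the
    # moment detected within the top-ranked track.
    best_track, best_moment, best_score = None, None, float("-inf")
    for track in candidate_tracks:
        p_s, p_c, p_w = model(video, track)
        if p_s > best_score:
            best_track, best_moment, best_score = track, (p_c, p_w), p_s
    return best_track, best_moment
```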

A Strong Baseline for MGSV: MaDe

The overall structure of our network is illustrated below. Conceptually, MaDe consists of three modules:
  1. Multimodal Embedding that converts the given query video and music track to a sequence of visual and audio tokens, respectively. It consists of two steps, Raw Feature Extraction and Unimodal Feature Enhancement.
  2. Video-Music Matching which estimates the relevance of the music track w.r.t. the video based on their holistic embeddings.
  3. Music Moment Detection that localizes within the music track the moment best fitting the video.
Conceptual diagram of the proposed MaDe network for MGSV.
At a high level, given a query video \(v\) and a music track \(M\) as multi-modal input, MaDe yields a three-dimensional output \((p_s, p_c, p_w)\), where \(p_s\) measures the relevance of \(M\) w.r.t. \(v\), whilst \(p_c\) / \(p_w\) indicates the center / width of the detected music moment.

The video is initially represented by a sequence of \(F\) evenly sampled frames, and the music track by a sequence of \(S\) partially overlapping Mel-spectrogram segments. Pre-trained CLIP (ViT-B/32) and Audio Spectrogram Transformer (AST) are used as our (weight-frozen) video and audio encoders, respectively. Each encoder is followed by an FC layer that compresses the frame / audio embeddings to a smaller size of \(d\) (= 256) for cross-modal feature alignment and for parameter reduction.

In the single-music mode, where \(M\) is known to be relevant w.r.t. \(v\), \(p_s\) is ignored. In the music-set mode, we rank all candidate music tracks by their \(p_s\) and accordingly select the moment detected from the top-ranked music track as the grounding result.
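The PyTorch sketch below mirrors this interface under simplifying assumptions: unimodal feature enhancement is reduced to a single Transformer encoder layer per modality, holistic embeddings are obtained by mean pooling, and the prediction heads are plain linear layers. None of these internals is claimed to match the actual MaDe design.

```python
import torch
import torch.nn as nn

class MaDeSketch(nn.Module):
    """Simplified stand-in for MaDe; internals are illustrative only."""
    def __init__(self, d=256, clip_dim=512, ast_dim=768):
        super().__init__()
        self.v_proj = nn.Linear(clip_dim, d)  # compress CLIP frame embeddings
        self.a_proj = nn.Linear(ast_dim, d)   # compress AST segment embeddings
        self.v_enh = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.a_enh = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.score_head = nn.Linear(2 * d, 1)   # p_s: video-music relevance
        self.moment_head = nn.Linear(2 * d, 2)  # (p_c, p_w): center / width

    def forward(self, frame_feats, audio_feats):
        # frame_feats: (B, F, clip_dim) from the frozen CLIP ViT-B/32 encoder
        # audio_feats: (B, S, ast_dim) from the frozen AST encoder
        v = self.v_enh(self.v_proj(frame_feats))
        a = self.a_enh(self.a_proj(audio_feats))
        h = torch.cat([v.mean(dim=1), a.mean(dim=1)], dim=-1)  # holistic embs
        p_s = self.score_head(h).squeeze(-1)
        # sigmoid keeps center / width in [0, 1], i.e. normalized to the
        # track duration (an assumption, common in DETR-style grounding)
        p_c, p_w = torch.sigmoid(self.moment_head(h)).unbind(dim=-1)
        return p_s, p_c, p_w
```

Under the normalized-coordinate assumption, a predicted moment converts to seconds as \([(p_c - p_w/2)\,T,\ (p_c + p_w/2)\,T]\) for a track of duration \(T\).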

Evaluation

Grounding Results
For a fair and reproducible comparison on the single-music grounding (SmG) task, we opt for SOTA video temporal grounding models that are DETR-based and open-source.
To tackle the music-set grounding (MsG) task, a straightforward solution is to combine UVCOM+ with an existing V2MR model such as MVPt.
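Conceptually, such a composite solution is just retrieval followed by localization, as sketched below; `v2mr.score` and `grounder.localize` are hypothetical wrappers, not the actual MVPt / UVCOM+ interfaces.

```python
# Two-stage composite baseline (conceptual). `v2mr.score` and
# `grounder.localize` are hypothetical wrappers, not released APIs.
def composite_msg(video, music_collection, v2mr, grounder):
    # Stage 1: video-to-music retrieval (e.g. MVPt) over the collection.
    track = max(music_collection, key=lambda m: v2mr.score(video, m))
    # Stage 2: moment localization (e.g. UVCOM+) within the retrieved track.
    start, end = grounder.localize(video, track)
    return track, (start, end)
```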
Overall results. #Params excludes the (weight-frozen) video / audio encoders.
We further conduct a human evaluation on a random subset of 100 test videos to subjectively assess the quality of the MsG results.
Human evaluation results.
Generated BGM Comparisons
Original background music
BGM generated by MaDe (ours)
Composite solution
Ablation Study
For a better understanding of the superior performance of MaDe, we conduct an ablation study as follows.
Ablation study of MaDe.
We further conduct a controlled comparison at varied training-data scales. The result suggests that cross-attention (CA) based multimodal feature fusion is the source of incompatibility with the demands of the MGSV task.
SmG performance with varying training data sizes.

Misc.

BibTeX

@inproceedings{xin2025mgsv,
  title={Music Grounding by Short Video},
  author={Xin, Zijie and Wang, Minquan and Liu, Jingyu and Chen, Quan and Ma, Ye and Jiang, Peng and Li, Xirong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

Acknowledgements

This research was supported by NSFC (No.62172420) and Kuaishou. We thank Yingtong Liu, Yuchuan Deng, Yiyi Chen, and Bo Wang for valuable discussion and feedback on this research.