Music Grounding by Short Video (MGSV) aims to localize, within a music-track collection, a music moment that best serves as background music for the query video. Text in the callout boxes to the right of each music moment is for illustrative purposes only.
TL;DR
This paper introduces Music Grounding by Short Video (MGSV), a new task that aims to localize background-music segments suitable for short videos. It is accompanied by a large-scale real-world benchmark and a unified baseline model that jointly performs music matching and moment localization.
BGM Showcase Generated by Our Model MaDe
Abstract
Adding proper background music helps complete a short video to be shared. Previous work
tackles the task by video-to-music retrieval (V2MR), aiming to find the most suitable music
track from a collection to match the content of a given query video. In practice, however,
music tracks are typically much longer than the query video, necessitating
(manual) trimming of the retrieved music to a shorter segment that matches the video
duration. In order to bridge the gap between the practical need for music moment
localization and V2MR, we propose a new task
termed Music Grounding by Short Video (MGSV). To tackle the new
task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53k
short videos associated with 35k different music moments from 4k unique music tracks.
Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music
matching and music moment detection within a unified
end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging
nature of MGSV but also set MaDe as a strong baseline.
Dataset for MGSV: MGSV-EC
In order to develop and evaluate solutions for the new MGSV task, we build MGSV-EC,
a dataset consisting of 53k professionally made E-commerce videos, with their BGMs extracted from a
set of 4k
unique music tracks.
Dataset Curation
We retain 53,194 short videos, accompanied by 35,393 different music moments from 4,050 unique
music tracks. Among the tracks, songs and instrumental music are in an approximately 6:4 ratio.
The video duration is 23.9 seconds on average, whilst that
of the music tracks is 138.9 seconds, indicating a noticeable
duration gap between the query videos and the music tracks
to be retrieved.
Visualization of MGSV-EC.
Note that the video tag cloud is merely to demonstrate the richness of the visual
content.
For reproducible research, we recommend a data split as follows. A set of 2k videos is randomly
sampled to form a held-out test set, while another random set of 2k videos is used as a validation
set (for hyper-parameter tuning, model selection, etc.), with the remaining data used for training.
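For illustration, a minimal sketch of this random split is given below; the video-ID file name and the seed are our own assumptions, and the released dataset may provide an official split file instead.

import random

random.seed(42)  # assumed seed, purely for illustration

# Hypothetical file listing all 53,194 video IDs, one per line.
with open("mgsv_ec_video_ids.txt") as f:
    video_ids = [line.strip() for line in f]

random.shuffle(video_ids)
test_ids  = video_ids[:2000]        # held-out test set (2k videos)
val_ids   = video_ids[2000:4000]    # validation set (2k videos)
train_ids = video_ids[4000:]        # remaining videos for training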
Overview of the MGSV-EC dataset.
While the query videos have no overlap across train / val. / test, music tracks are partially
shared across the splits. As the music tracks are meant for re-use, such a setup is practical.
Evaluation Protocol
Our benchmark supports two evaluation modes, i.e., single-music,
wherein the music track relevant w.r.t. a given query video is known in advance, and
music-set, wherein the relevant music track has to be retrieved from a given
music-track set.
Evaluation modes, (sub-)tasks and metrics.
A Strong Baseline for MGSV: MaDe
The overall structure of our network is illustrated below.
Conceptually, MaDe consists of three modules:
Multimodal Embedding, which converts the given query video and music track into
sequences of visual and audio tokens, respectively.
It consists of two steps: Raw Feature Extraction and Unimodal Feature Enhancement.
Video-Music Matching, which estimates the relevance of the music track w.r.t. the video
based on their holistic embeddings.
Music Moment Detection, which localizes within the music track the moment that best fits the video.
Conceptual diagram of the proposed MaDe network for MGSV.
At a high level, given a query video \(v\) and a music track \(M\) as multi-modal input, MaDe yields a
three-dimensional output \((p_s, p_c, p_w)\), where \(p_s\) measures the relevance of \(M\)
w.r.t. \(v\), whilst \(p_c\) / \(p_w\) indicate the center / width of the detected music
moment. The video is initially represented by a sequence of \(F\) evenly sampled frames, and the music
track as a sequence of \(S\) partially overlapped Mel-spectrogram segments. Pre-trained CLIP
(ViT-B/32) and Audio Spectrogram Transformer (AST) are used as our (weight-frozen) video and audio
encoders, respectively. Each encoder is followed by an FC layer that compresses the frame / audio embeddings to
a smaller size of \(d\) (=256) for cross-modal feature alignment and parameter reduction.
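To make the data flow concrete, the PyTorch-style sketch below outlines this forward pass under our own simplifying assumptions: pre-extracted frame / audio-segment features, mean-pooled holistic embeddings, and small linear heads. It is an illustrative skeleton (omitting, e.g., the Unimodal Feature Enhancement step), not the exact MaDe implementation.

import torch
import torch.nn as nn

class MaDeSketch(nn.Module):
    """Illustrative skeleton of MaDe's input / output interface. The frozen
    CLIP / AST encoders are assumed to run offline, so the network receives
    pre-extracted frame and audio-segment features."""

    def __init__(self, d_video=512, d_audio=768, d=256):
        super().__init__()
        # FC layers compressing frame / audio embeddings to d (=256)
        self.video_fc = nn.Linear(d_video, d)   # CLIP ViT-B/32 frame features
        self.audio_fc = nn.Linear(d_audio, d)   # AST Mel-spectrogram segment features
        # Hypothetical heads: relevance score p_s and moment (center, width)
        self.match_head = nn.Linear(2 * d, 1)
        self.moment_head = nn.Linear(2 * d, 2)

    def forward(self, frame_feats, audio_feats):
        # frame_feats: (B, F, d_video); audio_feats: (B, S, d_audio)
        v = self.video_fc(frame_feats).mean(dim=1)   # holistic video embedding (B, d)
        m = self.audio_fc(audio_feats).mean(dim=1)   # holistic music embedding (B, d)
        joint = torch.cat([v, m], dim=-1)
        p_s = torch.sigmoid(self.match_head(joint)).squeeze(-1)       # relevance of M w.r.t. v
        p_c, p_w = torch.sigmoid(self.moment_head(joint)).unbind(-1)  # normalized center / width
        return p_s, p_c, p_w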
In the single-music mode where \(M\) is known to be relevant w.r.t. \(v\), \(p_s\)
will be ignored.
In the music-set mode, we rank all candidate music tracks by their \(p_s\) and accordingly
select the moment detected from the top-ranked music track as the grounding result.
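In code, music-set inference thus reduces to ranking candidates by \(p_s\) and decoding the top track's \((p_c, p_w)\) into a time span. The helper below is a hypothetical sketch built on the skeleton above; it assumes the center / width are normalized by the track duration.

def ground_in_music_set(model, frame_feats, candidate_audio_feats, durations):
    """Rank candidate tracks by p_s and decode the best track's (p_c, p_w)
    into start / end times in seconds (assumes normalized center / width)."""
    best = None
    for audio_feats, dur in zip(candidate_audio_feats, durations):
        p_s, p_c, p_w = model(frame_feats, audio_feats)
        if best is None or p_s.item() > best[0]:
            best = (p_s.item(), p_c.item(), p_w.item(), dur)
    _, c, w, dur = best
    start = max(0.0, (c - w / 2) * dur)
    end = min(dur, (c + w / 2) * dur)
    return start, end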
Evaluation
Grounding Results
For a fair and reproducible comparison on the single-music grounding (SmG) task, we opt for SOTA video
temporal grounding models that are DETR-based and open-source.
To tackle the music-set grounding (MsG) task, a straightforward solution is to combine UVCOM+ with an existing V2MR
model, such as MVPt.
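The snippet below sketches such a two-stage composite pipeline; retrieve_track and localize_moment are hypothetical wrappers around the respective off-the-shelf models, not APIs they actually expose.

def composite_msg(video, music_collection, retrieve_track, localize_moment):
    """Illustrative two-stage composite baseline: first retrieve the most
    relevant track with a V2MR model (e.g., MVPt), then localize the moment
    within it with a temporal grounding model (e.g., UVCOM+)."""
    track = retrieve_track(video, music_collection)   # stage 1: V2MR
    start, end = localize_moment(video, track)        # stage 2: moment detection
    return track, (start, end)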
Overall results.
#Params excludes the (weight-frozen) video / audio encoders.
We further conduct a human evaluation on a random subset of 100 test videos to subjectively
assess the quality of the MsG results.
Human evaluation results.
Generated BGM Comparisons
Original background music
BGM generated by MaDe (ours)
Composite solution
Ablation Study
To better understand the superior performance of MaDe, we conduct an ablation study as follows.
Ablation study of MaDe.
We further conduct a controlled comparison at varied training-data scales.
The result suggests that cross-attention (CA) based multimodal feature fusion is the
source of incompatibility with the MGSV task’s demands.
@inproceedings{xin2025mgsv,
title={Music Grounding by Short Video},
author={Xin, Zijie and Wang, Minquan and Liu, Jingyu and Chen, Quan and Ma, Ye and Jiang, Peng and Li, Xirong},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}
Acknowledgements
This research was supported by NSFC (No. 62172420) and Kuaishou. We thank Yingtong Liu, Yuchuan Deng, Yiyi
Chen, and Bo Wang for valuable discussion and feedback on this research.