D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching

1 MoE Key Lab of DEKE, Renmin University of China          2 Kuaishou Technology
*Corresponding Author

Illustration of video decoration with sound effects (VDSFX), which aims to automatically detect key moments in a given E-commerce video and add proper SFX to them. Note that Moment-DETR+ and $\text{R}^2\text{-Tuning}$+ are baselines we have implemented by re-purposing Moment-DETR and $\text{R}^2\text{-Tuning}$ for the new task, with their detected moments used for moment-to-SFX matching.

D&M is our proposed method, which outperforms these baselines.

Abstract

Videos showcasing specific products are increasingly important for E-commerce. Key moments naturally exist in such videos, e.g. the first appearance of a specific product, the presentation of its distinctive features, or the presence of a buying link. Adding proper sound effects (SFX) to these key moments, or video decoration with SFX (VDSFX), is crucial for enhancing user engagement. Previous studies on adding SFX to videos perform video-to-SFX matching at a holistic level and thus lack the ability to add SFX to a specific moment. Meanwhile, previous studies on video highlight detection or video moment retrieval consider only moment localization, leaving moment-to-SFX matching untouched. By contrast, we propose in this paper D&M, a unified method that accomplishes key moment detection and moment-to-SFX matching simultaneously. Moreover, for the new VDSFX task we build a large-scale dataset, SFX-Moment, from an E-commerce platform. For a fair comparison, we build competitive baselines by extending a number of current video moment detection methods to the new task. Extensive experiments on SFX-Moment show the superior performance of the proposed method over the baselines.

Framework

Conceptual diagram of our proposed D&M method for VDSFX. The example video consists of $n=30$ frames with $m=9$ subtitles. "SFX0" is a special token indicating "no SFX". The ASR module and the visual / textual / audio backbones, i.e., ViT / RoBERTa / AST, are all frozen. Non-trainable blocks are shown in gray.
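To make the role of the "SFX0" token concrete, here is a minimal sketch of how per-frame SFX predictions could be turned into key moments. It rests on assumptions not stated on this page: that a model emits one SFX index per frame, that index 0 stands for "SFX0", and that consecutive frames sharing the same non-zero index form one moment. The function name `decode_moments` and the example labels are hypothetical and are not taken from the D&M implementation.

```python
from itertools import groupby

NO_SFX = 0  # assumed index of the special "SFX0" token ("no SFX")


def decode_moments(frame_labels):
    """Group consecutive frames with the same non-zero SFX index.

    Returns a list of (start_frame, end_frame, sfx_id) tuples,
    where end_frame is inclusive. This is an illustrative decoding
    rule, not the procedure used by D&M.
    """
    moments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != NO_SFX:
            moments.append((start, start + length - 1, label))
        start += length
    return moments


# Hypothetical per-frame predictions for a video with n = 30 frames.
labels = [0] * 5 + [3] * 4 + [0] * 10 + [7] * 3 + [0] * 8
moments = decode_moments(labels)
```

Under these assumptions, frames labelled 0 are simply skipped, so a video with no key moments decodes to an empty list.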

Dataset: SFX-Moment

Collection pipeline of our SFX-Moment dataset.

Basic statistics of our SFX-Moment dataset. Note that a specific sound effect from a given SFX set can be applied to multiple videos. We consider a closed set of SFX.

Visualization of data statistics of SFX-Moment.

Quantitative results of D&M

Qualitative results of D&M

Video Decoration Result