Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation

Accepted by International Conference on Computer Vision (ICCV'21)

Jiabo Huang Yang Liu Shaogang Gong Hailin Jin

Queen Mary University of London Peking University Adobe Research



Video activity localisation by natural language is an important yet challenging task: it aims to temporally localise the video segment that best corresponds to a query sentence in an untrimmed (and often unstructured) video. Most existing methods address this task in a fully supervised manner, learning to localise moments-of-interest (MoIs) in videos according to their precise start and end time indices. Given the high annotation cost and subjective annotation bias, recent works focus on weakly-supervised learning without per-sentence temporal boundary annotations in training.

Existing weakly-supervised solutions localise different MoIs individually. This is suboptimal because it neglects the fact that cross-sentence relations in a paragraph play an important role in temporally localising multiple MoIs. Critically, an individual sentence is sometimes ambiguous outside its paragraph context, and the MoIs described by a paragraph are often semantically related to each other through their corresponding sentences.

In this work, we introduce a weakly-supervised method for video activity localisation by natural language called Cross-sentence Relations Mining (CRM). The key idea is to explore the cross-sentence relations in a paragraph as constraints to better interpret and match complex moment-wise temporal and semantic relations in videos. Specifically, by assuming that different activities in a video are described sequentially, we formulate a temporal consistency constraint that encourages the selected moments to be temporally ordered according to their descriptions in the paragraph. Moreover, we encourage moment proposal selections to satisfy the broader cross-sentence semantics in context, so as to minimise video-text matching ambiguities. To that end, we introduce a semantic consistency constraint ensuring that the moment selected for any pairing (concatenation) of two sentences in a paragraph is consistent with (overlaps) the union of the segments selected per sentence.
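The two constraints above can be illustrated with a minimal sketch. The function names and the (start, end) moment representation below are hypothetical simplifications for illustration, not CRM's actual implementation: the first helper counts adjacent sentence pairs whose selected moments violate the paragraph's temporal order, and the second measures the semantic consistency of a pair-level selection as its temporal IoU against the union span of the two per-sentence selections.

```python
def temporal_order_violations(moments):
    """Count adjacent sentence pairs whose selected moments break
    the paragraph order (hypothetical helper).

    `moments` is a list of (start, end) tuples, one per sentence,
    in the order the sentences appear in the paragraph.
    """
    violations = 0
    for i in range(len(moments) - 1):
        start_i, _ = moments[i]
        start_next, _ = moments[i + 1]
        # A later sentence's moment should not start before an earlier one's.
        if start_next < start_i:
            violations += 1
    return violations


def semantic_consistency_iou(pair_moment, moment_a, moment_b):
    """Temporal IoU between the moment selected for a concatenated
    sentence pair and the union span of the two per-sentence moments."""
    union_start = min(moment_a[0], moment_b[0])
    union_end = max(moment_a[1], moment_b[1])
    inter = max(0.0, min(pair_moment[1], union_end)
                - max(pair_moment[0], union_start))
    union = max(pair_moment[1], union_end) - min(pair_moment[0], union_start)
    return inter / union if union > 0 else 0.0
```

In training, one could penalise a positive violation count and reward a high pair-level IoU, steering proposal selection towards orderings and spans consistent with the paragraph.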

Our contributions are: (1) To the best of our knowledge, this is the first attempt to develop a model that uses cross-sentence relations in a paragraph to explicitly represent and compute cross-moment relations in videos, so as to alleviate the ambiguity of each individual sentence in video activity localisation. (2) We formulate a new weakly-supervised method for activity localisation by natural language, called Cross-sentence Relations Mining (CRM), which trains a model with both temporal and semantic cross-sentence relations to improve per-sentence temporal boundary prediction in testing. (3) Our approach achieves state-of-the-art performance on two activity localisation benchmarks, especially given more complex query descriptions.


Experiments conducted on two challenging video activity localisation benchmarks demonstrate the compelling multi-modal understanding ability of CRM compared with a wide range of state-of-the-art approaches.

Please refer to the paper for more details, and feel free to reach out with any questions.