Description

Self-supervised video representation learning algorithms, such as pretext task learning, contrastive learning, and multimodal learning, have made significant progress in extracting features that generalize well to downstream video benchmarks. All of these algorithms rely on underlying view transformations, yet how those transformations affect their performance has not been thoroughly explored. In this work, we investigate the effect of many different spatial, temporal, and visual view transforms on pretext task learning and contrastive learning. We provide a detailed analysis of the performance of these methods on video action recognition, and investigate how different methods compare by combining the learned features of several models pretrained with different learning algorithms and/or view transforms. In our setup, certain combinations of pretraining algorithms and view transforms outperform supervised training alone on the UCF-101 and HMDB action recognition datasets, but underperform some current state-of-the-art methods.
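To make the role of view transformations concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how spatial, temporal, and visual transforms could be composed to produce two views of the same clip for contrastive pretraining. The tensor layout, parameter ranges, and the helper functions (temporal_crop, spatial_crop_resize, visual_jitter, make_view) are assumptions for illustration only.

# Sketch: composing temporal, spatial, and visual view transforms to build
# a positive pair of clip views for contrastive learning. All helpers and
# parameter choices here are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def temporal_crop(clip: torch.Tensor, out_frames: int) -> torch.Tensor:
    # Temporal transform: sample a random contiguous sub-clip. clip: (T, C, H, W).
    t = clip.shape[0]
    start = torch.randint(0, t - out_frames + 1, (1,)).item()
    return clip[start:start + out_frames]


def spatial_crop_resize(clip: torch.Tensor, out_size: int) -> torch.Tensor:
    # Spatial transform: random square crop, resized back to out_size x out_size.
    _, _, h, w = clip.shape
    crop = torch.randint(out_size // 2, min(h, w) + 1, (1,)).item()
    y = torch.randint(0, h - crop + 1, (1,)).item()
    x = torch.randint(0, w - crop + 1, (1,)).item()
    patch = clip[:, :, y:y + crop, x:x + crop]
    return F.interpolate(patch, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)


def visual_jitter(clip: torch.Tensor) -> torch.Tensor:
    # Visual transform: simple brightness/contrast jitter, shared across frames.
    brightness = 1.0 + (torch.rand(1).item() - 0.5) * 0.4
    contrast = 1.0 + (torch.rand(1).item() - 0.5) * 0.4
    mean = clip.mean()
    return ((clip - mean) * contrast + mean) * brightness


def make_view(clip: torch.Tensor, out_frames: int = 8, out_size: int = 112) -> torch.Tensor:
    view = temporal_crop(clip, out_frames)
    view = spatial_crop_resize(view, out_size)
    if torch.rand(1).item() < 0.5:          # random horizontal flip
        view = torch.flip(view, dims=[3])
    return visual_jitter(view).clamp(0.0, 1.0)


if __name__ == "__main__":
    clip = torch.rand(16, 3, 128, 171)      # dummy clip: (T, C, H, W)
    view_a, view_b = make_view(clip), make_view(clip)   # positive pair from one clip
    print(view_a.shape, view_b.shape)       # both: torch.Size([8, 3, 112, 112])

In a contrastive setup such as the ones studied here, view_a and view_b would be encoded by the video backbone and pulled together in embedding space while being pushed away from views of other clips; varying which of these transforms are applied is the kind of ablation the work investigates.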
