Long-form video understanding remains one of the enduring open problems in computer vision. While the natural world offers long, continuous streams of visual stimuli, most computer vision systems still operate within a limited temporal scope, typically just a few seconds in both input and output. This thesis presents my work developing the neural machinery, i.e., the algorithms, architectures, and datasets, that extends the temporal capacity of video understanding systems to minutes and beyond. I start by presenting my work on algorithms for long-term multimodal human motion forecasting, termed PECNet and Y-net. Next, I introduce my contributions to neural architectures, namely hierarchical, temporally scalable, and memory-efficient designs for understanding long-form videos, in the form of MViT and Rev-ViT. Finally, I close by presenting my work on EgoSchema, the first certifiably long-form video-language dataset, which serves as a benchmark for evaluating the long-form understanding capabilities of multimodal models. The benchmark results on EgoSchema highlight the performance gap between current state-of-the-art models and human-level long-form video understanding. I believe that the presented advancements in algorithms, architectures, and datasets not only address several existing limitations but also open new avenues for future research and application.



