We present a system for finding celebrities in videos that uses face information in conjunction with text or speech. Compared with searching transcripts or speech alone, it approximately triples search precision. Our work is motivated by the recent growth of personal video recording devices such as TiVo, which makes watching television more like information retrieval. We use a large dataset of 13.5 hours of commercial video, which presents a challenging environment for speech and face recognition. Faces are extracted with a face detector and processed with kernel PCA and LDA for use in one-vs-many SVM face classifiers. We evaluate two scenarios: one in which transcripts are provided, and a more difficult one in which speech is the only language cue. Wordspotting over audio is done using a combination of an HMM and an SVM. We demonstrate our system's improved retrieval under realistic conditions using video recorded directly from television.



