This paper describes a system that can annotate a video sequence with: a description of the appearance of each actor; when the actor is in view; and a representation of the actor's activity while in view. The system does not require a fixed background, and is automatic. The system works by tracking people in 2D, lifting the tracks to 3D and then classifying the lifted tracks by comparison with a set of manually annotated human motions. The tracker clusters potential body segments to build an appearance model of each actor and then identifies the best match to each model in each frame. The lifting process uses a scaled orthographic camera model combined with a camera motion model to identify the best matching 3D motion example. Finally, this example is used to identify the activity of the body. Activities are classified by matching to a collection of motion capture data that has been annotated by hand, using a class structure that describes everyday motions and allows motion annotations to be composed -- one may jump while running, for example. Descriptions computed from video of real motions show that the method is accurate.
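The lifting step can be illustrated with a minimal sketch. It assumes each 2D track frame and each 3D motion example is reduced to an array of corresponding joint positions, and that camera motion is approximated by a coarse search over rotations about the vertical axis. All function names, the joint representation, and the angle grid are hypothetical and are not taken from the system described above. The sketch only shows how a scaled orthographic camera can score candidate 3D examples against a 2D observation.

```python
import numpy as np

def project_scaled_orthographic(joints_3d, scale, translation):
    """Scaled orthographic projection: drop depth, scale uniformly, translate."""
    return scale * joints_3d[:, :2] + translation

def fit_scale_translation(joints_3d, joints_2d):
    """Closed-form least-squares scale and translation for the projection.

    The sign of the scale is not constrained in this sketch.
    """
    xy = joints_3d[:, :2]
    xy_centered = xy - xy.mean(axis=0)
    uv_centered = joints_2d - joints_2d.mean(axis=0)
    denom = np.sum(xy_centered ** 2)
    scale = np.sum(xy_centered * uv_centered) / denom if denom > 0 else 1.0
    translation = joints_2d.mean(axis=0) - scale * xy.mean(axis=0)
    return scale, translation

def rotate_about_vertical(joints_3d, angle):
    """Rotate joints about the y axis (assumed here to be vertical)."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return joints_3d @ rotation.T

def best_matching_example(joints_2d, examples, n_angles=36):
    """Return (error, example index, angle) with the lowest reprojection error."""
    best = (np.inf, None, None)
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    for index, example in enumerate(examples):
        for angle in angles:
            rotated = rotate_about_vertical(example, angle)
            scale, translation = fit_scale_translation(rotated, joints_2d)
            projected = project_scaled_orthographic(rotated, scale, translation)
            error = float(np.sum((projected - joints_2d) ** 2))
            if error < best[0]:
                best = (error, index, angle)
    return best
```

In a sketch like this, the activity label of the winning example would then be transferred to the frame. Matching whole track segments rather than single frames would be a natural extension, but is not shown.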



