Summary:
This paper asks whether gestures that are not tied to a specific meaning are speaker independent, or whether they are determined by the speaker's view of a discourse topic. To answer this, the authors use a new characterization of gestures in terms of spatiotemporal interest points, which extracts a low-level representation of gestures from video without manual intervention. The interest points are grouped into categories called codewords. The distribution of these codewords across speakers and topics forms the basis for the analysis.
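To make the codeword idea concrete, here is a minimal sketch of how feature vectors could be mapped to codewords and counted per speaker and per topic. It assumes the codebook (one center per codeword) already exists. The function names, the Euclidean nearest-center rule, and the toy data are my own illustration, not details taken from the paper.

```python
from collections import Counter

def nearest_codeword(vec, codebook):
    """Return the index of the codebook center closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = None, float("inf")
    for idx, center in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, center))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def codeword_histograms(features, codebook):
    """Count codeword occurrences per speaker and per topic.

    features: iterable of (speaker, topic, vector) tuples (hypothetical layout).
    Returns two dicts mapping speaker/topic to a Counter of codeword indices.
    """
    by_speaker, by_topic = {}, {}
    for speaker, topic, vec in features:
        word = nearest_codeword(vec, codebook)
        by_speaker.setdefault(speaker, Counter())[word] += 1
        by_topic.setdefault(topic, Counter())[word] += 1
    return by_speaker, by_topic

# Toy data: two codewords in a 2-D feature space, three interest points.
codebook = [(0.0, 0.0), (10.0, 10.0)]
features = [
    ("speaker1", "topicA", (1.0, 0.0)),
    ("speaker1", "topicB", (9.0, 9.0)),
    ("speaker2", "topicA", (0.0, 1.0)),
]
by_speaker, by_topic = codeword_histograms(features, codebook)
```

Comparing the per-topic counts with the per-speaker counts is the kind of evidence the analysis builds on, although the paper does this with a probabilistic model rather than raw histograms.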
During spatiotemporal interest point extraction, principal component analysis is used to reduce the dimensionality of the data and form the final feature vectors describing the gestures. A hierarchical Bayesian model is then used to cluster the gesture features and determine whether each gesture is related to the speaker or to the discourse topic. The hidden variables in the Bayesian model are inferred with Gibbs sampling, which is guaranteed to converge to the true distribution over the hidden variables in the limit.
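To get a feel for what Gibbs sampling does here, I wrote a heavily simplified toy version. It is not the paper's model: it assumes one topic and one speaker with fixed, known codeword distributions, a single hidden flag per gesture (topic or speaker), and a uniform Beta(1, 1) prior on the mixing proportion. All names and numbers are hypothetical.

```python
import random

def gibbs_topic_fraction(tokens, p_topic, p_speaker, n_iter=200, burn_in=50, seed=0):
    """Estimate the fraction of codeword tokens attributed to the topic.

    tokens: list of codeword ids.
    p_topic, p_speaker: dicts mapping codeword id to probability (assumed known).
    Uses a collapsed Gibbs sampler with a Beta(1, 1) prior on the mixing proportion.
    """
    rng = random.Random(seed)
    n = len(tokens)
    z = [rng.random() < 0.5 for _ in tokens]  # True means "drawn from the topic"
    n_topic = sum(z)
    total, kept = 0.0, 0
    for it in range(n_iter):
        for i, word in enumerate(tokens):
            # Remove token i from the counts before resampling its flag.
            n_topic -= int(z[i])
            n_speaker = (n - 1) - n_topic
            # Conditional weights: (count + prior) times the likelihood of the codeword.
            w_topic = (n_topic + 1) * p_topic[word]
            w_speaker = (n_speaker + 1) * p_speaker[word]
            z[i] = rng.random() < w_topic / (w_topic + w_speaker)
            n_topic += int(z[i])
        if it >= burn_in:
            total += n_topic / n
            kept += 1
    return total / kept

# Toy distributions: codeword 0 is typical of the topic, codeword 1 of the speaker.
p_topic = {0: 0.9, 1: 0.1}
p_speaker = {0: 0.1, 1: 0.9}
mostly_topic = gibbs_topic_fraction([0] * 18 + [1] * 2, p_topic, p_speaker)
mostly_speaker = gibbs_topic_fraction([1] * 18 + [0] * 2, p_topic, p_speaker)
```

Each pass resamples every hidden flag conditioned on all the others, and averaging after a burn-in period approximates the posterior. The real model also has to learn the topic and speaker distributions themselves, which this sketch skips.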
An experiment analyzed 33 short videos of 15 different speakers covering five different topics. Four of the topics were mechanical devices, since direct gestural metaphors would likely show up in the discussion of such items. The study concluded that gestures are topic dependent and that this relationship overshadows the influence of specific speakers. With correct topic labels, 12% of gestures were related to the topic, but with randomly assigned topic labels, fewer than 3% of gestures were considered topic-specific.
Discussion:
The interest point features that were extracted exist at a lower level of abstraction than most other attempts at describing gestures. It is interesting that such low-level attributes are sufficient to perform successful recognition.
I think limiting the topics mainly to mechanical devices could have made topic-specific gestures easier to spot. If more general categories had been chosen, I think the topic-specific nature of the gestures would have been reduced.
I was not familiar with all of the mathematics discussed in the paper, so I'm not sure how sound their technique is from a theoretical point of view. Looking at the pictures of identified interest points and the frames before and after the interest point convinced me that the feature extraction must work in some cases.