Summary:
This paper presents a system that uses video input from two cameras to track head and hand positions in three dimensions. The tracked positions are used to construct view-invariant features from the recorded motions. The vision system tracks head and hand position to an accuracy of two centimeters at thirty frames per second. The HTK 2.0 HMM toolkit from Entropic Research was used to analyze the feature vectors produced by the tracking system.
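HTK is a standalone toolkit with its own training and recognition tools, so it can't be reproduced here; purely as an illustrative stand-in, a per-gesture continuous-HMM recognizer along the same lines might look like the following sketch in Python with hmmlearn (an assumed substitute for HTK, not the paper's actual tooling; the state count and iteration limit are arbitrary):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed stand-in for HTK

def make_gesture_hmm(n_states=5):
    # Forward-chaining (left-to-right) continuous HMM: each state can
    # only loop on itself or advance to the next state.
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                      n_iter=20, init_params="mc", params="mct")
    hmm.startprob_ = np.eye(n_states)[0]      # always start in state 0
    trans = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        trans[i, i] = trans[i, i + 1] = 0.5
    trans[-1, -1] = 1.0
    hmm.transmat_ = trans
    return hmm

def train_models(training_data):
    # training_data: {gesture_name: [feature sequences, each (T_i, D)]}
    return {name: make_gesture_hmm().fit(np.vstack(seqs),
                                         [len(s) for s in seqs])
            for name, seqs in training_data.items()}

def classify(models, features):
    # Pick the gesture whose HMM assigns the sequence the highest
    # log-likelihood.
    return max(models, key=lambda name: models[name].score(features))
```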
A set of eighteen T'ai Chi gestures was used in an experiment to test the system. Each gesture is modeled as a forward-chaining continuous HMM, using some function of the head and hand coordinates as features. Several such functions are examined, including Cartesian positions (x, y, z) and velocities (dx, dy, dz), polar positions (r, theta, z) and velocities (dr, dTheta, dz), and curvature (ds, log(rho), dz). The Cartesian feature sets were evaluated both with and without the head data.
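To make the velocity features concrete, here is a minimal sketch in Python with NumPy (my tooling choice, not the paper's) of how the Cartesian and polar velocity features might be computed from a tracked trajectory; treating the vertical axis through the body center as the polar axis is my assumption about the coordinate frame:

```python
import numpy as np

FPS = 30.0  # resampled frame rate reported in the paper

def cartesian_velocity(points):
    # (dx, dy, dz): frame-to-frame differences of the (T, 3) trajectory.
    return np.diff(points, axis=0) * FPS

def polar_velocity(points):
    # (dr, dTheta, dz), with r and theta measured about the vertical
    # (z) axis -- assumed here to pass through the body center.
    r = np.hypot(points[:, 0], points[:, 1])
    theta = np.unwrap(np.arctan2(points[:, 1], points[:, 0]))
    return np.column_stack(
        [np.diff(r), np.diff(theta), np.diff(points[:, 2])]) * FPS
```

A full feature vector would concatenate the per-trajectory features for the head and both hands at each frame.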
The feature vector with the best overall recognition rate was polar velocity (dr, dTheta, dz) at 95%, followed closely by polar velocity with radius-scaled theta at 93% and polar velocity augmented with redundant Cartesian velocity data at 92%. Not surprisingly, the authors conclude that Cartesian velocity features perform better on data containing translational shifts, while polar velocity features perform better when the data contain rotations.
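The invariance argument behind that conclusion is easy to check numerically: translating a trajectory leaves its Cartesian velocities untouched, while rotating it about the vertical axis leaves the polar velocities untouched. A toy check on a made-up trajectory, continuing the feature sketch above (cartesian_velocity and polar_velocity as defined there):

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 60)
spiral = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])  # toy trajectory

a = np.radians(40.0)                      # arbitrary rotation angle
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0,        0.0,       1.0]])
rotated = spiral @ R.T                    # view rotated about the z axis
shifted = spiral + np.array([0.5, -0.2, 0.0])  # translated view

print(np.allclose(cartesian_velocity(shifted), cartesian_velocity(spiral)))  # True
print(np.allclose(cartesian_velocity(rotated), cartesian_velocity(spiral)))  # False
print(np.allclose(polar_velocity(rotated), polar_velocity(spiral)))          # True
```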
Discussion:
T'ai Chi was an interesting choice of gestures to recognize. Judging from the gesture images in Figure 1, the head position remains relatively fixed throughout all the gestures. If the head barely moves, head position does not seem like a very useful feature for recognizing these gestures.
The T'ai Chi gestures were performed seated due to space constraints, which undoubtedly altered the head movement that would normally accompany the gestures. Performing them while standing would most likely produce more translation and might allow the Cartesian velocity feature to achieve higher recognition rates.
The data were resampled to a rate of 30 Hz, although the actual sample rate varied from 15 to 30 Hz. A cubic spline was used for the resampling; it would be interesting to see the effect on the data and the recognition rates if different fitting curves had been used.
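For anyone wanting to try that, here is a minimal resampling sketch using SciPy (my tooling choice; the paper does not say how its spline was fit), where swapping CubicSpline for, say, scipy's PchipInterpolator or plain linear interpolation would be the one-line change needed for the comparison:

```python
import numpy as np
from scipy.interpolate import CubicSpline  # assumed stand-in for the paper's spline

def resample_to_30hz(timestamps, positions):
    # timestamps: (T,) seconds at an irregular 15-30 Hz rate;
    # positions: (T, 3) tracked coordinates.
    spline = CubicSpline(timestamps, positions, axis=0)
    uniform = np.arange(timestamps[0], timestamps[-1], 1.0 / 30.0)
    return uniform, spline(uniform)
```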
Shifted and rotated versions of the gesture data were collected during the experiment, but some of the rotation was added synthetically. I wonder why the shifted and rotated data were collected at all if they could have been generated synthetically as well.