Thursday, May 8, 2008

Real-Time Locomotion Control by Sensing Gloves (Komura 2006)

Summary:
The paper focuses on a method of mapping a user's hand motion to the locomotion of virtual characters in real time. The proposed method allows control of a multi-joint character through the motion of the fingers, whose joint angles are detected by a sensing glove. Animations are recorded via motion capture or generated using 3D modeling software and played back to the user. The user mimics the animation by moving their hand, and that motion is used to generate a function that maps the hand motion to the standard animation. New locomotion animations can then be generated in real time when the user's hand moves in ways not necessarily within the domain of the motion used to generate the mapping function.
When calculating the correspondence between finger motion and animation motion, cyclic locomotion is assumed during the calibration stage. The period of each degree of freedom is calculated by finding the auto-correlation of that degree of freedom's trajectory. During play, the period of each joint in the animation is compared to the period of the finger motion, and the joints are classified as full cycle, half cycle, or exceptional. The hand's generalized coordinates are matched with the character's. Relative velocities of the character's end effectors with respect to its root are compared with relative velocities of the fingertips with respect to the wrist to match each end effector with a finger. The motion of a joint is determined by the motion of the finger(s) associated with the end effector(s) that are its descendants.
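As a rough sketch of the period-estimation step (my own illustration, assuming each degree of freedom is a uniformly sampled joint-angle trajectory; the paper's actual implementation may differ):

    import numpy as np

    def estimate_period(trajectory, min_lag=10):
        """Return the lag (in samples) at which the auto-correlation peaks,
        skipping lags below min_lag so the trivial peak at lag 0 is ignored."""
        x = np.asarray(trajectory, dtype=float)
        x = x - x.mean()                        # remove the DC offset
        corr = np.correlate(x, x, mode="full")
        corr = corr[len(x) - 1:]                # keep non-negative lags only
        return min_lag + int(np.argmax(corr[min_lag:]))

The estimated periods of a finger joint and of an animation degree of freedom can then be compared after converting lags to seconds using each signal's sampling rate.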
To test the method, mapping functions were generated for both human and dog walking animations. The animations were successfully extrapolated into running animations in real time. An experiment involving four subjects was conducted to compare the speed and number of collisions when navigating a game character through a maze. On average, keyboard-controlled locomotion took 9% less time but resulted in more than twice as many collisions as sensing-glove-controlled locomotion.

Discussion:
An extension that would improve the realism of the motion is animation blending. Currently, the system plays a single animation at different speeds controlled by finger speed. However, a walking animation played back at double its intended rate does not look the same as a running animation. Animation blending uses a weighting function to combine the joint angles of two or more animations so that the resulting frames transition naturally between them as speed changes.
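A minimal sketch of the idea, assuming two pre-aligned clips and a linear interpolation of joint angles (not the paper's method, and the speed thresholds are made up):

    import numpy as np

    def blend_pose(walk_pose, run_pose, speed, walk_speed=1.2, run_speed=3.0):
        """walk_pose, run_pose: arrays of joint angles for the same skeleton
        at the same phase of the gait cycle; speed in meters per second."""
        w = np.clip((speed - walk_speed) / (run_speed - walk_speed), 0.0, 1.0)
        return (1.0 - w) * np.asarray(walk_pose) + w * np.asarray(run_pose)

A real implementation would interpolate rotations properly (e.g., with quaternions) and time-align the phases of the two clips before blending.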
During the control stage, when finger motion exceeds the original bounds of the mapping function generated in the calibration stage, the mapping function is extrapolated over a limited space outside of the original domain. Instead of considering only the tangent (just the first derivative) of the mapping function at its endpoint, higher-order derivatives could be considered to obtain a more accurate estimate of motion outside the original domain.
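For example, a second-order extrapolation of a sampled mapping function could look like the following sketch, with derivatives approximated by finite differences (illustrative only, not the paper's formulation):

    import numpy as np

    def extrapolate(samples, dt, t_beyond):
        """samples: at least three function values at uniform spacing dt;
        t_beyond: how far past the last sample to evaluate."""
        f = np.asarray(samples, dtype=float)
        f1 = (f[-1] - f[-2]) / dt                    # first derivative at the endpoint
        f2 = (f[-1] - 2.0 * f[-2] + f[-3]) / dt**2   # second derivative at the endpoint
        return f[-1] + f1 * t_beyond + 0.5 * f2 * t_beyond**2

The quadratic term lets the extrapolated motion curve away from the straight tangent line, which should stay closer to the true mapping for small excursions outside the domain.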
Controlling the jumping motion of a character with the hand seems like a strange gameplay choice. If hand position were mapped to character height, then holding the hand in the air would allow the player to hover, which is not realistic. If jumping were simulated in a physically realistic way, then mapping hand position to character height becomes pointless, since all hand-height information would be discarded except for the start time of the jump.

The 3D Tractus: A Three-Dimensional Drawing Board (Lapides 2006)

Summary:
The 3D Tractus is a system for creating three dimensional sketches whose interface consists of a 2D drawing surface that slides up and down to incorporate the third dimension. A user sitting at the Tractus draws on the tabletop surface with one hand and slides the tabletop up and down with the other. The goal of directly mapping virtual and physical space with the tabletop height is achieved with a string potentiometer-based sensor. A Toshiba Portege M200 Tablet PC was used as the interactive top of the 3D Tractus table. A counterweight system was used to ease the act of adjusting table height.
The software interface for the Tractus incorporates a 2D drawing area and a 3D overview window. The drawing area accepts pen-based input, and the 3D view can be rotated by dragging the pen in the desired direction. Multiple approaches to expressing depth on the 2D drawing surface were attempted. Changing color and color intensity in response to line depth did not provide intuitive feedback with enough contrast to communicate distance accurately. The current implementation uses varying line width and a perspective projection to convey depth. Lines above the drawing plane are not rendered.
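A hypothetical sketch of the width-based depth cue, with made-up constants (the paper does not give its exact mapping):

    def line_width(depth_below_plane, max_width=4.0, min_width=0.5, falloff=0.02):
        """depth_below_plane: distance of the stroke below the current drawing
        surface, in millimeters; returns a stroke width in pixels."""
        width = max_width / (1.0 + falloff * max(depth_below_plane, 0.0))
        return max(width, min_width)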
Three students with art backgrounds evaluated the system for 30 minutes and were asked for feedback. The most requested feature was the ability to tilt the drawing surface so work could be done from any angle, similar to how the 3D view can be rotated.

Discussion:
One of the programs discussed in the related work section allows a user to select a slice of their 3D object, pull it out of the stack, and edit it. This approach would be unmanageably tedious when working in the direction normal to the surface of the slices. Even if the Tractus doesn't completely remove the problem of working along the normal of the slice surfaces, it greatly reduces the time required to do so by removing the steps of selecting, removing, and replacing slices in the stack.
Directly mapping physical and virtual space limits the detail of sketches; finer detail could be achieved more easily if zoom functionality were available. The ability to scale the mapping would allow the user to work at multiple levels of detail. Another improvement could be the use of stereo vision to give the impression of depth on the 2D drawing area. Also, the addition of a button to toggle the rendering of lines above the drawing plane would aid sketch construction in some cases.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning (Lieberman 2007)

Summary:
The TIKL system is proposed as an additional channel for learning motor skills, providing low-latency feedback through a vibrotactile feedback suit. The Vicon tracking system uses twelve high-speed infrared cameras to record marker positions so that the markers' 3D positions can be reconstructed and joint angles calculated. Differences in joint angle between the teacher and learner are communicated to the learner through timed tactile pulses that exploit the sensory saltation effect. Corrective rotation is communicated by a sequence of pulses across actuators attached in a ring around the arm.
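As an illustrative sketch (not the TIKL firmware), a signed joint-angle error could be turned into a saltation-style pulse sequence that travels around the ring of actuators in the corrective direction:

    def saltation_sequence(angle_error_deg, n_actuators=8, pulses=3, interval_ms=60):
        """Return (actuator_index, start_time_ms) pairs; the direction of travel
        around the ring follows the sign of the error. All constants are assumptions."""
        step = 1 if angle_error_deg > 0 else -1
        return [((i * step) % n_actuators, i * interval_ms) for i in range(pulses)]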
An experiment was run to test how users' ability to learn motor tasks was affected by the tactile feedback provided by TIKL. Using deviations in joint angle as the metric for error, forty subjects' reactions during training sequences were measured. Real-time errors were reduced by up to 27% and learning was improved by up to 23%. Subjects reported no significant loss of comfort due to the wearable system.

Discussion:
One downside of the TIKL system is its reliance on the expensive and bulky Vicon system. Currently, only joint-angle errors are used to determine feedback signal strength, which may not be the best measurement of task fulfillment. Other measures of error, such as end-effector position, probably describe error better for some tasks. However, correcting end-effector position alone can be hard, since multiple joint configurations can result in the same end position.
The users participating in the study were told to mimic joint angles rather than some other metric. The error reporting assumes that people easily grasp the concept of mimicking angles and remember to follow only that metric throughout the study. I don't think people naturally mimic only joint angles, so some error was probably introduced by using joint angle as the sole error metric. In regard to the results section, polling users seems like a somewhat subjective way to measure qualities of the system; a more quantitative study could have been designed.

A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequences (Bernardin 2005)

Summary:
This paper proposes a system that uses information about both hand shape and contact points, obtained from a combination of a data glove and tactile sensors, to recognize continuous human grasp sequences. The long-term goal of the work is to teach a robotic system to perform a task simply by observing a human teacher instead of explicitly programming the robot. For this kind of learning to take place, the robot must be able to infer what has been done and map that to a known skill that can be described symbolically. This paper analyzes grasping gestures in order to aid the construction of such symbolic descriptions.
An 18-sensor CyberGlove is used in conjunction with an array of 16 capacitive pressure-sensitive sensors affixed to the fingers and palm. Classification of grasping gestures is made according to Kamakura's grasp taxonomy, which identifies 14 different kinds of grasps. The regions of the hand covered by pressure sensors were chosen to maximize the detection of contact with a minimal number of sensors while also corresponding to the main contact regions in Kamakura's grasp types. An HMM was built for each type of grasp using 112 training gestures. The HMMs have a flat topology with 9 states and were trained offline.
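A rough sketch of per-class HMM recognition in this spirit, using the hmmlearn library (the paper used its own HMM tooling; the 34-dimensional feature vector of 18 joint angles plus 16 tactile readings is my assumption about how the inputs would be stacked):

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_grasp_models(training_data, n_states=9):
        """training_data: dict mapping grasp label -> list of (T, 34) feature sequences."""
        models = {}
        for label, sequences in training_data.items():
            X = np.vstack(sequences)                  # concatenate all sequences
            lengths = [len(s) for s in sequences]     # sequence boundaries for the HMM
            m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
            m.fit(X, lengths)
            models[label] = m
        return models

    def classify(models, sequence):
        """Pick the grasp whose HMM assigns the highest log-likelihood."""
        return max(models, key=lambda label: models[label].score(sequence))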
The main benefit gleaned from the tactile sensors was that segmentation of the continuous grasp sequence could be performed more easily.

Discussion:
The customization required of the capacitive pressure sensors indicates that there is not currently a mass-produced component to fill the demand for grasp detection hardware. In the description of the HMM recognizer, it is mentioned that a task grammar was used to reduce the search space of the recognizer. Since only grasp and release sequences are recognized, the segmentation problem is avoided.
If the end goal is to teach a robot to learn grasps by observation, I think an experiment that used both visual-based and glove-based inputs would be required to discern a link between the visual and tactile realms. The visual signal could be analyzed and possibly mapped to a tactile response.

Articulated Hand Tracking by PCA-ICA Approach (Kato 2006)

Summary:
The method proposed in this paper for tracking finger and hand motions begins by performing principal component analysis to reduce the number of feature dimensions under consideration. Independent component analysis is then performed on the lower-dimensional result to obtain intrinsic feature vectors. The ICA technique finds a linear, non-orthogonal coordinate system in multivariate data, with the goal of performing a linear transformation that makes the resulting variables as statistically independent as possible.
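A sketch of the PCA-then-ICA reduction using scikit-learn (the paper's own implementation and dimensions differ; five components are used here only to match the five-finger discussion below):

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    def pca_ica_basis(joint_angle_data, n_components=5):
        """joint_angle_data: (n_frames, n_joint_angles) array of hand poses."""
        pca = PCA(n_components=n_components)
        reduced = pca.fit_transform(joint_angle_data)     # drop low-variance dimensions
        ica = FastICA(n_components=n_components, random_state=0)
        sources = ica.fit_transform(reduced)              # statistically independent components
        return sources, pca, ica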
A hand was modeled in OpenGL and projected onto a captured image of a real hand to test the PCA-ICA method. The results section only states that the effectiveness of the method for tracking a hand in real image sequences was demonstrated; no numerical statistics were reported.

Discussion:
The PCA-ICA method considers only how bent the fingers are, not hand position or orientation. Recognizing hand gestures that rely on that kind of data could not depend solely on the PCA-ICA approach.
The pictures of hand motion based on ICA basis vectors look better than those based on PCA basis vectors. I think this may be because five vectors were considered. What would ICA look like if some number other than five were used? I don't know if there would be a good mapping of ICA basis vectors to hand renderings for a number of vectors other than five (one per finger).
I think one of the insights of this paper was its use of knowledge from a different field. The number of dimensions used in component analysis can be reduced by considering important observations from biomechanics. In particular, the bend of a finger's end joint can typically be described as two-thirds of the bend of the next joint along the same finger.
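Stated as a constraint, that observation removes one degree of freedom per finger, roughly:

    def distal_joint_angle(next_joint_angle):
        """Biomechanical coupling rule of thumb: the end (distal) joint bends
        about two thirds as far as the preceding joint of the same finger."""
        return (2.0 / 3.0) * next_joint_angle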

Friday, May 2, 2008

A Survey of Hand Posture and Gesture Recognition Techniques and Technology (LaViola 1999)

Summary:
This aptly named paper presents a variety of gesture and posture recognition techniques. The techniques are loosely grouped as feature extraction and modeling, learning algorithms, or miscellaneous. Template matching is one of the simplest techniques to implement and is accurate for small sets of postures, but it is not suited for gestures. Feature extraction and analysis uses a layered architecture that can handle both postures and gestures, but it can be computationally expensive if large numbers of features are extracted. Active shape models allow for real-time recognition but only track the open hand. Principal component analysis can recognize around thirty postures, but it requires training by more than one person to achieve accurate results for multiple users. Linear fingertip models are concerned only with the starting and ending points of fingertips and recognize only a small set of postures. Causal analysis uses information about how humans interact with the world to identify gestures and, therefore, can only be applied to a limited set of gestures. Neural networks can recognize large posture or gesture sets with high accuracy given enough training data; however, the training can be very time consuming, and the network must be retrained when items are added to or removed from the set to be recognized. Hidden Markov models are well covered in the literature and can be used with either a vision-based or an instrumented approach, but training HMMs can be time consuming and does not necessarily give good recognition rates. Instance-based learning techniques are relatively simple to implement, but they require a large amount of memory and computation time and are not suited for real-time recognition. Spatio-temporal vector analysis is a non-obtrusive but computationally intensive vision-based approach for which no recognition accuracy results have been reported.

Discussion:
One aspect of this paper I liked was the summary after each subsection highlighting the key points of the preceding paragraphs. We have discussed HMM-based techniques so often in class that it was refreshing to see a wider variety of approaches. This paper is good for brainstorming which techniques to use, expand on, or combine for continuing work in gesture recognition. The exercises we did in class, where a gesture was shown and then described in words by each person, gave some experience with the linguistic approach. From some of the difficulties in class, I realize that it can be hard to describe a posture in a way that is accurate and universally understandable using words alone. The linguistic approach in the paper only considered postures in which fingers were fully extended or contracted, which covers only a small subset of all possible postures. The paper says the approach is simple, but I would say it is difficult when considering a set of postures that is not tightly constrained.

Discourse Topic and Gestural Form (Eisenstein 2008)

Summary:
This paper examines whether gestures that lack a fixed, specific meaning are idiosyncratic to individual speakers or are instead determined by the speaker's view of the discourse topic. To answer the question, a new characterization of gestures in terms of spatiotemporal interest points is used to extract a low-level representation of gestures from video without manual intervention. The interest points are grouped into categories called codewords, and the distribution of these codewords across speakers and topics forms the basis for the analysis.
During the spatiotemporal interest point extraction from images, principal component analysis is used to reduce the dimensionality of the data to form the final feature vectors describing the gestures. A hierarchical Bayesian model was used to cluster gesture features and determine whether the gesture is related to the speaker or the discourse topic. To allow inference of the hidden variables in the Bayesian model, Gibbs sampling is used since it is guaranteed to converge to the true distribution over the hidden variables in the limit.
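A simplified sketch of the codeword step: PCA-reduced interest-point descriptors are clustered and each point is assigned a codeword. The paper does this within a hierarchical Bayesian model inferred by Gibbs sampling; plain k-means here only illustrates the idea of a shared codebook (dimensions and counts are placeholders):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def build_codebook(descriptors, n_dims=20, n_codewords=50):
        """descriptors: (n_points, n_features) spatiotemporal interest-point descriptors."""
        reduced = PCA(n_components=n_dims).fit_transform(descriptors)
        kmeans = KMeans(n_clusters=n_codewords, n_init=10, random_state=0)
        codewords = kmeans.fit_predict(reduced)    # codeword index per interest point
        return codewords, kmeans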
An experiment was conducted on 33 short videos of fifteen different speakers covering five different topics. Four of the topics were mechanical devices, since direct gestural metaphors were likely to show up in discussion of such items. The study concluded that gestures are topic dependent and that this relationship overshadows speaker-specific effects. With correct topic labels, 12% of gestures were related to the topic, but with randomly assigned labels, fewer than 3% of gestures were considered topic-specific.

Discussion:
The interest point features that were extracted exist at a lower level of abstraction than most other attempts at describing gestures. It is interesting that such low-level attributes are sufficient to perform successful recognition.
I think the limitation to mainly mechanical devices as topics could have influenced the ability to spot topic-specific gestures. If general categories were chosen, I think the topic-specific nature of gestures would be reduced.
I was not familiar with all of the mathematics discussed in the paper, so I'm not sure how sound their technique is from a theoretical point of view. Looking at the pictures of identified interest points and the frames before and after the interest point convinced me that the feature extraction must work in some cases.