Sunday, February 17, 2008

An Architecture for Gesture-Based Control of Mobile Robots (Iba 1999)

Summary:
The paper looks at the use of hand gestures as a way for inexperienced users to direct robots. A difficulty arises in communicating the intention of a task to a robot, rather than providing a set of actions to be mimicked. The system described in the paper interprets gestures as either local or global instructions to a robot. Raw data captured from a CyberGlove is compacted into a 20-element vector describing 10 features and their first derivatives. Each 20-dimensional vector is then quantized to one of 32 codewords. A 3-state HMM leverages the temporal structure of the gestures, turning sequences of codewords into one of six defined gestures or an unspecified gesture. The addition of an initial wait state in the HMM allows the classification of non-gestures (those whose probability is less than 0.5). The paper reports 96% accuracy in recognizing gestures using the HMM with the wait state.
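To make the preprocessing concrete, here is a minimal sketch of the feature-stacking and vector-quantization steps in Python. The codebook values, sensor readings, and sampling interval are my own placeholders, not numbers from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((32, 20))  # 32 codewords in the 20-D feature space

def feature_vector(features, prev_features, dt):
    """Stack the 10 glove features with their first derivatives -> 20-D vector."""
    f = np.asarray(features, dtype=float)
    df = (f - np.asarray(prev_features, dtype=float)) / dt
    return np.concatenate([f, df])

def quantize(x, codebook):
    """Map a 20-D vector to the index of the nearest codeword (0..31)."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

x = feature_vector(rng.standard_normal(10), rng.standard_normal(10), dt=0.01)
symbol = quantize(x, codebook)  # one of 32 discrete symbols fed to the HMM
```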

Discussion:
The key insight in this paper's technique is the introduction of a "wait state" in the HMM that allows a gesture to be classified as not recognized. Adding more gestures to be recognized would reduce the effectiveness of a wait state, so the technique may not be well suited for distinguishing individual gestures from a large collection.
The distinction between gesture recognition and gesture spotting made at the beginning of the paper is important. The exclusion of non-gestures from an input sequence is a crucial facet of gesture spotting that distinguishes it from gesture recognition. This paper helped me form a better context for the possible applications of HMMs to gesture recognition. It was a good choice so soon after the Introduction to HMM paper by Rabiner and Juang.
The paper mentions that the length of an observation sequence should be limited to maintain a high probability that the observation sequence is produced by the model. However, no guidelines are given as to how to select n, the number of most recent observations in the sequence.
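As a rough illustration of what selecting n entails, here is a hypothetical sliding window over the most recent n codewords. The scoring functions, window size, and threshold are all assumptions of mine, not details from the paper:

```python
from collections import deque

def classify_stream(symbols, score_fns, n=20, threshold=0.5):
    """For each incoming symbol, score the window of the n most recent
    observations against each gesture model and reject low scores."""
    window = deque(maxlen=n)  # keeps only the n most recent observations
    for s in symbols:
        window.append(s)
        scores = {label: f(list(window)) for label, f in score_fns.items()}
        label, best = max(scores.items(), key=lambda kv: kv[1])
        yield label if best >= threshold else None  # None marks a non-gesture
```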
It was not clear whether the waving motion to dodge obstacles temporarily placed the robot on a detour in the direction of the wave until the obstacle was passed, or permanently altered the destination of the robot. I suspect the latter was implemented, since it would be easier to divert the robot along a straight trajectory in the direction of the wave. Performing some kind of obstacle avoidance for a temporary detour could be a possible improvement to the system.

Sunday, February 3, 2008

The HoloSketch VR Sketching System (Deering 1996)

Summary:
The paper discusses a general 3D drawing program called HoloSketch, which allows interaction using a specialized 3D input device. The display for the HoloSketch system is a 20" CRT monitor with a resolution of 960x680, relatively high compared to head-mounted displays of that time. A pair of head-tracked shutter goggles allows the user to view sequential stereo images as one 3D image, similar to a hologram. Although the HoloSketch system was optimized for the stereo CRT display, it could be used with other types of displays in a variety of applications.
The primary input device is a 3D wand with six degrees of freedom. The menu system for HoloSketch takes advantage of the 3D positioning of the wand and is described as "fade-up": the geometry fades away and the menu options fade into place just in front of where the geometry was located. Commands such as movement, rotation, and finer motion control require that the wand input be combined with key presses. Elemental animation objects can be applied to primitives to animate position, orientation, scale, and color.
The paper mentions the possibility of interoperability with geometry from other 3D programs, but does not mention what specific packages or file formats are supported. The paper demonstrates that a general purpose drawing program can be extended into 3D, but suggests that the input devices required for 3D manipulation may not be accurate enough for mainstream users.

Discussion:
The extent to which accuracy was pursued in the project is commendable. I was impressed that the system corrects for changes in interocular distance due to the rotation of the viewer's eyes in their sockets.
The design of the interface could be improved in two ways:
1. While an RGB cube is a uniquely 3D choice for color selection, I think the colors not on the exterior of the cube would be difficult to select for a novice user who lacks intuition for the RGB color space. In particular, to get shades of gray, the user would have to move along the diagonal of the cube between the vertices corresponding to black and white; looking at the exterior of the RGB color cube would not reveal any grays.
2. When selecting an object, the object flashes between its intrinsic color and white. This visual cue does not work well for colors that are very near white and is imperceptible if the object is pure white. An improvement would be to flash between the intrinsic color and its complement, so that the user notices the flashing regardless of the object's intrinsic color (a small sketch follows this list).
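Here is a tiny sketch of the complement-flash idea from point 2, with RGB components in [0, 1] (the blink period is arbitrary):

```python
def complement(rgb):
    """Complement of an RGB color with components in [0, 1]."""
    r, g, b = rgb
    return (1.0 - r, 1.0 - g, 1.0 - b)

def highlight_color(rgb, t, period=0.5):
    """Alternate between the intrinsic color and its complement each half-period."""
    return rgb if int(t / period) % 2 == 0 else complement(rgb)

# A pure-white object now flashes to black instead of imperceptibly to white.
assert complement((1.0, 1.0, 1.0)) == (0.0, 0.0, 0.0)
```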
The ability to draw two-handed was added to the program at a test user's request, with mouse motion mapped to control the radius of the toothpaste primitive. Instead of requiring the mouse to control thickness, a possible extension would be a second wand-like input so that the interaction could take place more directly in 3D space. Using the mouse to vary the radius requires watching the output for feedback on the changes, whereas a second position-tracking device would provide one-to-one control.

An Introduction to Hidden Markov Models (Rabiner 1986)

Summary:
The paper addresses the problem of recognizing sequences of observations by comparing them to a signal model that is found to characterize some source of symbols. A first attempt fits linear system models over several short time segments to approximate the signal itself; this approach is rejected in favor of the hidden Markov model.
A hidden Markov model consists of a number of states, each associated with a probability distribution that decides which state to transition to next. As each state is entered, an observation is emitted according to a fixed, state-specific observation probability distribution. In addition, an initial probability distribution describes the likelihood of being in each state when observations begin.
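In the standard notation (which Rabiner and Juang use), the model is the triple of transition, observation, and initial distributions:

```latex
\lambda = (A, B, \pi), \qquad
a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \qquad
b_j(k) = P(v_k \text{ at } t \mid q_t = S_j), \qquad
\pi_i = P(q_1 = S_i)
```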
Three types of problems associated with HMMs are outlined and their solutions are given. The first problem deals with finding the probability of an observation sequence given a model. The second problem is concerned with finding the optimal sequence of states associated with a given observation sequence; one common optimality criterion picks the individually most likely state at each time. The third problem concentrates on how to adjust the model parameters to best describe how the observed sequence occurs. The three problems can be solved using the forward-backward procedure, the Viterbi algorithm, and the Baum-Welch method, respectively.
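As a concrete example of the first problem, here is a compact forward-procedure implementation for a discrete-observation HMM; the toy parameters below are illustrative, not from the paper:

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """P(O | lambda) via the forward procedure.

    A:  (N, N) transition matrix, A[i, j] = P(state j at t+1 | state i at t)
    B:  (N, M) emission matrix, B[i, k] = P(symbol k | state i)
    pi: (N,) initial state distribution
    obs: sequence of symbol indices
    """
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction
    return alpha.sum()                 # termination: sum over final states

# Toy 2-state, 2-symbol model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(forward_probability(A, B, pi, [0, 1, 0]))
```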
The paper lists recognition results for three different HMM observation models (all above 97%) when applied to isolated-word recognition on a vocabulary of the 10 digits.

Discussion:
Hidden Markov models are used in speech recognition, but the paper did not mention how an HMM could be applied to gesture recognition. I found the coin-tossing example more confusing than simply defining the model. My introduction to Markov chains was in the context of weather prediction; grounding the examples in a context with inherent practical motivation made more sense to me than an unmotivated one, like the outcome of tossing coins.

American Sign Language Fingerspelling Recognition System (Allen 2003)

Summary:
The paper proposes a system to identify the letters of the alphabet that are represented in American Sign Language by static gestures. The letters J and Z could not be recognized because their signs involve motion. The recognition system collected data from an 18-sensor CyberGlove using LabVIEW, loaded the data into a MATLAB program to train a perceptron network, and then used a second MATLAB program to match glove input with the most closely corresponding letter. Only one user was selected to train the neural network, resulting in an "up to 90%" accuracy rate. MATLAB's default configuration, which does not run in real time, was cited as a main obstacle in development.
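For context, here is a minimal sketch of the kind of single-layer perceptron classifier the paper describes, written in Python rather than MATLAB. The layer sizes match the paper's 18 sensors and 24 static letters, but the training details are my assumptions:

```python
import numpy as np

n_sensors, n_letters = 18, 24          # 24 = alphabet without J and Z
W = np.zeros((n_letters, n_sensors))   # one linear unit per letter
b = np.zeros(n_letters)

def predict(x):
    """Return the index of the letter with the highest linear score."""
    return int(np.argmax(W @ x + b))

def train(samples, labels, epochs=50, lr=0.1):
    """Classic multiclass perceptron updates on (sensor vector, letter) pairs."""
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = predict(x)
            if p != y:                 # on a mistake, move scores toward truth
                W[y] += lr * x; b[y] += lr
                W[p] -= lr * x; b[p] -= lr
```

Calling train(glove_vectors, letter_indices) once and then predict(new_reading) per sample mirrors the split between the paper's separate training and matching MATLAB programs.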

Discussion:
The paper admittedly "took a more narrow initial focus," so the fact that only static gestures were recognized is understandable. Still, translating static poses of one user into spoken letters is very far from the long-term goal of translating moving gestures into spoken English. The single result of 90% reported was not very illuminating.