Thursday, May 8, 2008

Real-Time Locomotion Control by Sensing Gloves (Komura 2006)

Summary:
The paper focuses on a method of mapping a user's hand motion to the locomotion of virtual characters in real time. The proposed method allows control of a multi-joint character through the motion of fingers, whose joint angles are detected by a sensing glove. Animations are recorded via motion capture or generated using 3D modeling software and played back to the user. The user mimics the animation by moving their hands, and that motion is used to generate a function that maps the hand motion to the standard animation. New locomotion animations can then be generated in real time when the user's hand moves in ways that are not necessarily within the domain of the motion used to generate the mapping function.
When calculating the correspondence between finger and animation motion, a cyclic locomotion is assumed in the calibration cycle. The period of each degree of freedom is calculated by finding the auto-correlation of that degree of freedom's trajectory. During play, the period of each joint in the animation is compared to the period of finger motion and the joints are classified as full cycle, half cycle, or exceptional. The hand's general coordinates are matched with the character's. Relative velocities of the character's end effectors with respect to its root are compared with relative velocities of the fingertips with respect to the wrist to match each end effector with a finger. The motion of a joint is determined by the motion of the finger(s) associated with the end effector(s) that are its descendants.
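To make the period estimation concrete, here is a rough sketch (my own, not code from the paper) of how the dominant period of one degree of freedom's trajectory could be estimated from its autocorrelation:

```python
import numpy as np

def dominant_period(trajectory):
    """Estimate the cycle length (in samples) of a roughly periodic 1D signal.

    The signal is mean-centered, its autocorrelation is computed, and the lag
    of the first peak after the initial lobe around lag 0 is returned.
    """
    x = np.asarray(trajectory, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # keep non-negative lags only
    ac /= ac[0]                                        # normalize so lag 0 equals 1
    negative = np.where(ac < 0)[0]
    start = negative[0] if len(negative) else 1        # skip the lobe around lag 0
    return start + int(np.argmax(ac[start:]))
```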
To test the method, mapping functions were generated for both human and dog walking animations. The animations were successfully extrapolated into running animations in real time. An experiment involving four subjects was conducted to compare the speed and number of collisions of a game character being navigated through a maze. On average, keyboard controlled locomotion took 9% less time, but resulted in more than twice as many collisions as sensing glove controlled locomotion.

Discussion:
An extension that would improve the realism of the motion would be animation blending. Currently, the system uses a single animation played at different speeds controlled by finger speed. However, a walking animation played back at double the intended rate does not look the same as a running animation. Animation blending uses some function to combine weighted joint angles from multiple animations so that the resulting frames transition naturally between animations, depending on speed.
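A minimal sketch of what I mean, assuming walk and run poses sampled at the same phase of the locomotion cycle (the speed thresholds are made-up values, not from the paper):

```python
import numpy as np

def blend_pose(walk_pose, run_pose, speed, walk_speed=1.4, run_speed=3.5):
    """Linearly blend two time-aligned poses (arrays of joint angles) by speed.

    Below walk_speed the walk pose is used as-is, above run_speed the run pose
    is used, and in between the joint angles are interpolated.
    """
    w = np.clip((speed - walk_speed) / (run_speed - walk_speed), 0.0, 1.0)
    return (1.0 - w) * np.asarray(walk_pose, float) + w * np.asarray(run_pose, float)
```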
During the control stage, when finger motion exceeds the original bounds of the mapping function generated in the calibration stage, extrapolation of the mapping function is performed for a limited space outside of the original domain. Instead of only considering the tangent (just the first derivative) to the mapping function at its endpoint, more derivatives could be considered to obtain a more accurate estimate of motion outside the original domain.
Controlling jumping motion of a character with the hand seems like a strange gameplay choice. If the hand position was mapped to character height, then holding the hand in the air would allow the player to hover, which is not realistic. If jumping were simulated in a physically realistic way, then mapping hand position to character height becomes pointless since all hand-height information would be discarded, except for the start time of the jump.

The 3D Tractus: A Three-Dimensional Drawing Board (Lapides 2006)

Summary:
The 3D Tractus is a system for creating three dimensional sketches whose interface consists of a 2D drawing surface that slides up and down to incorporate the third dimension. A user sitting at the Tractus draws on the tabletop surface with one hand and slides the tabletop up and down with the other. The goal of directly mapping virtual and physical space with the tabletop height is achieved with a string potentiometer-based sensor. A Toshiba Portege M200 Tablet PC was used as the interactive top of the 3D Tractus table. A counterweight system was used to ease the act of adjusting table height.
The software interface for the Tractus incorporates a 2D drawing area and a 3D overview window. The drawing area accepts pen-based input and the 3D view can be rotated by dragging the pen in the desired direction. Several approaches to expressing depth on the 2D drawing surface were attempted. Changing color and color intensity in response to line depth did not provide intuitive feedback with enough contrast to communicate distance accurately. The current implementation uses varying line width and a perspective projection to convey depth. Lines above the drawing plane are not rendered.
Three students with art backgrounds evaluated the system for 30 minutes each and were asked for feedback. The most requested feature was the ability to tilt the drawing surface to allow work to be done from any angle, similar to how the 3D view can be rotated.

Discussion:
One of the programs discussed in the related work section allows a user to select a slice of their 3D object, pull it out of the stack, and edit it. This approach would be unmanageably tedious if working in the direction normal to the surface of the slices. Even if the Tractus doesn't completely remove the problem of working in a direction along the normal of slice surfaces, it greatly reduces the time required to do so by removing the steps of selecting, removing, and replacing slices in the stack.
Directly mapping physical and virtual space limits the detail of sketches that could be more easily accomplished if zoom functionality was available. The ability to scale the mapping would allow the user to work at multiple levels of detail. Another improvement could be the use of stereovision to give the impression of depth on the 2D drawing area. Also, the addition of a button to toggle the rendering of lines above the drawing plane would aid sketch construction in some cases.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning (Lieberman 2007)

Summary:
The TIKL system is proposed as an additional channel for learning motor skills, providing low-latency feedback through a vibrotactile feedback suit. The Vicon tracking system uses twelve high-speed infrared cameras to record marker positions, from which the 3D positions of the markers are reconstructed and joint angles are calculated. Differences in joint angle between the teacher and learner are communicated to the learner through timed tactile pulses that utilize the sensory saltation effect. Corrective rotation can be communicated by a sequence of pulses around actuators that are attached in a ring around the arm.
An experiment was run to test how users' ability to learn motor tasks was affected by the tactile feedback provided by TIKL. Using deviations in joint angle as the metric for error, forty subjects' reactions during training sequences were measured. Real-time errors were reduced by up to 27% and learning was improved by up to 23%. Subjects reported no significant loss of comfort due to the wearable system.

Discussion:
One downside of the TIKL system is its reliance on the expensive and bulky Vicon system. Currently, only angle errors are used in determining feedback signal strength, which may not be the best measurement of task fulfillment. Other measures of error, such as end effector position, probably describe error better for some tasks. The problem of correcting end effector position alone can be hard since multiple joint configurations could result in the same end position.
The users participating in the study were told to mimic joint angles as opposed to another metric. The error reporting assumes that people easily grasp the concept of mimicking angles and remember to follow only that metric throughout the study. I don't think people naturally mimic only joint angles, so some error was probably introduced by using joint angle as the only error metric. In regard to the results section, polling users seems like a somewhat subjective way to measure qualities of the system; a more quantitative study could have been designed.

A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequences (Bernardin 2005)

Summary:
This paper proposes a system that uses information about both hand shape and contact points obtained from a combination of a data glove and tactile sensors to recognize continuous human grasp sequences. The long term goal of the study would be to teach a robotic system to perform a task simply by observing a human teacher instead of explicitly programming the robot. For this kind of learning to take place, the robot must be able to infer what has been done and map that to a known skill that can be described symbolically. This paper analyzes grasping gestures in order to aid the construction of their symbolic descriptions.
An 18 sensor Cyberglove is used in conjunction with an array of 16 capacitive pressure sensitive sensors affixed to the fingers and palm. Classification of grasping gestures is made according to Kamakura's grasp taxonomy, which identifies 14 different kinds of grasps. The regions of the hand which are covered by pressure sensors were chosen to maximize the detection of contact with a minimal number of sensors while also corresponding to the main regions in Kamakura's grasp types. An HMM was built for each type of grasp using 112 training gestures. The HMMs have a flat topology with 9 states and were trained offline.
The main benefit gleaned from the tactile sensors was that segmentation could be performed more easily.

Discussion:
The customization required of the capacitive pressure sensors indicates that there is not currently a mass-produced component to fill the demand for grasp detection hardware. In the description of the HMM recognizer, it is mentioned that a task grammar was used to reduce the search space of the recognizer. Since only grasp and release sequences are recognized, the segmentation problem is avoided.
If the end goal is to teach a robot to learn grasps by observation, I think an experiment that used both visual-based and glove-based inputs would be required to discern a link between the visual and tactile realms. The visual signal could be analyzed and possibly mapped to a tactile response.

Articulated Hand Tracking by PCA-ICA Approach (Kato 2006)

Summary:
The method proposed to track finger and hand motions in this paper begins by performing principal component analysis to reduce the number of feature dimensions that are considered. Then independent component analysis is performed on the lower dimensional result to get intrinsic feature vectors. The ICA technique finds a linear, non-orthogonal coordinate system in multivariate data with the goal of performing a linear transformation which makes the resulting variables as statistically independent as possible.
A hand was modeled in OpenGL and projected onto an image captured of a real hand to test the PCA-ICA method. The report of the results only says that the method's effectiveness at tracking a hand in real image sequences was demonstrated; no numerical statistics were reported.

Discussion:
The PCA-ICA method considers only finger bentness, not hand position or orientation. Recognizing hand gestures that rely on that kind of data could not solely depend on the PCA-ICA approach.
The pictures of hand motion based on ICA basis vectors look better than the pictures of hand motion based on PCA basis vectors. I think this may be due to the fact that five vectors were considered. What would ICA look like if some number other than five was used? I don't know if there would be a good mapping of ICA basis vectors to hand renderings for a number of vectors other than five (one for each finger).
I think one of the insights of this paper was its use of knowledge from a different field. The number of dimensions used in component analysis can be reduced by considering important observations from the field of biomechanics. In particular, the bend of a finger's end joint can typically be described as two thirds of the bend of the next joint along the same finger.
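As a toy illustration of how such a coupling removes a dimension (my own sketch; which joints are meant and the angle units are assumptions on my part):

```python
def synthesize_distal_bends(proximal_bends):
    """Approximate each finger's end-joint bend as two thirds of the measured
    bend of the preceding joint, so the end joints need not be sensed or
    modeled as independent dimensions."""
    return [2.0 / 3.0 * bend for bend in proximal_bends]
```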

Friday, May 2, 2008

A Survey of Hand Posture and Gesture Recognition Techniques and Technology (LaViola 1999)

Summary:
This aptly named paper presents a variety of gesture and posture recognition techniques. The techniques are loosely grouped as either feature extraction and modeling, learning algorithms, or miscellaneous. Template matching is one of the simplest techniques to implement and is accurate for small sets of postures, but it is not suited for gestures. Feature extraction and analysis uses a layered architecture that can handle both postures and gestures, but it can be computationally expensive if large numbers of features are extracted. Active shape models allow for real time recognition but only track the open hand. Principal component analysis can recognize around thirty postures, but it requires training by more than one person to achieve accurate results for multiple users. Linear fingertip models are only concerned with the starting and ending points of fingertips and only recognize a small set of postures. Causal analysis uses information about how humans interact with the world to identify gestures, and therefore, can only be applied to a limited set of gestures. Neural networks can recognize large posture or gesture sets with high accuracy given enough training data. However, the training can be very time consuming and the network must be retrained when items are added or removed from the set to be recognized. Hidden Markov models are well covered in the literature and can be used with either a vision-based or an instrumented approach. Training HMMs can be time consuming and does not necessarily give good recognition rates. Instance-based learning techniques are relatively simple to implement, but require a large amount of memory and computation time and are not suited for real time recognition. Spatio-temporal vector analysis is a non-obtrusive, computationally intensive, vision based approach which has not reported any recognition accuracy results.

Discussion:
One aspect of this paper I liked was the summaries after each subsection which highlighted key points of the previous paragraphs. We have discussed HMM based techniques so often in class that it was refreshing to see a wider variety of approaches. This paper was good for brainstorming what techniques to use, expand on, or combine for continuing work in gesture recognition. The experiments we did in class, where a gesture was shown and described in words by each person, gave some experience with the linguistic approach. Through some of the difficulties in class, I realized that it can be difficult to describe a posture in a way that is both accurate and universally understandable using words alone. The linguistic approach in the paper only considered postures in which fingers were fully extended or contracted, which covers only a small set of all possible postures. The paper says the approach is simple, but I would say it is difficult to do when considering a set of postures that is not tightly constrained.

Discourse Topic and Gestural Form (Eisenstein 2008)

Summary:
This paper examines the question of whether gestures without a relationship to a specific meaning are idiosyncratic to individual speakers or whether gestures are determined by the speaker's view of a discourse topic. To answer the question, a new characterization of gestures in terms of spatiotemporal interest points is used to extract a low level representation of gestures from video without manual intervention. The interest points are grouped into categories of codewords. The distribution of these codewords across speakers and topics forms the basis for analysis.
During the spatiotemporal interest point extraction from images, principal component analysis is used to reduce the dimensionality of the data to form the final feature vectors describing the gestures. A hierarchical Bayesian model was used to cluster gesture features and determine whether the gesture is related to the speaker or the discourse topic. To allow inference of the hidden variables in the Bayesian model, Gibbs sampling is used since it is guaranteed to converge to the true distribution over the hidden variables in the limit.
An experiment was conducted to analyze 33 short videos of fifteen different speakers covering five different topics. Four of the topics of discussion were chosen to be mechanical devices, since direct gestural metaphors would likely show up in the discussion of such items. The study concluded that gestures are topic dependent and that this relationship overshadows speaker-specific effects. With correct labeling of topics, 12% of gestures were related to the topic, but with randomly labeled topics, less than 3% of gestures were considered topic-specific.

Discussion:
The interest point features that were extracted exist at a lower level of abstraction than most other attempts at describing gestures. It is interesting that such low-level attributes are sufficient to perform successful recognition.
I think the limitation to mainly mechanical devices as topics could have influenced the ability to spot topic-specific gestures. If general categories were chosen, I think the topic-specific nature of gestures would be reduced.
I was not familiar with all of the mathematics discussed in the paper, so I'm not sure how sound their technique is from a theoretical point of view. Looking at the pictures of identified interest points and the frames before and after the interest point convinced me that the feature extraction must work in some cases.

Tuesday, April 29, 2008

3D Visual Detection of Correct NGT Sign Production (Lichtenauer 2007)

Summary:
In this paper, a 3D visual detection system is proposed to aid children in learning Dutch sign language. Two video cameras with wide angle lenses capture 640x480 resolution images at 25 frames per second. The user's head and hands are tracked by following skin-colored segments of the image from frame to frame. Adaptive skin color modeling determines how skin color appears under different lighting conditions, but must first be initialized by selecting some pixels within the face and a square of pixels surrounding the head. In practice, the colors of pixels showing skin were distributed in a bimodal manner due to two different light sources. Each modality was modeled separately to reduce mis-classification of pixels as skin colored or not. Classification of gestures is done based on fifty hand features. Dynamic Time Warping is used to find the level of time-correspondence between an input gesture and a reference gesture.
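For reference, a textbook Dynamic Time Warping distance between two feature sequences looks like the sketch below (a generic formulation, not necessarily the exact one used in the paper):

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """DTW distance between two sequences of feature vectors (shape T x D each)."""
    a, b = np.asarray(seq_a, float), np.asarray(seq_b, float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])       # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # insertion
                                 cost[i, j - 1],          # deletion
                                 cost[i - 1, j - 1])      # match
    return cost[n, m]
```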
A set of 120 different NGT signs performed by 70 individuals are used to test the sign classification. Cross validation was performed to effectively increase the amount of test data many times. Overall, the true positive classification rate was 95%. Dynamic time warping only improved recognition of signs with repetitious motion.

Discussion:
The classification algorithm relies on knowing the approximate start and end times of a sign. Some other kind of segmentation scheme could be applied so that the hands would not always have to come to rest on the table between signs. As a teaching tool, resting between signs limits the learning to single words. Multi-gesture phrases and sentences cannot be recognized without segmentation.
The left and right blobs identified during skin detection tracking are assigned to the left and right hands, regardless of which hand is on which side of the body. So, crossing hands will cause skin blobs to be mis-labeled, giving rise to the possibility of classifying two distinct gestures (both consisting of the same basic motion, but one in which hands cross and one in which they do not) as the same.

Television control by hand gestures (Freeman 1994)

Summary:
This paper describes a hand movement recognition system for controlling a television. The movements of an open hand were tracked and displayed to the user via a hand icon superimposed on a secondary display, which could overlay a television display in a more commercially viable prototype. The system captures images from a video camera which are downsampled to a 320 x 240 resolution. The images are processed on an HP 735 workstation which is connected via serial port to an All-In-One-12 remote control which can send commands to the television at a rate of about once per second. The recognition of the open hand trigger gesture takes about half a second. Once recognized, the hand position can be tracked at a rate of about five times per second. The images in the appendix were helpful in showing what the interface looked like and illustrating how background removal and local orientation video processing worked.

Discussion:
One scenario not covered in the paper was two users attempting to control the television simultaneously. The slider bar control interface for changing channels would not work well for a large number of channels because very fine control would be needed. A possible improvement would be to use a number pad or one slider per channel digit. The paper mentions that the users of the prototype system were excited, but I think it was due to the novelty of the system instead of an inherent fun factor. No results of the user study were given besides the fact that people seemed to enjoy using the system.

Monday, April 28, 2008

Invariant features for 3-D gesture recognition (Campbell 1996)

Summary:
This paper presents a system that uses video input from two cameras to track head and hand positions in three dimensions. This data is used to construct view-invariant features from the motions recorded by the cameras. The video tracking system is able to track head and hand position to an accuracy of two centimeters at thirty frames per second. The HTK2.0 HMM toolkit from Entropic Research was used to analyze the feature vectors created from the data captured by the video tracking system.
A set of eighteen T'ai Chi gestures were used in an experiment to test the system. Each gesture is modeled as a forward chaining continuous HMM, using some function of the coordinates of the head and hand positions as features. Several functions of the coordinates are examined, including Cartesian positions (x,y,z) and velocities (dx,dy,dz), polar positions (r,theta,z) and velocities (dr, dTheta, dz), and curvature (ds, log(rho), dz). The features expressed in Cartesian coordinates included sets with and without the head data.
The feature vector with the best overall recognition rate was the polar velocity feature (dr, dTheta, dz) at 95%, followed closely by the polar velocity with radius-scaled theta feature at 93% and the polar velocity feature with redundant Cartesian velocity data at 92%. Not surprisingly, the authors concluded that Cartesian velocity features perform better with data containing translational shifts and polar velocity features perform better when the data contain rotations.
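As a concrete illustration of the winning feature set, here is one way the (dr, dTheta, dz) velocities could be computed from a tracked hand trajectory; this reflects my reading of the feature definitions rather than the authors' code:

```python
import numpy as np

def polar_velocity_features(positions, dt):
    """Turn a T x 3 trajectory of (x, y, z) hand positions into per-frame
    (dr, dTheta, dz) velocity features."""
    p = np.asarray(positions, float)
    r = np.hypot(p[:, 0], p[:, 1])                   # radial distance in the x-y plane
    theta = np.unwrap(np.arctan2(p[:, 1], p[:, 0]))  # azimuth, unwrapped to avoid +/- pi jumps
    z = p[:, 2]
    return np.column_stack([np.diff(r), np.diff(theta), np.diff(z)]) / dt
```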

Discussion:
T'ai Chi was an interesting choice of gestures to recognize. From the images of gestures in figure 1, it seems the head position remains relatively fixed throughout all the gestures. If the head remains fixed, head position does not seem like it would be a very useful feature to examine when recognizing gestures.
The T'ai Chi gestures were performed seated due to space constraints. This undoubtedly affected the normal head movement produced when performing the gestures while not seated. Performing the T'ai Chi gestures while not seated would most likely produce more translation and possibly allow the Cartesian velocity feature to achieve higher recognition rates.
The data was resampled to a rate of 30 Hz, but the actual sample rate varied from 15 to 30 Hz. A cubic spline was used for resampling; it might be interesting to see the effects on data and recognition rates if different data fitting curves would have been used.
Shifted and rotated versions of the gesture data were collected during the experiment, but some of the rotation was added synthetically. I wonder why the shifted and rotated data were collected at all, if they could have been generated synthetically as well.

FreeDrawer - A Free-Form Sketching System on the Responsive Workbench (Wesche 2001)

Summary:
FreeDrawer is a collection of 3D curve drawing and deformation tools targeted towards product design. The Responsive Workbench program models objects using Catmull-Clark surfaces and B-spline curves as opposed to Constructive Solid Geometry or voxel-based virtual clay. Deformations allowed by B-spline curves and Catmull-Clark surfaces are beneficial for product design, in which original sketches need to be modified to produce the final shape.
The first curve of an object can be freely drawn anywhere in space, but subsequent curves are woven into the existing curve network - being altered if they do not intersect an existing curve. When a curve is modified, the rest of the model can be altered in two ways. In local mode, the modified curve and curves adjacent to it are altered. In global mode, all curves are altered recursively. Surfaces can be filled in with Catmull-Clark or Kuriyama surfaces when a closed loop of curves is selected.
The non-dominant hand is responsible for translating and rotating the model, while the dominant hand is responsible for editing and deforming the model. An original 3D widget with multiple pointers spread out in a fan shape supports multiple editing tools - each pointer is mapped to a different tool. A sketch of a chair seat took an experienced user approximately 15 minutes to create.

Discussion:
The idea of the designer drawing the full sweep of a curve across both halves of a mirrored plane and the program averaging each to match symmetrically is new to me. Previously, I had only seen modeling programs where the shape being drawn is mirrored as it is being drawn, and the modeler is only required to draw half of the shape.
The subdivision level for surfaces is chosen so that each curve piece has the same number of segments. This makes the subdivided surfaces easier to compute, but it is not the best option for models containing both areas of small detail and large areas without much detail. In that case, the large, less detailed areas are most likely over-defined.
Neighboring surfaces are connected with only G1 continuity. For several real-world modeling applications, higher continuity connection is required. Usually, G2 continuity is the minimum smoothness demanded for surfaces which reflect or which need to have good air flow characteristics. If this system were to be used for real-world design, more work would have to be done so that higher continuity can be achieved. Unfortunately, the option of higher continuity may make the system slightly less easy to use.
While the fan-shaped, multi-function widget is novel and may reduce tool selection time for expert users, I think it would produce confusion and tool selection errors in novice users.

Shape Your Imagination: Iconic Gestural-Based Interaction (Marsh 1998)

Summary:
Iconic gestures are a natural, intuitive, and effective way to convey spatial information. This paper separates iconic gestures into virtual and substitutive categories. Virtual depiction is defined as using the spatial relationship between the hands as an outline or tracing the picture of an object or shape. Substitutive depiction is defined as shaping the hands to match the form of an object as if it were being held or grasped in the hands.
A user study is performed to determine if humans use iconic hand gestures during non-verbal communication of shapes and objects, and if so, what kind of gestures are used with what frequency, and how many hands are involved in the gestures. Twelve volunteers were seated in a room and asked to non-verbally describe fifteen shapes and objects using hand gestures. Care was taken not to influence the volunteers' descriptions or otherwise influence their gestures by providing images of the objects.
Subjects used two-handed gestures 88.1% of the time, preferring virtual depiction. 75% of gestures were performed using purely virtual depiction, 17.9% were performed using purely substitutive depiction, and 7.1% used a combination of both. Only one object, the circle, was performed using primarily one-handed virtual depiction gestures.

Discussion:
The results of this paper were not surprising. I suppose it was still good to confirm the ideas experimentally, although a larger group of subjects could have been used to obtain more statistically significant results. Also, since the goal was to see how gestures are used naturally, an experiment could be designed to observe when gestures accompany certain spoken words or phrases. Telling the subjects that they must communicate completely non-verbally does not seem like a situation in which the most natural gestures would be made.
In general, users found describing 2D shapes easier than 3D shapes. I think an analogy to this insight exists in the recognition realm. Sketch recognition in two dimensions is an easier problem than gesture recognition in three dimensions.

A Dynamic Gesture Recognition System for the Korean Sign Language (Kim 1996)

Summary:
The paper describes a system for recognizing Korean Sign Language (KSL) gestures and converting them into text. The system employs two ten-sensor VPL Data-Gloves to measure the bend in the joints of each digit on both hands. The gloves also sense the position and orientation of each hand relative to a fixed source. Because the hand gestures need to be recognized regardless of an initial position which varies, the position data is recorded as the difference between the previous position and the current position. For the paper, 25 of around 6000 KSL gestures were analyzed and partitioned into ten sets based on their general direction. To determine which of the ten direction categories a motion belongs to during recognition, the change over the five most recent readings of region data is examined. Hand postures are recognized by applying the technique of Fuzzy Min-Max Neural (FMMN) Networks. Input from the data glove is classified into one of the direction classes and then recognized by the FMMN network. The system classifies gestures correctly almost 85% of the time.
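A rough sketch of the kind of windowed direction test described above (the class labels here are illustrative placeholders, not the paper's ten KSL direction categories):

```python
import numpy as np

def direction_class(recent_positions, motion_threshold=1e-3):
    """Classify the hand's recent motion into a coarse direction using the net
    displacement over the last few position readings."""
    p = np.asarray(recent_positions, float)
    delta = p[-1] - p[0]                      # net displacement over the window
    if np.linalg.norm(delta) < motion_threshold:
        return "stationary"
    axis = int(np.argmax(np.abs(delta)))      # dominant axis of motion
    sign = "+" if delta[axis] > 0 else "-"
    return ("x", "y", "z")[axis] + sign
```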

Discussion:
Figure 8 shows a diagram of the min-max neural network used for classification. There are ten input nodes at the bottom, one for each of the flex angles measured from the fingers. However, there should be fourteen class nodes at the top of the diagram instead of ten, since there are fourteen posture classes.
When motion data is expressed in its compressed form, the order of region data is preserved, but data relating to the length in time of each region is lost.
Although the paper states that the FMMN network requires no pre-learning about posture class and has on-line adaptability, I'd say the basic idea of neural networks does require learning since that is how the weights of the network are adjusted.
The mis-classifications are partially blamed on abnormal motions in gestures and postures, but dealing with data that exhibits less than ideal characteristics seems to be part of the point of applying complex solutions to gesture recognition.

Sunday, April 27, 2008

Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs (Song 2006)

Summary:
Most gesture segmentation and recognition systems examine gesture patterns for places to segment the data, which must be done before recognition. This requires that the gesture be complete before recognition begins, which delays the rate at which recognition can be performed. This work attempts to use a forward spotting scheme to perform gesture segmentation and recognition at the same time. A sliding window of observations is used to compute the probability of the observations within the window being a gesture or a non-gesture. The window averages out the effects of abruptly changing features. Accumulative HMMs are used in the gesture recognition portion of the system; they accept partial posture segments for which the most likely posture is determined. The final posture is determined by majority voting between the partial posture candidates.
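A sketch of the forward-spotting test on one window position (the log_likelihood method is my stand-in for however the HMMs are scored; the threshold is a tunable value):

```python
def spots_gesture(window_obs, gesture_models, non_gesture_model, threshold=0.0):
    """Flag the current sliding window as a gesture when the best gesture
    model's score beats the non-gesture (filler) model's score by a margin."""
    best_gesture = max(m.log_likelihood(window_obs) for m in gesture_models)
    filler = non_gesture_model.log_likelihood(window_obs)
    return (best_gesture - filler) > threshold
```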
An experiment using eight gestures to open and close curtains and turn lights on and off was performed to compare automatic and manual gesture segmentation and recognition. For segmentation, the automatic threshold method was reported to be 17.62% better than the manual threshold method. For recognition, the automatic method was reported to be 3.75% more accurate than the manual method.

Discussion:
The application of controlling curtains and lights in a home seemed a bit non-useful to me, since a gesture recognition system in the living room would probably receive a lot of signals that were not intended to be interpreted by the system. Installing such a system in a real home could cause some interesting scenarios for a family playing charades.
The results of the experiment depend on the manual recognition process, which was not described in the paper. Since there was no description of how the manual processes were performed, I don't know how to interpret the reported benefits of the automatic methods.
The gestures chosen for recognition are very simple and could be recognized by simply detecting whether a hand of the person passed some threshold of the camera space. HMMs are too complex of a solution for the type of gestures in the experiment.

A Survey of POMDP Applications (Cassandra 1998)

Summary:
The partially observable Markov decision process (POMDP) model is heavily based on regular Markov decision process models which have been the focus of other papers we have read dealing with Hidden Markov models. The defining parts of a POMDP model are three finite sets and three functions. Similar to HMMs, there is a set of states, a set of actions, and a set of observations. The transition and observation functions are the familiar probability distributions of HMMs. The main difference is the addition of an immediate reward function which gives the immediate benefit for performing each action while in each state.
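A minimal sketch of those ingredients and the standard belief update, using plain Python containers (the signatures are my own conventions, not any particular toolkit's API):

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

@dataclass
class POMDP:
    states: Sequence[Hashable]
    actions: Sequence[Hashable]
    observations: Sequence[Hashable]
    transition: Callable[[Hashable, Hashable, Hashable], float]      # P(s' | s, a)
    observation_fn: Callable[[Hashable, Hashable, Hashable], float]  # P(o | s', a)
    reward: Callable[[Hashable, Hashable], float]                    # R(s, a)

def belief_update(model: POMDP, belief: dict, action, observation) -> dict:
    """Bayesian belief update: b'(s') is proportional to
    P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    new_belief = {}
    for s_next in model.states:
        prior = sum(model.transition(s, action, s_next) * belief.get(s, 0.0)
                    for s in model.states)
        new_belief[s_next] = model.observation_fn(s_next, action, observation) * prior
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total > 0 else new_belief
```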
The paper gives an overview of several application areas for the POMDP model. In the area of industrial applications, the example of finding a policy for machine maintenance is given. This is one of the oldest applications of the POMDP model and it fits the problem very well since the observations are probabilistically related to internal states. Other industrial applications include structural inspection, elevator control policies, and the fishery industry. Autonomous robots are given as another possible application of the POMDP model. Currently, finding control rules for robots using the method is done at a higher level of abstraction than at the sensor and actuator level. The paper mentions that it might be possible to use a hierarchical arrangement of POMDP models to get the model functioning closer to the level of the hardware. Other scientific applications include behavioral ecology and machine vision. Business applications include network troubleshooting, distributed database queries, and marketing. Some military and social applications are also mentioned.

Discussion:
One observation from the machine vision application example is that POMDP models work best in special purpose visual systems where the domain has been restricted. One of the special systems mentioned was a gesture recognizer which attempts pattern recognition. Since this area is the most closely related to instrumented gesture recognition, I think it may be beneficial to read the paper referenced in this section (Active Gesture Recognition using Partially Observable Markov Decision Processes by Trevor Darrell and Alex Pentland).
One of the more interesting application areas to me was education, where the internal mental state of an individual is the model. The reward function could even be applied at an individual level, taking into consideration a student's learning style. I think the limitation of discretizing a space of concepts makes the application to learning less than totally accurate, since concepts are inherently difficult to define. Some of the application areas seemed somewhat far-fetched in that too much information would be required to build an accurate model. In addition to these theoretical limitations, the practical limitations of representing and performing computation on the models are even more severe. Because of the difficulty of correct POMDP modeling, algorithms which consider characteristics of the problem area must be used to achieve timely results.

A Similarity Measure for Motion Stream Segmentation and Recognition (Chuanjun 2005)

Summary:
The paper proposes a similarity measure to deal with problems in recognizing and segmenting motion streams. Specifically, the problems dealt with are differing lengths of motion duration, different variations in attributes of similar motions, and the fact that motion streams are continuous with no obvious "pause" at which to segment. The model presented represents a motion by a matrix in which columns correspond to joint angles and rows correspond to samples in time. Singular value decomposition is applied to two of these motion matrices, which may have differing numbers of rows, to determine a measurement of similarity. Eigenvalues and eigenvectors of the matrices representing both streaming motion input and known motion patterns are calculated and compared. The more similar two matrices are, the closer their eigenvectors are to being parallel and the closer their eigenvalues are to being proportional to each other.
The measurement of similarity is referred to as k Weighted Angular Similarity (kWAS). Data was collected from both a CyberGlove and a Vicon motion capture system to test the performance of the kWAS measurement. Although the first two eigenvalues (k = 2 in terms of kWAS) account for more than 95% of the sums of all the eigenvalues, considering the first two eigenvectors was not sufficient to recognize gestures at an acceptable rate. Increasing the value of k to six increased the accuracy of recognition for CyberGlove data to 94% and motion capture data to 94.6%. These results were higher than two other similarity measures called Eros and MAS.
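Here is my rough reading of the idea in code: take the first k singular directions of each motion matrix, weight them by the corresponding singular values, and sum the absolute cosines between matching directions (the paper's exact weighting may differ):

```python
import numpy as np

def kwas_similarity(motion_a, motion_b, k=6):
    """Similarity between two motion matrices (rows = time samples,
    columns = joint angles) in the spirit of kWAS."""
    _, sa, va = np.linalg.svd(np.asarray(motion_a, float), full_matrices=False)
    _, sb, vb = np.linalg.svd(np.asarray(motion_b, float), full_matrices=False)
    k = min(k, len(sa), len(sb))
    weights = (sa[:k] * sb[:k]) / np.sum(sa[:k] * sb[:k])
    cosines = np.abs(np.sum(va[:k] * vb[:k], axis=1))  # |cos| between matching directions
    return float(np.sum(weights * cosines))
```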

Discussion:
Not only was the kWAS measure more accurate, it required about the same running time as the MAS measurement and was twice as fast as the Eros measurement. It might be interesting to see how the speed of kWAS is affected by changing the value of k - especially in the context of comparing it with the other measurements, if they have variables that can be adjusted which affect their running time.
Since the proposed measure of similarity does not take into account the order patterns were executed in, it is unable to distinguish gestures which pass through the same positions at different times. For instance, performing a gesture normally and exactly in reverse would be identical in terms of the kWAS similarity measure.

Saturday, April 26, 2008

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation (Ip 2005)

Summary:
Cyber Composer is a music generation system that combines capturing hand gesture commands with music theory to produce an interactive approach to composition. The paper supplies much background information in music theory, most of which defines variables of music which need to be set in a certain way to achieve a certain musical style. Several rules were introduced that shape music production based on classical music theory. The system determines which diatonic scale to use based on what musical key is specified. Given an initial chord, the degree of affinity between chord pairs is considered when determining subsequent chords. Cadence is avoided in general, allowing the user to trigger it. Harmony notes are used for the melody note at the accented beat to serve as a balance between using too many or too few harmony notes.
The system itself uses two CyberGloves, each with a Polhemus receiver, to track hand and finger motion, position, and orientation. Hand signals are translated into motion triggers by the main program module and sent to the melody-generation module where MIDI output is produced according to the style template and music-theory-based rules. The user defines tempo and key before the interactive portion of the composition to be used in background music generation. During the interactive composition, melody notes are triggered when the wrist is straightened. Pitch is controlled by the relative height of the right hand compared to its position during the previous notes. Force of a note is determined by finger extension, while volume is controlled by finger flexion. Closing the left hand fingers triggers cadence. There was no concrete measurement of the system's ability to follow through on claims of allowing laypeople to express music in an interactive, innovative, and intuitive way.

Discussion:
Waving the wrist at every melody note seems like it could get tiring over the duration of a long song. Perhaps fixing the rhythm to remain constant (until recognition of a certain finger indicating that change was desired) would make the composition experience less repetitious. This change makes sense since rhythm is somewhat temporally consistent, meaning that it doesn't change a lot from moment to moment, and a change is usually not followed quickly by another change.
There is no real user study to speak of, but the creative nature of the system does not lend itself well to analysis. The approach is fairly novel, and the gestures control a variety of musical aspects, even if the mapping from gestures to their effects is not the most intuitive. Perhaps a user study should be done to see what gestures people would perform when wanting to signal changes in the variables used by Cyber Composer when generating music.
As an improvement to MIDI-based sounds, a system could be devised to play back sound samples of an instrument using a library such as FMOD. For holding longer notes, a loop point within each sound sample could be set to allow arbitrarily long, yet fluid, note samples to be generated.

A Multi-Class Pattern Recognition System for Practical Finger Spelling Translation (Hernandez-Rebollar 2002)

Summary:
This paper describes a system that recognizes static postures from the sign language alphabet using data collected through dual-axis accelerometers mounted on each of the fingers. The ten dimensional data (five fingers with two directions of accelerometer sensing each) collected for each letter is reduced to 3 dimensions. The X-global and Y-global features are extracted from the 10 dimensional data, as well as the index finger's height. Plotting distributions of these three features in 3D helps visualize clusters of data. The 3D data are projected onto planes to better decide dividing points for the classifiers. Gestures are identified by a hierarchical three-level classifier.
Ten posture readings were collected for each letter from each of five volunteers. The posture readings for the letters 'J' and 'Z' were collected at the end of the motion for the letter. Once the data has been analyzed and boundaries for the classifiers have been set up, they are programmed onto a microcontroller which is connected to a speech synthesizer so that the letters formed by hand postures can be heard. Twenty-one of the 26 letters had a recognition rate of 100%. The worst recognition rate was 78% for the letter 'U'.

Discussion:
The statistics reported about the accelerometers' resolution and the diagram describing the hierarchical classifier are sufficiently detailed. One unique aspect of this system is that it uses a computer only for off-line analysis of data and not during online recognition. The use of a microcontroller for recognition makes the prototype system portable and closer to a system that could actually be used in real life than most of the other systems covered in the literature.

Thursday, April 24, 2008

Hand Tension as a Gesture Segmentation Cue (Harling 1997)

Summary:
The premise of this paper is that hand tension can be used to aid segmentation of a sequence of gestures. The proposed idea is to look for places in time where the hand tension is minimized and treat these minima as places to segment the sequence of gestures. The model used to measure hand tension treats each finger as a rod of fixed length attached to two ideal springs whose tensions are calculated using Hooke's law and summed to produce an estimate of finger tension. Overall hand tension is computed by simply summing the tensions of each finger of the hand.
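A simplified sketch of that model (per-joint springs rather than the paper's exact two-spring rod geometry; the spring constant and rest angles are illustrative):

```python
def finger_tension(joint_angles, rest_angles, k=1.0):
    """Hooke's-law tension estimate for one finger: sum of spring tensions,
    where each spring's extension is the joint's deviation from a relaxed angle."""
    return sum(k * abs(angle - rest) for angle, rest in zip(joint_angles, rest_angles))

def hand_tension(finger_angle_sets, rest_pose, k=1.0):
    """Overall hand tension as the sum of per-finger tensions; candidate
    segmentation points are where this value reaches a local minimum."""
    return sum(finger_tension(f, r, k) for f, r in zip(finger_angle_sets, rest_pose))
```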
Gestures are split into four classifications, combinations of static and dynamic hand postures and hand locations. An experiment using a Power Glove to collect data from simple sequences of sign language gestures was performed. The Power Glove's limitation of only 4 levels of measurable tension and the study size of only one user contributed to the lack of robustness of the data used in the experiment.

Discussion:
The definitions of posture and gesture are distinguished by the exclusion or inclusion of hand location and orientation information. I think these definitions are influenced by the hardware used for gathering data.
A solution to the segmentation problem posed by a referenced paper is to hold a posture for one minute before it is recognized. This is quite a while in computer time, and seems unnecessarily long.
The hand tension segmentation algorithm does not attempt to recognize a posture until after a minimum value of hand tension that follows a possible gesture is identified. This enforced delay in recognition may seem a bit slow to the user and pushes the response time just outside the real-time range.
The hand tension model described cannot correctly segment gestures in which individual fingers increase and decrease tension at the same rate since the finger tensions are summed and there would be no drop in overall tension.
The calculation of hand tension doesn't account for tension in the palm of the hand, which is an important part of overall hand tension. The hand tension model could be improved by considering the direction of tension between adjacent fingers. Adjacent fingers whose tension lies in opposite directions should increase overall tension.

Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration (Poddar 1998)

Summary:
This paper examined gestures in the natural domain of weather newscasting. The authors describe a natural human computer interface as including a gesture recognition module, a speech recognition module, and a display for providing audio and visual feedback.
An HMM-based gesture recognition system was used to analyze video of five weather persons. The features used in the gesture recognition were extracted from the video using Kalman filtering and color segmentation. Specifically, the distance and the radial and angular velocities of each hand with respect to the head were used to describe hand motion. Gestures were classified into three main categories: pointing, area, and contour. Gestures were also separated into phases of preparation, retraction, and actual stroke. This led to the choice of using left-to-right causal models with 3 states for the preparation, retraction, and point HMMs and 4 states for the contour and rest HMMs.
A study was done to determine the co-occurrence of spoken words with specific gestures. When the results of the co-occurrence analysis were applied to the data, recognition rates were higher for three of the four video sequences examined. However, recognition rates remained somewhat low overall; the highest was 75% accuracy.

Discussion:
Since the data comes from a weather newscasting environment, the background filtering has the potential to be much simpler. Instead of using the video input of the composited image with the weather map displayed in the background, the raw video feed of the newscaster in front of the blue or green screen could be used. The color-based filtering algorithm would have a much easier job since the static, singly colored background can be easily filtered out.
The paper mentions a probability that could be interpreted as the weather person's handedness. I don't think handedness would affect the hand used for a gesture as much as where the weather person happened to be standing in relation to the portion of the map being discussed at the time of the gesture or which hand held the clicker that advances the background image to the next video feed.

Tuesday, April 22, 2008

3D Object Modeling Using Spatial and Pictographic Gestures (Nishino 1998)

Summary:
The paper suggests creating an "image externalization loop", which is just a fancy way of saying they provide visual feedback of a virtual object as it is being manipulated. Instead of representing and manipulating objects at the vertex or polygonal level, mathematical expressions of the 3D objects were created which take deformation parameters that can be altered through gestures. The system is implemented in C, using the OpenGL library for rendering graphics. Finger joint angles are read by two CyberGloves and fed into a static posture recognizer to classify hand shape. Position and orientation data for both hands are read by a Polhemus tracker and sent to a dynamic gesture recognizer. The recognized hand shape determines what operation to perform, while the movement is used to determine how to adjust the deformation parameters. Segmentation is performed based on the assumption that static hand posture remains generally fixed during motion of the hands. The left hand is used as a frame of reference for the motion of the right hand, which scales or rotates objects. It seems only three static gestures were recognized: an open hand for "deform", a closed fist for "grasp", and a pointing index finger for "point".
A gesture learning and recognition function, called the Two-Handed dynamic Gesture environment Shell (TGSH) utilizes a self-organizing feature map algorithm to allow users to specify their preferred gestures.
One experiment involved recreating virtual objects to match real ones. The dimensions of real objects were scanned using Minolta's Vivid 700 3D digitizer. Users were able to recreate the models using the system.

Discussion:
The amount of data required to store an object by its deformation parameters is greatly reduced when compared to polygonal representations (factors of 1:670 and 1:970 were given for two example objects). The idea of using the deformation parameters that describe the objects as searchable categories in a database of 3D objects is interesting.
The authors identified the trade-off between quality of the rendering and interactivity. The number of polygons was limited to allow sufficient drawing rates and interactive responsiveness. With today's technology (10 years of improvements), I doubt the restrictions of 8,000 and 32,000 polygons per object would be as strict.
Since implicit functions are used to model objects, the collision detection can be computed easily, by simply evaluating the function at a location in space and comparing the result to a constant. This is much easier than performing collision detection with polygonal based models.
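For instance, an inside/outside test against an implicit ellipsoid (my own illustration of the general idea) is just a function evaluation and a comparison:

```python
def inside_ellipsoid(point, center, radii):
    """Evaluate f(p) = sum(((p_i - c_i) / r_i)^2); f <= 1 means the point is
    inside or on the ellipsoid, f > 1 means it is outside."""
    f = sum(((p - c) / r) ** 2 for p, c, r in zip(point, center, radii))
    return f <= 1.0
```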
The fact that the blending function used produces G2 continuous surfaces is important since it allows reflections off the surface, including highlights, to be displayed with G1 continuity. Curves with less than G1 continuity do not look smooth and are easily detectable by humans.
The system only implements one of the superquadrics, the ellipsoid. There are three other superquadrics, namely hyperboloids of one and two sheets, and the toroid. Adding these additional primitives would be a natural extension for the existing system.

Sunday, April 20, 2008

Feature selection for grasp recognition from optical markers (Chang 2007)

Summary:
The goal of this paper is to create a system which can learn how to automatically create grasps for a robot manipulator given examples from a human demonstrator. The positions of a set of 3D markers placed on the back of the hand are used as an orientation-independent set of characteristics for classifying hand poses into one of six categories of grasps. Since a linear logistic regression classifier is sufficient to predict grasps, one goal of the paper is to reduce the number of markers while retaining recognition rates.
To bypass considering the exponential number of possible feature sets, two greedy sequential wrapper methods were used to evaluate the addition or removal of single features. The wrapper methods' goal was to achieve local optimization by considering which single feature to add or remove while retaining the highest recognition rate. To achieve hand orientation independence, three of the 3D markers were attached to a rigid part of the back of the hand that would serve to orient all the other marker positions.
A data set pairing grasp type with marker positions for grasps of 46 objects was collected from a total of three subjects. Using all 30 markers resulted in an accuracy of 91.5%, while using only five resulted in an accuracy of 86%. The set of markers selected by the backward selection differed from the set of markers selected by forward selection, and had a higher recognition rate.
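A sketch of the greedy forward wrapper (evaluate is assumed to train the classifier on a marker subset and return its cross-validated accuracy; the backward variant would start from the full set and drop markers instead):

```python
def forward_select(all_markers, evaluate, target_count):
    """Greedily add the single marker that most improves accuracy until
    target_count markers have been selected."""
    selected, remaining = [], list(all_markers)
    while remaining and len(selected) < target_count:
        best = max(remaining, key=lambda m: evaluate(selected + [m]))
        selected.append(best)
        remaining.remove(best)
    return selected
```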

Discussion:
I thought using the ratio of the highest class probability to the second highest class probability as a measure of confidence was a good idea.
I think the choice to initially use 30 markers was artificially high. The experimenters probably expected a significant decrease in marker count without a great loss in grasp recognition accuracy, simply because they chose a high number of markers.

Wednesday, April 16, 2008

Glove-TalkII - A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls (Fels 1998)

Summary:
The Glove-TalkII system maps hand gestures to control parameters of a formant speech synthesizer using an adaptive interface based on neural networks. The right hand's position, orientation, and finger joints are sampled frequently to control the synthesizer parameters. When producing vowels, hand position determines what type of vowel sound to make. When producing consonants, the hand posture determines what consonant sound to make. The left hand wears a ContactGlove which measures nine points of contact and is used to signal stop consonants. A foot pedal controls overall volume of the system while hand height controls speech pitch.
Three neural networks were established to determine what sound to produce. The Vowel/Consonant network is a 10-5-1 feedforward network with sigmoid activations which determines whether the sound to be emitted should be a vowel or a consonant. The Vowel network is a 2-11-8 feedforward network which determines what vowel sound to produce, while the Consonant network is a 10-14-9 feedforward network that determines what consonant sound to produce. The 11 and 14 hidden units of these networks are normalized radial basis functions that correspond to one of 11 cardinal vowels and 14 static consonant phonemes, respectively. The 8 and 9 output units determine parameters to the formant synthesizer. The additional output of the consonant network determines the voicing parameter.
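A bare-bones sketch of the 10-5-1 decision network's structure (weights are randomly initialized for illustration; Glove-TalkII trains them on examples of the user's own vowel and consonant hand data, and the output convention here is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VowelConsonantNet:
    """10 inputs -> 5 sigmoid hidden units -> 1 sigmoid output."""
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.w1, self.b1 = rng.normal(size=(10, 5)), np.zeros(5)
        self.w2, self.b2 = rng.normal(size=(5, 1)), np.zeros(1)

    def forward(self, x):
        h = sigmoid(np.asarray(x, float) @ self.w1 + self.b1)
        out = sigmoid(h @ self.w2 + self.b2)
        return float(out)   # assumed convention: near 1.0 = vowel, near 0.0 = consonant
```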

Discussion:
Only one person has been trained to use the system well enough to speak intelligibly. The subject was an accomplished piano player and trained for over 100 hours. Even so, the speed of speech is about 1.5 to 3 times slower than average. With such a time commitment and skill level as a barrier to using the system, I don't think it is likely to ever become practical.
The concept of stop consonants seemed to be introduced to help resolve the problem of sounds whose dynamics were too fast to be controlled by the user. This reminded me of a problem I am having with my final project, which uses a Wii remote and a sensor bar to emulate a violin. Creating a smooth sounding transition between notes is not straightforward. I could try to introduce a special transition sound sample to be played in the midst of transitions that would create a more realistic sound.

Activity Recognition using Visual Tracking and RFID (Krahnstoever 2005)

Summary:
The paper combines vision based tracking with RFID sensors to gain more information about tasks being performed. A minimum of two color cameras track the position of the head and hands of a subject by using skin color matching techniques. The subject's state is modeled by three ellipsoids whose centers are represented in spherical coordinates. The volume in which tracking takes place is divided into three parts, one for the head and one for each hand, to reduce the number of particles needed for tracking. The presence and orientation of key objects are tracked using RFID tags. To fully determine an object's orientation, three antennas must be attached to an RFID tag.
The information from visual tracking and RFID tags works together to identify activities of the subject. For instance, the presence of a user's hand in the emitter field of an object and the movement of an RFID-tagged object can imply that the user has picked up and is holding that object.
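As a toy illustration of that kind of fusion rule (not the paper's actual inference machinery), the two cues could be combined like this; the threshold and field names are my own assumptions:

from dataclasses import dataclass

@dataclass
class ObjectState:
    hand_in_emitter_field: bool   # from visual hand tracking vs. the tag's read zone
    tag_displacement_cm: float    # how far the RFID-tagged object has moved

def infer_activity(state, move_threshold_cm=2.0):
    """Very small rule: hand near the object's field plus the object moving => picked up."""
    if state.hand_in_emitter_field and state.tag_displacement_cm > move_threshold_cm:
        return "user picked up and is holding the object"
    if state.hand_in_emitter_field:
        return "user is reaching toward the object"
    return "no interaction with the object"

print(infer_activity(ObjectState(hand_in_emitter_field=True, tag_displacement_cm=5.0)))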

Discussion:
RFID-tagged objects are much easier to recognize than they would be using visual appearance alone. The system is able to track objects that would be occluded using visual information alone. One of the interesting applications suggested involved multiple RFID tags per item. If one were placed on the lid and one on the container, the action of opening the container could be detected.
When computing the likelihood of a pixel belonging either to a body part in the foreground of an image or to the background, each pixel in the projection of the union of the ellipsoids tracking the head and hands must be evaluated. Instead of performing collision detection on ellipsoidal volumes, rectangular bounding box volumes are used since their collisions are simpler to compute. This technique of reducing computational complexity by simplifying the collision detection boundaries is frequently used in physically based modeling and is quite appropriate here.
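A quick sketch of why the bounding-box substitution is cheaper: an axis-aligned box overlap test is just a few comparisons per axis, whereas an exact ellipsoid-ellipsoid test is far more involved. The box extents below are made up.

def aabb_overlap(min_a, max_a, min_b, max_b):
    """Axis-aligned bounding boxes overlap iff their intervals overlap on every axis."""
    return all(lo_a <= hi_b and lo_b <= hi_a
               for lo_a, hi_a, lo_b, hi_b in zip(min_a, max_a, min_b, max_b))

# Boxes enclosing, say, the head ellipsoid and one hand ellipsoid (made-up extents).
head_min, head_max = (-0.1, -0.1, 1.5), (0.1, 0.1, 1.8)
hand_min, hand_max = (0.05, 0.0, 1.6), (0.3, 0.2, 1.9)
print(aabb_overlap(head_min, head_max, hand_min, hand_max))  # True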

Tuesday, April 15, 2008

Online, Interactive Learning of Gestures for Human-Robot Interfaces (Lee 1996)

Summary:
This paper's goal was online teaching of new gestures from only a few examples, using discrete HMMs to model each gesture. The system uses a CyberGlove to recognize letters from the sign language alphabet. Input data is sampled from 18 sensors at a rate of 10 Hz and then reduced to a one-dimensional sequence of symbols using vector quantization. Each vector is encoded as a symbol from a codebook, which is a set of vectors representative of the domain of vectors to be encoded. The encoded symbols are sent to Bakis HMMs, and the gesture whose probability is above a certain threshold and not too close to the other gestures' probabilities is chosen as the one recognized. Segmentation is handled simply, by requiring the operator to hold the hand still for a short time between gestures. Only 14 letters were used, and the letters chosen are relatively easy to distinguish from each other compared to other possible sets of 14 letters. After two training examples, two experiments yielded 1% and 2.4% classification errors.
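A minimal sketch of the vector quantization step, assuming a precomputed codebook of representative vectors; each 18-sensor sample is encoded as the index of its nearest codeword (the codebook size here is a placeholder, not the paper's):

import numpy as np

def encode(samples, codebook):
    """Map each sample vector to the index of its nearest codebook vector (Euclidean)."""
    samples = np.atleast_2d(samples)
    # Distance from every sample to every codeword: shape (num_samples, num_codewords).
    dists = np.linalg.norm(samples[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 18))        # placeholder codebook: 64 codewords of 18 sensor values
glove_sequence = rng.normal(size=(25, 18))  # 2.5 s of glove data at 10 Hz
symbols = encode(glove_sequence, codebook)  # one-dimensional symbol sequence fed to the HMM
print(symbols[:10])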

Discussion:
In the section describing interactive training, when the system is "certain" about a gesture's classification, it performs the associated action. The specifics of being "certain" are not given, but could have been reported as a threshold on the probability that the observed data matches the sequence of observations describing a gesture.
At the end of the interactive training section, the scenario of specifying a new "halt" action is outlined. The details of how the user specifies the label and action associated with a recognized gesture are not given. This required guidance from the user seems to raise the difficulty of interacting with the system.
The choice of Bakis HMMs does work for a gesture set of essentially static ASL letters, but would be poor for classifying repetitive motions such as waving. The system could not be generalized to recognize all types of gestures, since the states of a Bakis HMM are restricted to moving forward.
Since the codebook is generated offline before recognition begins, is the correlation of the codebook vectors to observed vectors decreased when new gestures are being learned? Teaching new gestures will probably change the domain of vectors to be encoded. This should probably alter the codebook, but it does not.

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas (Kim 2007)

Summary:
The goal of the paper is to enable robots to track a target in a real environment containing obstacles without a predefined map. A direction-finding RFID reader uses a dual-direction antenna which determines orientation to an RF emitter by comparing the strengths of RF signals on two orthogonal spiral antennas. The strength of the RF signal depends on the antenna's orientation toward, and its distance from, the transponder. The RFID system is based on commercial active sensor nodes produced by Ymatic Limited and attached to a Pioneer-3DX mobile robot.
A couple of different experiments were run, mapping the ratio of the strength of the RF signals to relative positions of the RF reader and transponder for different paths with different obstacles in the environment. The general trend of the signal ratio strength could be predicted using equations from classical electromagnetic theory, but the graphs were fairly noisy in practice. A different set of experiments involved having the robot with the dual sensing antennas follow a target robot which could move about. The pictures from these experiments show that the tracking works, although the path of the following robot does not adhere closely to the path of the leading robot.
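Given how noisy the measured ratios were, one plausible, and purely hypothetical, way to implement the ratio-to-direction mapping is an empirical calibration table interpolated at run time rather than the closed-form electromagnetic model:

import numpy as np

# Hypothetical calibration data: measured signal-strength ratio of the two orthogonal
# spiral antennas at known bearings (degrees), gathered in a controlled sweep.
calib_ratio   = np.array([0.2, 0.5, 1.0, 2.0, 5.0])
calib_bearing = np.array([-60, -30,   0,  30,  60])

def estimate_bearing(ratio):
    """Interpolate the calibration curve; values outside the sweep are clipped to its ends."""
    return float(np.interp(ratio, calib_ratio, calib_bearing))

print(estimate_bearing(1.4))  # somewhere between 0 and 30 degrees with this made-up table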

Discussion:
One major advantage of the RFID technology is its lack of dependence on line of sight. This attribute is important for gesture recognition as well as robot tracking. Supposing accurate location information with sufficient granularity could be obtained from RFID sensors, I think they could be used in collecting gesture data. The small size of RFID tags makes them easier to attach and more convenient to wear in a glove than ultrasonic sensors.
The fact that the directional antenna was rotated independently of the robot suggests that the accuracy of measurements obtained from the RFID sensor would not be great enough for accurate gesture data collection. Since the magnitude of the ratio of the RFID signals varies quite a bit with environmental conditions, it would be difficult to get the kind of quality data needed for gesture recognition.

A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models (Qing 2005)

Summary:
This paper points out that dynamic gestures are stochastic: repeating the same gesture produces different measurements on each trial, yet there are statistical properties that describe the motion. The paper suggests using HMMs to extract these defining characteristics from gesture data in order to classify the gestures. An Expectation-Maximization algorithm (the Baum-Welch method) is used to train the HMMs. The features examined are the standard deviations of the angle variation of each finger.
An experiment examined the recognition of three gestures, named "Great", "Quote", and "Trigger", which were used to control the rotation of a cube about three axes. The GLUT framework was used to create a simple colored cube which could be rendered and rotated using the OpenGL library. Gesture data was sampled at 10 Hz from a CyberGlove and then normalized. The data are quantized and their standard deviations are calculated before being sent to the HMMs for recognition. Each gesture HMM was trained with 10 data sets. An observed gesture sequence is compared to each of the trained gestures, and the one with the highest probability is reported as the most likely match.
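The recognition step then reduces to an argmax over the trained models. A minimal sketch, assuming each trained HMM is represented by a function returning the log-likelihood of an observation sequence (the scoring functions below are stand-ins, not the paper's models):

def recognize(observation_seq, trained_models):
    """Return the gesture whose HMM assigns the observation sequence the highest likelihood."""
    scores = {name: model_loglik(observation_seq)
              for name, model_loglik in trained_models.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Stand-in scorers for the three gestures; real ones would run the forward algorithm.
models = {
    "Great":   lambda seq: -12.3 - 0.1 * len(seq),
    "Quote":   lambda seq: -15.7 - 0.1 * len(seq),
    "Trigger": lambda seq: -11.9 - 0.1 * len(seq),
}
print(recognize([3, 1, 4, 1, 5], models)[0])  # "Trigger" with these made-up scores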

Discussion:
The authors use many abbreviations throughout the paper, and their overuse adds an extra translation step for the reader. The overview of HMMs is a bit brief, but it does refer the reader to Rabiner's tutorial paper on HMMs. The choice to use HMMs was supposedly made in order to recognize dynamic gestures instead of static postures, but the three gestures used in the study could be distinguished without HMMs: the motion of the hands is unnecessary because the hand positions alone differ enough to separate them. Also, there was no user study or reporting of recognition accuracy rates.

Saturday, April 5, 2008

Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control (Sawada 1997)

Summary:
The motivation for this paper is the idea that human gestures can be transferred to an electronic instrument to produce an emotional musical performance without the performer having to learn the technique for controlling a specific instrument. The assumption that emotion is better conveyed by the force applied to an object than by the position of the hand on the object is also crucial to the work, which uses three-dimensional acceleration sensors to capture gesture motion. The acceleration sensor data has poor quantitative reproducibility, so instead of simple pattern matching, global features of the motion are extracted. In the study, the magnitude of the acceleration change in eight principal directions, the rotational direction, the intensity of motion, and the z direction of motion are used as features for gesture recognition.
A user must first train the system, which builds up a standard pattern of data, before recognition takes place, so that thresholds can be set for each individual. Gesture segmentation is triggered by the intensity of acceleration. Averages and standard deviations of the acceleration data are calculated and used as the basis for comparison during gesture recognition. Ten types of gestures are recognized and used to control the performance of MIDI music. The user whose gesture data trained the system performed gestures with a 100% recognition rate, while particular gestures from another user were misclassified.
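A small sketch of intensity-triggered segmentation, assuming the intensity is the magnitude of the 3-axis acceleration and the threshold comes from the user's training data (both of those are my assumptions):

import numpy as np

def segment_gestures(accel_xyz, threshold, min_len=3):
    """Split a stream of 3-axis accelerations into segments where |a| exceeds a threshold."""
    intensity = np.linalg.norm(accel_xyz, axis=1)
    active = intensity > threshold
    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i
        elif not on and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

rng = np.random.default_rng(1)
stream = rng.normal(scale=0.2, size=(100, 3))  # quiet hand
stream[30:50] += 2.0                           # a burst of motion
print(segment_gestures(stream, threshold=1.0))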
A tempo prediction model based on the previous two tempos is used to allow for smoother performance of the music than could be achieved by the system waiting for the recognition of a human determined tempo.
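The exact prediction formula is not given, but the simplest model "based on the previous two tempos" would be a linear extrapolation, sketched here as a guess:

def predict_tempo(prev_tempo_bpm, prev_prev_tempo_bpm):
    """Linear extrapolation: continue the trend of the last two observed tempos."""
    return prev_tempo_bpm + (prev_tempo_bpm - prev_prev_tempo_bpm)

print(predict_tempo(104, 100))  # accelerating: predict 108 bpm
print(predict_tempo(96, 100))   # slowing down: predict 92 bpm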

Discussion:
The tempo recognition based on change in acceleration magnitude is more accurate than image-based methods. This reduction of phase delay in detecting the tempo is important when the control issue is specifically about timing, as it is in music. Some of the gestures used in the study did not seem naturally related to music, such as the star-, triangle-, and heart-shaped motions. The paper also did not describe how gestures were mapped to effects on the music, aside from the downward motion being mapped to tempo.

Wednesday, April 2, 2008

Enabling Fast and Effortless Customisation in Accelerometer Based Gesture Interaction (Mantyjarvi 2004)

Summary:
The motivation for this paper is that gesture recognizers should be customizable and quick and easy to train. In order to keep the number of training examples low, the original training gestures are augmented with noise-distorted copies. A SoapBox (Sensing, Operating, and Activating Peripheral Box) sensing device with a three-axis accelerometer is used to collect motion data. In the preprocessing phase, gesture data is normalized to equal length and amplitude. The normalized data is converted to one-dimensional prototype vectors using the k-means algorithm with a codebook size of eight. An ergodic HMM was used in lieu of a left-to-right model, even though the left-to-right model is better suited to sequential time-series data, because both models were reported by Hoffman to give similar results. The number of states (five per HMM) was also chosen to follow Hoffman's work. New training data was generated from the original gesture data by adding random noise drawn from either a uniform or Gaussian distribution. The beginning and end of each gesture were signaled with a button press and release, alleviating the problem of segmentation.
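A sketch of the augmentation step, assuming the SNR values are linear power ratios (that interpretation, and the uniform-noise variance matching, are my assumptions):

import numpy as np

def add_noise_copies(gesture, snr, n_copies=3, kind="gaussian", rng=None):
    """Create noise-distorted copies of a normalized gesture at a target signal-to-noise ratio."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(gesture ** 2)
    noise_power = signal_power / snr
    copies = []
    for _ in range(n_copies):
        if kind == "gaussian":
            noise = rng.normal(scale=np.sqrt(noise_power), size=gesture.shape)
        else:  # uniform with the same power: variance of U(-a, a) is a^2 / 3
            a = np.sqrt(3.0 * noise_power)
            noise = rng.uniform(-a, a, size=gesture.shape)
        copies.append(gesture + noise)
    return copies

original = np.sin(np.linspace(0, 2 * np.pi, 40))[:, None] * np.ones((1, 3))  # toy 3-axis gesture
augmented = [original] + add_noise_copies(original, snr=3, n_copies=3)
print(len(augmented), "training examples from one recorded gesture")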
A single subject performed eight gestures for controlling a DVD player thirty times each. As expected, increasing the number of training vectors increased recognition accuracy, but four training vectors were determined to be sufficient, since performing six repetitions of the eight gestures was considered a nuisance. In an experiment testing the effect of the noise distributions' signal-to-noise ratio on recognition accuracy, the best accuracy was achieved with Gaussian noise at an SNR of 3, while the best result with uniformly distributed noise came at an SNR of 5. For either type of noise distribution, recognition accuracy rates of over 96% were obtained with just one original and three distorted training examples. The best performance (nearly 98%) requiring fewer than three original vectors came from using two original vectors and two or four Gaussian noise-distorted vectors.

Discussion:
The results comparing uniform and Gaussian noise are very close, and I do not see a distinct winner. There does not seem to be a clear pattern as to which noise distribution performs better, but I do think the addition of noisy copies of training vectors helps recognition slightly. Although the paper consistently refers to recognizing 3D gestures, all the example gestures in the study for controlling a DVD player could adequately be defined in two dimensions. In the preprocessing stage, the gesture data is linearly extrapolated or interpolated if the data sequence is too long or too short; a technique other than linear interpolation could be used to approximate a better fit to the data. I am not sure the reliance on Hoffman for the type of HMM and the number of states was well founded. The paper states that adding noise increases detectability in decision making under certain conditions, citing a paper by Kay for justification. It would be interesting to see what those conditions are and whether this paper meets them.

Monday, March 31, 2008

Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes (Wobbrock 2007)

Summary:
The $1 recognizer provides a solution to interactive recognition for use in human-computer interfaces that does not require in-depth algorithmic knowledge and intense mathematics to be usable. The $1 recognizer algorithm assumes an input of a sequence of points, outputs a list of most likely template matches, and can be broken down into four steps. First, the input point path is resampled so that the path is represented by N points that are equally spaced apart. Secondly, the points are rotated so that the vector from the centroid of all points in the gesture to the first point is at an angle of zero degrees. Next, the gesture points are scaled to a square and translated so that the centroid lies at the origin. Finally, the candidate points are compared to each template to determine which is the most likely match. The comparison is made between corresponding pairs of points in the candidate and template gestures by examining their distance apart, only after ensuring the candidate gesture is optimally aligned with the template gesture by performing a Golden Section Search to find the best rotational adjustment to the previously calculated indicative angle.
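The four steps translate almost directly into code. Below is a condensed sketch of the preprocessing and the point-to-point comparison; the resampling is simplified and the Golden Section Search is omitted, so this is an illustration of the pipeline rather than the published pseudocode.

import math

def centroid(pts):
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def resample(pts, n=64):
    """Redistribute the path into n roughly equally spaced points (simplified)."""
    dists = [math.dist(a, b) for a, b in zip(pts, pts[1:])]
    interval, out, acc = sum(dists) / (n - 1), [pts[0]], 0.0
    for (p, q), d in zip(zip(pts, pts[1:]), dists):
        while d > 0 and acc + d >= interval:
            t = (interval - acc) / d
            p = (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
            out.append(p)
            d -= interval - acc
            acc = 0.0
        acc += d
    while len(out) < n:
        out.append(pts[-1])
    return out[:n]

def rotate_to_zero(pts):
    """Rotate so the vector from the centroid to the first point is at angle zero."""
    cx, cy = centroid(pts)
    theta = -math.atan2(pts[0][1] - cy, pts[0][0] - cx)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [((x - cx) * cos_t - (y - cy) * sin_t + cx,
             (x - cx) * sin_t + (y - cy) * cos_t + cy) for x, y in pts]

def scale_and_translate(pts, size=250.0):
    """Scale to a reference square, then translate the centroid to the origin."""
    xs, ys = [x for x, _ in pts], [y for _, y in pts]
    w, h = max(xs) - min(xs) or 1.0, max(ys) - min(ys) or 1.0
    pts = [(x * size / w, y * size / h) for x, y in pts]
    cx, cy = centroid(pts)
    return [(x - cx, y - cy) for x, y in pts]

def path_distance(a, b):
    """Average distance between corresponding points of two preprocessed gestures."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def preprocess(pts):
    return scale_and_translate(rotate_to_zero(resample(pts)))

Recognition is then just a matter of preprocessing the candidate, computing path_distance against every stored template, and keeping the smallest, with the Golden Section Search added back in for the rotation-invariant score.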
Ten users performed a total of 4800 gestures to compare the $1 recognizer to Rubine and Dynamic Time Warping (DTW) methods. $1 and DTW were significantly more accurate than Rubine, with recognition rates above 99%, compared to Rubine's 92%. The number of training examples and speed of gesture articulation did not produce significant effects across the three recognizers. DTW took considerably longer to run than either $1 or Rubine. Interestingly, $1 performed only 0.23% worse when the Golden Section Search portion of the algorithm was removed.

Discussion:
I think the recognition rates achieved by $1 are commendable, especially considering its ease of implementation. The simplicity of the concepts involved is another plus for the $1 system. The algorithm presented in the appendix is the clearest and most complete description of how to implement a gesture recognition system that I have encountered this semester. I can see how position data from an instrumented glove could be projected into two dimensions and used as the input points for this system.
As presented in the paper, the $1 recognizer cannot distinguish gestures that depend on orientation. By recording the amount of rotation during the "indicative angle" rotation and the rotational optimization adjustment, the original orientation could be determined and used in the recognition process. For example, if the total rotation for each template gesture was recorded, and the total rotation for each candidate gesture was calculated, the values could be compared, and templates whose orientations were not within some tolerance level of the candidate could be eliminated.
The paper raises the question of whether the first point in a gesture is the best to use for finding the indicative angle. A possible alternative to the first point could be the centroid of the first n points. Experimentation could be done to find a suitable number for n, which would probably be related to N (possibly n = floor(N/10) or something similar?).
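A sketch of that alternative indicative angle, with the n = floor(N/10) guess from above (whether it actually helps would need the kind of experiment the paper ran):

import math

def indicative_angle_first_n(points, n=None):
    """Angle from the gesture centroid to the centroid of the first n resampled points."""
    N = len(points)
    n = n or max(1, N // 10)
    cx = sum(x for x, _ in points) / N
    cy = sum(y for _, y in points) / N
    fx = sum(x for x, _ in points[:n]) / n
    fy = sum(y for _, y in points[:n]) / n
    return math.atan2(fy - cy, fx - cx)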

Sunday, February 17, 2008

An Architecture for Gesture-Based Control of Mobile Robots (Iba 1999)

Summary:
The paper looks at the use of hand gestures as a way for inexperienced users to direct robots. A difficulty arises in communicating the intention of a task to a robot instead of providing a set of actions that can be mimicked. The system described in the paper interprets gestures as either local or global instructions to a robot. Raw data captured from a CyberGlove is compacted into a 20-element vector describing 10 features and their first derivatives. The 20 dimensional data is reduced to one of 32 codewords. A 3-state HMM leverages temporal data from the gestures, turning sequences of codewords into one of six defined gestures, or an unspecified gesture. The HMM allows the classification of non-gestures (those whose probability is less than 0.5) due to the addition of an initial wait state in the HMM. The paper reports 96% accuracy in recognizing gestures using the HMM with the wait state.
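The rejection logic itself is simple; a sketch with made-up gesture names and posteriors (not the paper's HMM code), using the 0.5 threshold mentioned above:

def classify_with_rejection(gesture_probs, threshold=0.5):
    """Return the most probable gesture, or reject as a non-gesture below the threshold."""
    best = max(gesture_probs, key=gesture_probs.get)
    return best if gesture_probs[best] >= threshold else "non-gesture"

# Hypothetical posteriors over the six defined gestures for one codeword sequence.
print(classify_with_rejection({"open": 0.72, "close": 0.11, "point": 0.08,
                               "wave-left": 0.04, "wave-right": 0.03, "stop": 0.02}))
print(classify_with_rejection({"open": 0.30, "close": 0.28, "point": 0.22,
                               "wave-left": 0.10, "wave-right": 0.06, "stop": 0.04}))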

Discussion:
The key insight in this paper's technique is the introduction of a "wait state" in the HMM that allows a gesture to be classified as not recognized. Adding more gestures to be recognized would reduce the effectiveness of a wait state, so the technique may not be well suited for distinguishing individual gestures from a large collection.
The distinction between gesture recognition and gesture spotting made at the beginning of the paper is important. The exclusion of non-gestures from an input sequence is a crucial facet of gesture spotting that distinguishes it from gesture recognition. This paper helped me form a better context for the possible applications of HMMs to gesture recognition. It was a good choice so soon after the Introduction to HMM paper by Rabiner and Juang.
The paper mentions that the length of an observation sequence should be limited to maintain a high probability that the observation sequence is produced by the model. However, no guidelines are given as to how to select n, the number of most recent observations in a sequence.
It was not clear whether the waving motion to dodge obstacles temporarily placed the robot on a detour in the direction of the wave until the obstacle was passed, or permanently altered the destination of the robot. I suspect the latter was implemented, since it would be easier to divert the robot along a straight trajectory in the direction of the wave. Performing some kind of obstacle avoidance for a temporary detour could be a possible improvement to the system.

Sunday, February 3, 2008

The HoloSketch VR Sketching System (Deering 1996)

Summary:
The paper discusses a general 3D drawing program called HoloSketch which allows interaction using a specialized 3D input device. The display for the HoloSketch system is a 20" CRT monitor with a resolution of 960x680, relatively high compared to head-mounted displays of that time. A pair of head-tracked shutter goggles allows the user to view sequential stereo images as a single 3D image, similar to a hologram. Although the HoloSketch system was optimized for the stereo CRT display, it could be used with other types of displays in a variety of applications.
The primary input device is a 3D wand that has six degrees of freedom. The menu system for HoloSketch takes advantage of the 3D positioning of the wand and is described as "fade-up", since the geometry fades away and the menu options fade into place just in front of where the geometry was located. Commands such as movement, rotation, and finer motion control require the wand input be combined with key presses to be interpreted. Elemental animation objects can be applied to primitives to achieve animation of position, orientation, scale, and color.
The paper mentions the possibility of interoperability with geometry from other 3D programs, but does not mention what specific packages or file formats are supported. The paper demonstrates that a general purpose drawing program can be extended into 3D, but suggests that the input devices required for 3D manipulation may not be accurate enough for mainstream users.

Discussion:
The extent to which accuracy was pursued in the project is commendable. I was impressed that the system corrects for changes in interocular distance due to the rotation of the viewer's eyes in their sockets.
The design of the interface could be improved in two ways:
1. While an RGB cube is a uniquely 3D choice for color selection, I think the colors not on the exterior of the cube would be difficult to select for a novice user who doesn't have an intuition for the RGB colorspace. In particular, to get shades of gray, the user would have to move along the diagonal of the cube between the vertices corresponding to black and white; looking at the exterior of the RGB color cube would not reveal any grays.
2. When an object is selected, it flashes between its intrinsic color and white. This visual cue does not work well for colors that are very near white and is imperceptible if the object is pure white. An improvement would be flashing between the intrinsic color and its complement, so that a user would notice the flashing regardless of the object's intrinsic color.
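A tiny sketch of the complement-flashing idea, assuming 8-bit RGB components:

def complement(rgb):
    """Per-channel complement of an 8-bit RGB color."""
    r, g, b = rgb
    return (255 - r, 255 - g, 255 - b)

def highlight_frames(rgb, n_frames=10):
    """Alternate between the intrinsic color and its complement while selected."""
    return [rgb if i % 2 == 0 else complement(rgb) for i in range(n_frames)]

print(complement((255, 255, 255)))       # pure white flashes against black
print(highlight_frames((200, 30, 30))[:4])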
The ability to draw two-handed was added into the program because of the test user's request. The mouse motion was mapped to control the radius of the toothpaste primitive. Instead of requiring the mouse motion to control thickness, a possible extension would be to add a second wand-like input so that the interaction could take place more directly in 3D space. Using the mouse to vary the radius requires seeing the output to get feedback on the changes, whereas a second position-tracking device would provide one-to-one control.

An Introduction to Hidden Markov Models (Rabiner 1986)

Summary:
The problem addressed is recognizing sequences of observations by comparing them to a signal model that characterizes some sequence of symbols. A first attempt fits linear system models over several short time segments to approximate the signal itself; this approach is rejected in favor of the hidden Markov model.
A Hidden Markov Model consists of a number of states, each of which is associated with a probability distribution to decide what the next state to transition to should be. As each state is entered, an observation is recorded which is determined by a fixed, state-specific "observation probability distribution". In addition, there is an initial probability distribution which describes the likelihood of being in each state when observations begin.
Three types of problems associated with HMMs are outlined and their solutions are given. The first problem deals with finding the probability of an observation sequence given a model. The second problem is concerned with finding the optimal sequence of states associated with a given observation sequence, which is usually accomplished by considering each state individually. The third problem concentrates on how to adjust the model parameters to best describe how the observed sequence occurs. The three problems can be solved using the forward-backward procedure, the Viterbi algorithm, and the Baum-Welch method, respectively.
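For the first problem, the forward procedure can be written in a few lines. A minimal sketch for a discrete-observation HMM, with toy placeholder parameters:

import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """P(observation sequence | model) via the forward procedure.

    pi: initial state distribution (N,), A: transition matrix (N, N),
    B: observation probabilities (N, M), obs: sequence of symbol indices.
    """
    alpha = pi * B[:, obs[0]]             # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction step
    return float(np.log(alpha.sum()))     # termination

# Toy 2-state, 3-symbol model.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
print(forward_log_likelihood([0, 1, 2, 2], pi, A, B))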
The paper lists recognition results for three different HMM observation models (all of which are above 97%) when applied to single word recognition on vocabularies of 10 digits.

Discussion:
Hidden Markov models are used in speech recognition, but the paper did not mention how an HMM could be applied to gesture recognition. I thought the coin-tossing example was more confusing than simply defining the model. My introduction to Markov chains was in the context of weather prediction; giving the examples a context with inherent practical motivation seems to make more sense than an unmotivated context like the outcome of tossing coins.

American Sign Language Fingerspelling Recognition System (Allen 2003)

Summary:
The paper proposes a system to identify the letters of the alphabet that are represented in American Sign Language by static gestures. The letters J and Z could not be recognized because their signs involve motion. The recognition system collected data from an 18-sensor CyberGlove using Labview, loaded the data into a MatLab program to train a perceptron network, and then used a second MatLab program to match glove input with the most closely corresponding letter. Only one user was selected to train the neural network, resulting in an "up to 90%" accuracy rate. MatLab's default configuration of not running in real time was cited as a main obstacle during development.
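As a rough modern analogue of that pipeline (the original used Labview and MatLab; this sketch swaps in scikit-learn's Perceptron and random placeholder data, so it is not the authors' code):

import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
letters = list("ABCDEFGHIKLMNOPQRSTUVWXY")  # 24 static letters; J and Z excluded

# Placeholder training data: 20 samples of 18 CyberGlove sensor values per letter.
X = np.vstack([rng.normal(loc=i, scale=0.3, size=(20, 18)) for i in range(len(letters))])
y = np.repeat(letters, 20)

clf = Perceptron(max_iter=1000).fit(X, y)
sample = rng.normal(loc=3, scale=0.3, size=(1, 18))  # should resemble the 4th letter, "D"
print(clf.predict(sample)[0])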

Discussion:
The paper admittedly "took a more narrow initial focus", so the fact that only static gestures were recognized is understandable. Still, translating static poses from one user into spoken letters is very far from the long-term goal of translating moving gestures into spoken English. The single reported result of 90% was not very illuminating.

Wednesday, January 30, 2008

Flexible Gesture Recognition for Immersive Virtual Environments (Deller 2006)

Summary:
The paper gives an overview of an inexpensive gesture recognition system which can be extended to recognize additional gestures and to work with multiple users. The author argues that gestures are a natural form of interaction: when a person sees a human hand perform an action, the person immediately knows how to perform the action, and there is much less mental translation involved with hand gestures than when a person sees a mouse pointer perform some action and tries to map that action onto physical movement. The paper mentions some downsides of current approaches -- fixed installations, expensive hardware, and the requirement for high computational power -- all of which combine to exclude these solutions from ordinary working environments.
The solution described by the paper uses a P5 glove with an infrared based position and orientation tracker. The virtual environment is displayed in 2D on a higher resolution monitor and in stereoscopic 3D on a SeeReal C-I. Since the system learns postures by each user performing them, the system can easily adapt to multiple users. When determining what gesture is being performed, the main decision is made by analyzing the position of each of the fingers and the orientation of the hand. The relevance of hand orientation in identifying a gesture can be adjusted per gesture.
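One way to picture the per-gesture orientation weighting is as a weighted distance between the observed hand state and each stored posture; the feature layout, gesture names, and weights below are my assumptions, not the paper's implementation:

import numpy as np

def posture_distance(observed, template, orientation_weight):
    """Weighted distance: finger bend values always count, orientation counts per gesture."""
    finger_d = np.linalg.norm(np.asarray(observed["fingers"]) - np.asarray(template["fingers"]))
    orient_d = np.linalg.norm(np.asarray(observed["orientation"]) - np.asarray(template["orientation"]))
    return finger_d + orientation_weight * orient_d

templates = {
    # "point" cares about where the hand is aimed; "grab" mostly does not.
    "point": ({"fingers": [0.1, 0.9, 0.9, 0.9, 0.9], "orientation": [0.0, 0.0, 0.0]}, 1.0),
    "grab":  ({"fingers": [0.8, 0.8, 0.8, 0.8, 0.8], "orientation": [0.0, 0.0, 0.0]}, 0.1),
}

observed = {"fingers": [0.15, 0.85, 0.9, 0.95, 0.9], "orientation": [0.3, 0.1, 0.0]}
best = min(templates, key=lambda name: posture_distance(observed, *templates[name]))
print(best)  # "point" with these made-up numbers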

Discussion:
The author mentions interacting "in a natural way by just utilizing [one's] hands in ways [one] is already used to". This seems a bit vague. When interacting with physical objects, some people may prefer to slide them while others may prefer to pick them up and move them. These different methods of movement achieve the same goal, so a choice must be made as to which gesture will be associated with an intended action in the program.
The gestures recognized by the example application seemed fairly limited, and the associated actions seemed to lack innovation -- mimicking a mouse-based interface. For example, the "tapping" and "moving" gestures do not provide any more usefulness as gestures than the 2D clicking and dragging movements of a mouse. I doubt the user of such a gesture-based system would gain any real productivity over a traditional interface.
The paper does not report quantitative results, serving as more of a proof of concept than a scientific study. The authors claim that the engine provides a fast and reliable gesture recognition interface on standard consumer computers, but fail to give specific data defining "reliable" and "standard consumer".

Monday, January 28, 2008

Environmental Technology - Making the Real World Virtual (Krueger 1993)

Summary:
In this paper, Krueger reviews some of his contributions to virtual environments, which are mainly concerned with compositing video with virtual environments to create a new scene. Some of the applications mentioned are multi-point control for sculpting and videoconferencing. Other applications include range-of-motion therapy and an educational tool for teaching children about the scientific method. He rejects the idea of using input which is unnatural to the user, such as a head-mounted display, and advocates a “come as you are” approach.

Discussion:
The paper was generally not very academic and was lacking in quantitative results. One point of interest to me was the fact that Krueger coined the term “artificial reality”.

The author briefly mentions combining gesture input with speech recognition. The paper by Rabiner and Juang on hidden Markov models uses speech recognition as its example application. One possible extension to the research might be combining speech and gesture data into a single HMM to more quickly or accurately recognize a user's intended command.