Tuesday, April 29, 2008

3D Visual Detection of Correct NGT Sign Production (Lichtenauer 2007)

Summary:
In this paper, a 3D visual detection system is proposed to aid children in learning Dutch sign language (NGT). Two video cameras with wide-angle lenses capture 640x480 resolution images at 25 frames per second. The user's head and hands are tracked by following skin-colored segments of the image from frame to frame. Adaptive skin color modeling determines how skin color appears under different lighting conditions, but must first be initialized by selecting some pixels within the face and a square of pixels surrounding the head. In practice, the colors of pixels showing skin were distributed in a bimodal manner due to two different light sources. Each mode was modeled separately to reduce misclassification of pixels as skin colored or not. Classification of gestures is done based on fifty hand features. Dynamic Time Warping is used to find the level of time-correspondence between an input gesture and a reference gesture.
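To make the matching idea concrete, here is a minimal Dynamic Time Warping sketch in Python. The fifty-dimensional feature vectors and the Euclidean local distance are stand-ins of my own; the paper's actual feature set and alignment constraints may differ.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic DTW: cost of the best monotonic alignment of two feature sequences.
    seq_a, seq_b are arrays of shape (time, features); lower cost = better match."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Example: compare an input gesture to a reference gesture (random stand-ins).
rng = np.random.default_rng(0)
reference = rng.normal(size=(40, 50))   # 40 frames x 50 hand features (assumed)
observed = rng.normal(size=(55, 50))    # same gesture performed more slowly
print(dtw_distance(observed, reference))
```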
A set of 120 different NGT signs performed by 70 individuals was used to test the sign classification. Cross-validation was performed to effectively increase the amount of test data many times over. Overall, the true positive classification rate was 95%. Dynamic time warping only improved recognition of signs with repetitious motion.

Discussion:
The classification algorithm relies on knowing the approximate start and end times of a sign. Some other kind of segmentation scheme could be applied so that the hands would not always have to come to rest on the table between signs. As a teaching tool, resting between signs limits the learning to single words. Multi-gesture phrases and sentences cannot be recognized without segmentation.
The left and right blobs identified during skin detection tracking are assigned to the left and right hands, regardless of which hand is on which side of the body. So, crossing hands will cause skin blobs to be mis-labeled, giving rise to the possibility of classifying two distinct gestures (both consisting of the same basic motion, but one in which hands cross and one in which they do not) as the same.

Television control by hand gestures (Freeman 1994)

Summary:
This paper describes a hand movement recognition system for controlling a television. The movements of an open hand were tracked and displayed to the user via a hand icon on a secondary display, which could overlay the television picture in a more commercially viable prototype. The system captures images from a video camera which are downsampled to a 320 x 240 resolution. The images are processed on an HP 735 workstation which is connected via serial port to an All-In-One-12 remote control which can send commands to the television at a rate of about once per second. The recognition of the open hand trigger gesture takes about half a second. Once recognized, the hand position can be tracked at a rate of about five times per second. The images in the appendix were helpful in showing what the interface looked like and illustrating how background removal and local orientation video processing worked.

Discussion:
One scenario not covered in the paper was two users attempting to control the television simultaneously. The slider bar control interface for changing channels would not work well for a large number of channels because very fine control would be needed. A possible improvement would be to use a number pad or one slider per channel digit. The paper mentions that the users of the prototype system were excited, but I think it was due to the novelty of the system instead of an inherent fun factor. No results of the user study were given besides the fact that people seemed to enjoy using the system.

Monday, April 28, 2008

Invariant features for 3-D gesture recognition (Campbell 1996)

Summary:
This paper presents a system that uses video input from two cameras to track head and hand positions in three dimensions. This data is used to construct view-invariant features from the motions recorded by the cameras. The video tracking system is able to track head and hand position to an accuracy of two centimeters at thirty frames per second. The HTK2.0 HMM toolkit from Entropic Research was used to analyze the feature vectors created from the data captured by the video tracking system.
A set of eighteen T'ai Chi gestures were used in an experiment to test the system. Each gesture is modeled as a forward chaining continuous HMM, using some function of the coordinates of the head and hand positions as features. Several functions of the coordinates are examined, including Cartesian positions (x,y,z) and velocities (dx,dy,dz), polar positions (r,theta,z) and velocities (dr, dTheta, dz), and curvature (ds, log(rho), dz). The features expressed in Cartesian coordinates included sets with and without the head data.
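To illustrate the feature definitions, here is a small sketch that turns a Cartesian hand trajectory into the polar velocity feature (dr, dTheta, dz). The frame-to-frame differencing and the handling of angle wrap-around are my assumptions, not details from the paper.

```python
import numpy as np

def polar_velocity_features(xyz):
    """xyz: array of shape (T, 3) with Cartesian hand positions per frame.
    Returns a (T-1, 3) array of (dr, dTheta, dz) frame-to-frame differences."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.hypot(x, y)                      # radial distance in the horizontal plane
    theta = np.unwrap(np.arctan2(y, x))     # unwrap to avoid jumps at +/- pi
    return np.column_stack([np.diff(r), np.diff(theta), np.diff(z)])

# Example: a circular hand motion produces near-constant dTheta and small dr, dz.
t = np.linspace(0, 2 * np.pi, 60)
trajectory = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])
print(polar_velocity_features(trajectory)[:3])
```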
The feature vector with the best overall recognition rate was the polar velocity feature (dr, dTheta, dz) at 95%, followed closely by the polar velocity with radius-scaled theta feature at 93% and the polar velocity feature with redundant Cartesian velocity data at 92%. Not surprisingly, the authors concluded that Cartesian velocity features perform better with data containing translational shifts and polar velocity features perform better when the data contain rotations.

Discussion:
T'ai Chi was an interesting choice of gestures to recognize. From the images of gestures in figure 1, it seems the head position remains relatively fixed throughout all the gestures. If the head remains fixed, head position does not seem like it would be a very useful feature to examine when recognizing gestures.
The T'ai Chi gestures were performed seated due to space constraints. This undoubtedly affected the normal head movement produced when performing the gestures while not seated. Performing the T'ai Chi gestures while not seated would most likely produce more translation and possibly allow the Cartesian velocity feature to achieve higher recognition rates.
The data was resampled to a rate of 30 Hz, but the actual sample rate varied from 15 to 30 Hz. A cubic spline was used for resampling; it might be interesting to see the effects on the data and recognition rates if different data-fitting curves had been used.
Shifted and rotated versions of the gesture data were collected during the experiment, but some of the rotation was added synthetically. I wonder why the shifted and rotated data were collected at all, if they could have been generated synthetically as well.

FreeDrawer - A Free-Form Sketching System on the Responsive Workbench (Wesche 2001)

Summary:
FreeDrawer is a collection of 3D curve drawing and deformation tools targeted towards product design. The Responsive Workbench program models objects using Catmull-Clark surfaces and B-spline curves as opposed to Constructive Solid Geometry or voxel-based virtual clay. Deformations allowed by B-spline curves and Catmull-Clark surfaces are beneficial for product design, in which original sketches need to be modified to produce the final shape.
The first curve of an object can be freely drawn anywhere in space, but subsequent curves are woven into the existing curve network - being altered if they do not intersect an existing curve. When a curve is modified, the rest of the model can be altered in two ways. In local mode, the modified curve and curves adjacent to it are altered. In global mode, all curves are altered recursively. Surfaces can be filled in with Catmull-Clark or Kuriyama surfaces when a closed loop of curves is selected.
The non-dominant hand is responsible for translating and rotating the model, while the dominant hand is responsible for editing and deforming the model. An original 3D widget with multiple pointers spread out in a fan shape supports multiple editing tools - each pointer is mapped to a different tool. A sketch of a chair seat took an experienced user approximately 15 minutes to create.

Discussion:
The idea of the designer drawing the full sweep of a curve across both halves of a mirrored plane and the program averaging the two halves to make them symmetric is new to me. Previously, I had only seen modeling programs where the shape being drawn is mirrored as it is being drawn, and the modeler is only required to draw half of the shape.
The subdivision level for surfaces is chosen so that each curve piece has the same number of segments. This makes the subdivided surfaces easier to compute, but is not the best option for models containing areas of small detail and large areas without much detail. In this case, the large areas of less detail are most likely over-defined.
Neighboring surfaces are connected with only G1 continuity. For several real-world modeling applications, higher continuity connection is required. Usually, G2 continuity is the minimum smoothness demanded for surfaces which reflect or which need to have good air flow characteristics. If this system were to be used for real-world design, more work would have to be done so that higher continuity can be achieved. Unfortunately, the option of higher continuity may make the system slightly less easy to use.
While the fan-shaped, multi-function widget is novel and may reduce tool selection time for expert users, I think it would produce confusion and tool selection errors in novice users.

Shape Your Imagination: Iconic Gestural-Based Interaction (Marsh 1998)

Summary:
Iconic gestures are a natural, intuitive, and effective way to convey spatial information. This paper separates iconic gestures into virtual and substitutive categories. Virtual depiction is defined as using the spatial relationship between the hands as an outline or tracing the picture of an object or shape. Substitutive depiction is defined as shaping the hands to match the form of an object as if it were being held or grasped in the hands.
A user study is performed to determine if humans use iconic hand gestures during non-verbal communication of shapes and objects, and if so, what kind of gestures are used with what frequency, and how many hands are involved in the gestures. Twelve volunteers were seated in a room and asked to non-verbally describe fifteen shapes and objects using hand gestures. Care was taken not to influence the volunteers' descriptions or otherwise influence their gestures by providing images of the objects.
Subjects used two-handed gestures 88.1% of the time, preferring virtual depiction. 75% of gestures were performed using purely virtual depiction, 17.9% were performed using purely substitutive depiction, and 7.1% used a combination of both. Only one object, the circle, was described primarily with one-handed virtual depiction gestures.

Discussion:
The results of this paper were not surprising. I suppose it was still good to confirm the ideas experimentally, although a larger group of subjects could have been used to obtain more statistically significant results. Also, since the goal was to see how gestures are used naturally, an experiment could be designed to observe when gestures accompany certain spoken words or phrases. Telling the subjects that they must communicate completely non-verbally does not seem like a situation in which the most natural gestures would be made.
In general, users found describing 2D shapes easier than 3D shapes. I think an analogy to this insight exists in the recognition realm. Sketch recognition in two dimensions is an easier problem than gesture recognition in three dimensions.

A Dynamic Gesture Recognition System for the Korean Sign Language (Kim 1996)

Summary:
The paper describes a system for recognizing Korean Sign Language (KSL) gestures and converting them into text. The system employs two ten-sensor VPL DataGloves to measure the bend in the joints of each digit on both hands. The gloves also sense the position and orientation of each hand relative to a fixed source. Because the hand gestures need to be recognized regardless of a varying initial position, the position data is recorded as the difference between the previous position and the current position. For the paper, 25 of around 6000 KSL gestures were analyzed and partitioned into ten sets based on their general direction. To determine which of the ten direction categories a motion belongs to during recognition, the change over the five most recent readings of region data is examined. Hand postures are recognized by applying the technique of Fuzzy Min-Max Neural (FMMN) Networks. Input from the data glove is assigned to one of the direction classes and then recognized by the FMMN network. The system classifies gestures correctly almost 85% of the time.
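A rough sketch of the position-differencing and direction-binning step as I understand it. The ten categories below (eight compass directions in the plane plus two along z) are my own guess at how the directions might be partitioned; the paper's actual regions are not reproduced here.

```python
import numpy as np

DIRECTIONS = ["right", "up-right", "up", "up-left", "left",
              "down-left", "down", "down-right", "toward", "away"]

def classify_direction(positions):
    """positions: (5, 3) array of the five most recent hand positions.
    Sums frame-to-frame differences and bins the net motion into one of
    ten direction classes (eight in the plane plus two along z - my guess)."""
    net = np.diff(positions, axis=0).sum(axis=0)        # net displacement
    dx, dy, dz = net
    if abs(dz) > np.hypot(dx, dy):                      # mostly along z
        return "toward" if dz > 0 else "away"
    angle = np.arctan2(dy, dx)                          # planar direction
    index = int(np.round(angle / (np.pi / 4))) % 8      # nearest of 8 compass bins
    return DIRECTIONS[index]

print(classify_direction(np.array([[0, 0, 0], [1, 0.1, 0], [2, 0.2, 0],
                                   [3, 0.2, 0], [4, 0.3, 0]])))  # -> "right"
```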

Discussion:
Figure 8 shows a diagram of the min-max neural network used for classification. There are ten input nodes at the bottom, one for each of the flex angles measured from the fingers. However, there should be fourteen class nodes at the top of the diagram instead of ten, since there are fourteen posture classes.
When motion data is expressed in its compressed form, the order of region data is preserved, but data relating to the length in time of each region is lost.
Although the paper states that the FMMN network requires no pre-learning about posture class and has on-line adaptability, I'd say the basic idea of neural networks does require learning since that is how the weights of the network are adjusted.
The mis-classifications are partially blamed on abnormal motions in gestures and postures, but dealing with data that exhibits less than ideal characteristics seems to be part of the point of applying complex solutions to gesture recognition.

Sunday, April 27, 2008

Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs (Song 2006)

Summary:
Most gesture segmentation and recognition systems examine gesture patterns for places to segment the data, which must be done before recognition. This requires that the gesture be complete before recognition begins, which delays the rate at which recognition can be performed. This work attempts to use a forward spotting scheme to perform gesture segmentation and recognition at the same time. A sliding window of observations is used to compute the probability of the observations within the window being a gesture or a non-gesture. The window averages out the effects of abruptly changing features. Accumulative HMMs are used in the gesture recognition portion of the system; they accept partial posture segments for which the most likely posture is determined. The final posture is determined by majority voting among the partial posture candidates.
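Here is a toy sketch of the forward spotting idea: slide a window over the observation stream and start recognizing once the gesture/non-gesture likelihood ratio crosses a threshold. The likelihood functions are placeholders, not the paper's HMM-based threshold model.

```python
import numpy as np

def spot_gesture(observations, gesture_loglik, nongesture_loglik,
                 window=8, threshold=0.0):
    """Scan a stream of observations with a sliding window and return the
    frame index at which the gesture/non-gesture log-likelihood ratio first
    exceeds the threshold (a crude stand-in for forward spotting)."""
    for end in range(window, len(observations) + 1):
        segment = observations[end - window:end]
        ratio = gesture_loglik(segment) - nongesture_loglik(segment)
        if ratio > threshold:
            return end - window        # candidate gesture start
    return None

# Toy example: "gestures" are windows with values near 2, non-gestures near 0.
stream = np.concatenate([np.random.normal(0, 0.1, 50), np.random.normal(2, 0.1, 30)])
start = spot_gesture(stream,
                     gesture_loglik=lambda s: -np.sum((s - 2.0) ** 2),
                     nongesture_loglik=lambda s: -np.sum(s ** 2))
print(start)
```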
An experiment using eight gestures to open and close curtains and turn lights on and off was performed to compare automatic and manual gesture segmentation and recognition. For segmentation, the automatic threshold method was reported to be 17.62% better than the manual threshold method. For recognition, the automatic method was reported to be 3.75% more accurate than the manual method.

Discussion:
The application of controlling curtains and lights in a home seemed of limited use to me, since a gesture recognition system in the living room would probably receive a lot of signals that were not intended to be interpreted by the system. Installing such a system in a real home could cause some interesting scenarios for a family playing charades.
The results of the experiment depend on the manual recognition process, which was not described in the paper. Since there was no description of how the manual processes were performed, I don't know how to interpret the reported benefits of the automatic methods.
The gestures chosen for recognition are very simple and could be recognized by simply detecting whether a hand of the person passed some threshold of the camera space. HMMs are too complex a solution for the type of gestures in the experiment.

A Survey of POMDP Applications (Cassandra 1998)

Summary:
The partially observable Markov decision process (POMDP) model is heavily based on regular Markov decision process models which have been the focus of other papers we have read dealing with Hidden Markov models. The defining parts of a POMDP model are three finite sets and three functions. Similar to HMMs, there is a set of states, a set of actions, and a set of observations. The transition and observation functions are the familiar probability distributions of HMMs. The main difference is the addition of an immediate reward function which gives the immediate benefit for performing each action while in each state.
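The model components map naturally onto a small data structure. This is only a generic illustration of the (states, actions, observations, transition, observation, reward) tuple with a toy machine-maintenance flavor, not anything taken from the survey itself.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    states: Sequence[str]
    actions: Sequence[str]
    observations: Sequence[str]
    transition: Callable[[str, str, str], float]   # P(next_state | state, action)
    observe: Callable[[str, str], float]           # P(observation | next_state)
    reward: Callable[[str, str], float]            # R(state, action): immediate benefit

# Toy machine-maintenance flavor: the machine is "ok" or "worn", and we only
# observe the quality of the part it produces.
T = {("ok", "operate"):   {"ok": 0.9, "worn": 0.1},
     ("worn", "operate"): {"ok": 0.0, "worn": 1.0},
     ("ok", "repair"):    {"ok": 1.0, "worn": 0.0},
     ("worn", "repair"):  {"ok": 1.0, "worn": 0.0}}
machine = POMDP(
    states=["ok", "worn"],
    actions=["operate", "repair"],
    observations=["good part", "bad part"],
    transition=lambda s, a, s2: T[(s, a)][s2],
    observe=lambda s2, o: 0.9 if (s2 == "ok") == (o == "good part") else 0.1,
    reward=lambda s, a: -5.0 if a == "repair" else (1.0 if s == "ok" else 0.2),
)
print(machine.transition("ok", "operate", "worn"))   # 0.1
```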
The paper gives an overview of several application areas for the POMDP model. In the area of industrial applications, the example of finding a policy for machine maintenance is given. This is one of the oldest applications of the POMDP model and it fits the problem very well since the observations are probabilistically related to internal states. Other industrial applications include structural inspection, elevator control policies, and the fishery industry. Autonomous robots are given as another possible application of the POMDP model. Currently, finding control rules for robots using the method is done at a higher level of abstraction than at the sensor and actuator level. The paper mentions that it might be possible to use a hierarchical arrangement of POMDP models to get the model functioning closer to the level of the hardware. Other scientific applications include behavioral ecology and machine vision. Business applications include network troubleshooting, distributed database queries, and marketing. Some military and social applications are also mentioned.

Discussion:
One observation from the machine vision application example is that POMDP models work best in special purpose visual systems where the domain has been restricted. One of the special systems mentioned was a gesture recognizer which attempts pattern recognition. Since this area is the most closely related to instrumented gesture recognition, I think it may be beneficial to read the paper referenced in this section (Active Gesture Recognition using Partially Observable Markov Decision Processes by Trevor Darrell and Alex Pentland).
One of the more interesting application areas to me was education, where the internal mental state of an individual is what is modeled. The reward function could even be applied at an individual level, taking into consideration a student's learning style. I think the limitation of discretizing a space of concepts makes the application to learning not totally accurate. Concepts are inherently difficult to define. Some of the application areas seemed somewhat far-fetched in that too much information would be required to build an accurate model. In addition to these theoretical limitations, the practical limitations of representing and performing computation on the models are even more severe. Because of the difficulty of correct POMDP modeling, algorithms which consider characteristics of the problem area must be used to achieve timely results.

A Similarity Measure for Motion Stream Segmentation and Recognition (Chuanjun 2005)

Summary:
The paper proposes a similarity measure to deal with problems in recognizing and segmenting motion streams. Specifically, the problems dealt with are differing lengths of motion duration, different variations in attributes of similar motions, and the fact that motion streams are continuous with no obvious "pause" at which to segment. A motion is represented by a matrix in which columns correspond to joint angles and rows correspond to samples in time. Singular value decomposition is applied to two of these motion matrices, which may have different numbers of rows, to determine a measurement of similarity. Eigenvalues and eigenvectors of the matrices representing both streaming motion input and known motion patterns are calculated and compared. The more similar two matrices are, the closer their eigenvectors are to being parallel and the closer their eigenvalues are to being proportional to each other.
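A sketch of the comparison step using SVD. The exact weighting kWAS uses is not spelled out above, so the weights below (normalized singular values of both matrices) are just one plausible choice.

```python
import numpy as np

def kwas_similarity(motion_a, motion_b, k=6):
    """Compare two motion matrices (rows = time samples, columns = joint angles)
    by how parallel their first k right singular vectors are, weighted by the
    corresponding singular values (an assumed weighting, not the paper's exact one)."""
    _, sa, vta = np.linalg.svd(motion_a, full_matrices=False)
    _, sb, vtb = np.linalg.svd(motion_b, full_matrices=False)
    weights = (sa[:k] + sb[:k]) / (sa[:k].sum() + sb[:k].sum())
    dots = np.abs(np.sum(vta[:k] * vtb[:k], axis=1))   # |cos| between paired vectors
    return float(np.sum(weights * dots))               # closer to 1.0 = more similar

rng = np.random.default_rng(1)
pattern = rng.normal(size=(100, 22))                   # e.g. 22 glove channels
same_motion = pattern[::2] + 0.01 * rng.normal(size=(50, 22))  # shorter, noisy copy
print(kwas_similarity(pattern, same_motion),
      kwas_similarity(pattern, rng.normal(size=(80, 22))))
```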
The measurement of similarity is referred to as k Weighted Angular Similarity (kWAS). Data was collected from both a CyberGlove and a Vicon motion capture system to test the performance of the kWAS measurement. Although the first two eigenvalues (k = 2 in terms of kWAS) account for more than 95% of the sum of all the eigenvalues, considering only the first two eigenvectors was not sufficient to recognize gestures at an acceptable rate. Increasing the value of k to six increased the accuracy of recognition for CyberGlove data to 94% and motion capture data to 94.6%. These results were higher than those of two other similarity measures, Eros and MAS.

Discussion:
Not only was the kWAS measure more accurate, it required about the same running time as the MAS measurement and was twice as fast as the Eros measurement. It might be interesting to see how the speed of kWAS is affected by changing the value of k - especially in the context of comparing it with the other measurements, if they have variables that can be adjusted which affect their running time.
Since the proposed measure of similarity does not take into account the order patterns were executed in, it is unable to distinguish gestures which pass through the same positions at different times. For instance, performing a gesture normally and exactly in reverse would be identical in terms of the kWAS similarity measure.

Saturday, April 26, 2008

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation (Ip 2005)

Summary:
Cyber Composer is a music generation system that combines capturing hand gesture commands with music theory to produce an interactive approach to composition. The paper supplies much background information in music theory, most of which defines variables of music which need to be set in a certain way to achieve a certain musical style. Several rules were introduced that shape music production based on classical music theory. The system determines which diatonic scale to use based on what musical key is specified. Given an initial chord, the degree of affinity between chord pairs is considered when determining subsequent chords. Cadence is avoided in general, allowing the user to trigger it. Harmony notes are used for the melody note at the accented beat to serve as a balance between using too many or too few harmony notes.
The system itself uses two CyberGloves, each with a Polhemus receiver, to track hand and finger motion, position, and orientation. Hand signals are translated into motion triggers by the main program module and sent to the melody-generation module where MIDI output is produced according to the style template and music-theory-based rules. The user defines tempo and key before the interactive portion of the composition to be used in background music generation. During the interactive composition, melody notes are triggered when the wrist is straightened. Pitch is controlled by the relative height of the right hand compared to its position during the previous notes. Force of a note is determined by finger extension, while volume is controlled by finger flexion. Closing the left-hand fingers triggers cadence. There was no concrete measurement of the system's ability to follow through on claims of allowing laypeople to express music in an interactive, innovative, and intuitive way.
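A toy sketch of the kind of mapping described: right-hand height to a pitch within the chosen key, and finger flexion to volume. The scale layout, height range, and MIDI numbers are placeholders of my own, not values from the paper.

```python
C_MAJOR = [0, 2, 4, 5, 7, 9, 11]   # semitone offsets of a diatonic scale (placeholder key)

def height_to_midi_note(hand_height, low=0.8, high=1.8, base_note=60):
    """Map right-hand height (metres, assumed range) to a scale degree above middle C."""
    frac = min(max((hand_height - low) / (high - low), 0.0), 1.0)
    degree = int(round(frac * (len(C_MAJOR) * 2 - 1)))          # two octaves of the scale
    octave, step = divmod(degree, len(C_MAJOR))
    return base_note + 12 * octave + C_MAJOR[step]

def flexion_to_velocity(flexion):
    """Map average finger flexion in [0, 1] to MIDI velocity (volume), 1..127."""
    return max(1, min(127, int(round(flexion * 127))))

print(height_to_midi_note(1.3), flexion_to_velocity(0.6))
```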

Discussion:
Waving the wrist at every melody note seems like it could get tiring over the duration of a long song. Perhaps fixing the rhythm to remain constant (until recognition of a certain finger indicating that change was desired) would make the composition experience less repetitious. This change makes sense since rhythm is somewhat temporally consistent, meaning that it doesn't change a lot from moment to moment, and a change is usually not followed quickly by another change.
There is no real user study to speak of, but the creative nature of the system does not lend itself well to analysis. The approach is fairly novel, and the gestures control a variety of musical aspects, even if the mapping from gestures to their effects is not the most intuitive. Perhaps a user study should be done to see what gestures people would perform when wanting to signal changes in variables used by Cyber Composer when generating music.
As an improvement to MIDI-based sounds, a system could be devised to play back sound samples of an instrument using a library such as FMOD. For holding longer notes, a loop point within each sound sample could be set to allow arbitrarily long, yet fluid, note samples to be generated.

A Multi-Class Pattern Recognition System for Practical Finger Spelling Translation (Hernandez-Rebollar 2002)

Summary:
This paper describes a system that recognizes static postures of the sign language alphabet from data collected through dual-axis accelerometers mounted on each of the fingers. The ten-dimensional data (five fingers with two directions of accelerometer sensing each) collected for each letter is reduced to three dimensions: the X-global and Y-global features are extracted from the ten-dimensional data, along with the index finger's height. Plotting distributions of these three features in 3D helps visualize clusters of data. The 3D data are projected onto planes to better decide dividing points for the classifiers. Gestures are identified by a hierarchical three-level classifier.
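To give a feel for a hierarchical three-level classifier over a few reduced features, here is an illustrative sketch. The actual definitions of X-global and Y-global, the thresholds, and the letter groupings are not given above, so everything concrete in this snippet is invented for illustration.

```python
def classify_posture(x_global, y_global, index_height):
    """Illustrative three-level cascade: each level splits on one of the three
    reduced features. The thresholds and the letter groupings are placeholders,
    not the boundaries learned in the paper."""
    if index_height > 0.5:                          # level 1: index finger raised or not
        return "D" if x_global > 0.0 else "L"       # level 2: split on X-global
    return "A" if y_global > 0.0 else "E"           # level 3: split on Y-global

print(classify_posture(x_global=0.3, y_global=-0.1, index_height=0.7))  # -> "D"
```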
Ten posture readings were collected for each letter from each of five volunteers. The posture readings for letters 'J' and 'Z' were collected at the end of the motion for the letter. Once the data has been analyzed and boundaries for the classifiers have been set up, they are programmed onto a microcontroller which is connected to a speech synthesizer so that the letters formed by hand postures can be heard. Twenty-one of the 26 letters had a recognition rate of 100%. The worst recognition rate was 78% for the letter 'U'.

Discussion:
The statistics reported about the accelerometers' resolution and the diagram describing the hierarchical classifier are sufficiently detailed. One unique aspect of this system is that it uses a computer only for offline analysis of data and not during online recognition. The use of a microcontroller for recognition makes the prototype system portable and closer to a system that could actually be used in real life than most of the other systems covered in the literature.

Thursday, April 24, 2008

Hand Tension as a Gesture Segmentation Cue (Harling 1997)

Summary:
The premise of this paper is that hand tension can be used to aid segmentation of a sequence of gestures. The proposed idea is to look for places in time where the hand tension is minimized and treat these minima as places to segment the sequence of gestures. The model used to measure hand tension treats each finger as a rod of fixed length attached to two ideal springs whose tensions are calculated using Hooke's law and summed to produce an estimate of finger tension. Overall hand tension is computed by simply summing the tensions of each finger of the hand.
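The tension model lends itself to a few lines of code. The spring constants and rest angles below are placeholders, and treating each finger as exactly two Hooke's-law springs whose tensions are summed is my reading of the summary.

```python
def finger_tension(joint_angles, rest_angles, k=(1.0, 1.0)):
    """Each finger is modelled as a rod held by two ideal springs; Hooke's law
    gives each spring tension as k * displacement from its rest angle, and the
    finger tension is their sum (spring constants here are placeholders)."""
    return sum(ki * abs(angle - rest)
               for ki, angle, rest in zip(k, joint_angles, rest_angles))

def hand_tension(fingers):
    """Overall hand tension is simply the sum of the per-finger tensions."""
    return sum(finger_tension(angles, rest) for angles, rest in fingers)

# Example: a relaxed hand vs. a clenched fist (angles in degrees, rest ~ relaxed pose).
relaxed = [((10, 15), (10, 15))] * 5            # (measured joints, rest joints) per finger
fist = [((85, 90), (10, 15))] * 5
print(hand_tension(relaxed), hand_tension(fist))   # tension rises toward the fist
```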
Gestures are split into four classifications, combinations of static and dynamic hand postures and hand locations. An experiment using a Power Glove to collect data from simple sequences of sign language gestures was performed. The Power Glove's limitation of only 4 levels of measurable tension and the study size of only one user contributed to the lack of robustness of the data used in the experiment.

Discussion:
The definition of posture and gesture are distinguished by the exclusion or inclusion of hand location and orientation information. I think these definitions are influenced by the hardware used for gathering data.
A solution to the segmentation problem posed by a referenced paper is to hold a posture for one minute before it is recognized. This is quite a while in computer time, and seems unnecessarily long.
The hand tension segmentation algorithm does not attempt to recognize a posture until after a minimum value of hand tension that follows a possible gesture is identified. This enforced delay in recognition may seem a bit slow to the user and pushes the response time just outside the real-time range.
The hand tension model described cannot correctly segment gestures in which individual fingers increase and decrease tension at the same rate since the finger tensions are summed and there would be no drop in overall tension.
The calculation of hand tension doesn't account for tension in the palm of the hand, which is an important part of overall hand tension. The hand tension model could be improved by considering the direction of tension between adjacent fingers. Adjacent fingers whose tension lies in opposite directions should increase overall tension.

Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration (Poddar 1998)

Summary:
This paper examined gestures in the natural domain of weather newscasting. The authors describe a natural human computer interface as including a gesture recognition module, a speech recognition module, and a display for providing audio and visual feedback.
An HMM-based gesture recognition system was used to analyze video of five weather persons. The features used in the gesture recognition were extracted from the video using Kalman filtering and color segmentation. Specifically, the distance and the radial and angular velocities of each hand with respect to the head were used to describe hand motion. Gestures were classified into three main categories: pointing, area, and contour. Gestures were also separated into phases of preparation, retraction, and the actual stroke. This led to the choice of left-to-right causal models with 3 states for the preparation, retraction, and point HMMs and 4 states for the contour and rest HMMs.
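A sketch of extracting the named hand-motion features (distance and radial/angular velocity of a hand with respect to the head) from tracked 2D positions; the frame rate and coordinate conventions are assumptions on my part.

```python
import numpy as np

def hand_features(hand_xy, head_xy, fps=30.0):
    """hand_xy, head_xy: (T, 2) tracked image positions.
    Returns per-frame (distance, radial velocity, angular velocity) of the hand
    with respect to the head - the features fed to the gesture HMMs."""
    rel = hand_xy - head_xy
    r = np.linalg.norm(rel, axis=1)
    theta = np.unwrap(np.arctan2(rel[:, 1], rel[:, 0]))
    dr = np.gradient(r) * fps          # radial velocity
    dtheta = np.gradient(theta) * fps  # angular velocity
    return np.column_stack([r, dr, dtheta])

T = 90
head = np.tile([160.0, 60.0], (T, 1))
hand = head + np.column_stack([40 + 30 * np.linspace(0, 1, T), 80 * np.ones(T)])
print(hand_features(hand, head)[:2])
```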
A study was done to determine the co-occurrence of spoken words with specific gestures. When the results of the co-occurrence analysis were applied to the data, recognition rates were higher for three of the four video sequences examined. However, recognition rates remained somewhat low overall; the highest was 75% accuracy.

Discussion:
Since the data comes from a weather newscasting environment, the background filtering has the potential to be much simpler. Instead of using the video input of the composited image with the weather map displayed in the background, the raw video feed of the newscaster in front of the blue or green screen could be used. The color-based filtering algorithm would have a much easier job since the static, singly colored background can be easily filtered out.
The paper mentions a probability that could be interpreted as the weather person's handedness. I don't think handedness would affect the hand used for a gesture as much as where the weather person happened to be standing in relation to the portion of the map being discussed at the time of the gesture or which hand held the clicker that advances the background image to the next video feed.

Tuesday, April 22, 2008

3D Object Modeling Using Spatial and Pictographic Gestures (Nishino 1998)

Summary:
The paper suggests creating an "image externalization loop", which is just a fancy way of saying they provide visual feedback of a virtual object as it is being manipulated. Instead of representing and manipulating objects at the vertex or polygonal level, mathematical expressions of the 3D objects were created which take deformation parameters which can be altered through gestures. The system is implemented in C, using the OpenGL library for rendering graphics. Finger joint angles are read by two CyberGloves and fed into a static posture recognizer to classify hand shape. Position and orientation data for both hands are read by a Polhemus tracker and sent to a dynamic gesture recognizer. The recognized hand shape determines what operation to perform, while the movement is used to determine how to adjust the deformation parameters. Segmentation is performed based on the assumption that static hand posture remains generally fixed during motion of the hands. The left hand is used as a frame of reference for the motion of the right hand, which scales or rotates objects. It seems only three static gestures were recognized: an open hand for "deform", a closed fist for "grasp", and a pointing index finger for "point".
A gesture learning and recognition function, called the Two-Handed dynamic Gesture environment Shell (TGSH) utilizes a self-organizing feature map algorithm to allow users to specify their preferred gestures.
One experiment involved recreating virtual objects to match real ones. The dimensions of real objects were scanned using Minolta's Vivid 700 3D digitizer. Users were able to recreate the models using the system.

Discussion:
The amount of data required to store an object by its deformation parameters is greatly reduced when compared to polygonal representations (factors of 1:670 and 1:970 were given for two example objects). The idea of using the deformation parameters that describe the objects as searchable categories in a database of 3D objects is interesting.
The authors identified the trade-off between quality of the rendering and interactivity. The number of polygons was limited to allow sufficient drawing rates and interactive responsiveness. With today's technology (10 years of improvements), I doubt the restrictions of 8,000 and 32,000 polygons per object would be as strict.
Since implicit functions are used to model objects, the collision detection can be computed easily, by simply evaluating the function at a location in space and comparing the result to a constant. This is much easier than performing collision detection with polygonal based models.
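For illustration, here is the standard superellipsoid inside-outside function and the point test it enables; the blended implicit functions the system actually uses may differ in detail.

```python
def superellipsoid_inside_outside(x, y, z, a=(1.0, 1.0, 1.0), e1=1.0, e2=1.0):
    """Standard superellipsoid inside-outside function: < 1 inside the surface,
    = 1 on it, > 1 outside. Collision testing reduces to evaluating this at a
    point and comparing against the constant 1."""
    fx = abs(x / a[0]) ** (2.0 / e2)
    fy = abs(y / a[1]) ** (2.0 / e2)
    fz = abs(z / a[2]) ** (2.0 / e1)
    return (fx + fy) ** (e2 / e1) + fz

# e1 = e2 = 1 gives an ordinary ellipsoid; the origin is inside, (2, 0, 0) is outside.
print(superellipsoid_inside_outside(0.0, 0.0, 0.0) < 1.0,
      superellipsoid_inside_outside(2.0, 0.0, 0.0) < 1.0)
```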
The fact that the blending function used produces G2 continuous surfaces is important since it allows reflections off the surface, including highlights, to be displayed with G1 continuity. Curves with less than G1 continuity do not look smooth and are easily detectable by humans.
The system only implements one of the superquadrics, the ellipsoid. There are three other superquadrics, namely hyperboloids of one and two sheets, and the toroid. Adding these additional primitives would be a natural extension for the existing system.

Sunday, April 20, 2008

Feature selection for grasp recognition from optical markers (Chang 2007)

Summary:
The goal of this paper is to create a system which can learn how to automatically create grasps for a robot manipulator, given examples from a human demonstrator. The positions of a set of 3D markers placed on the back of the hand are used as an orientation-independent set of characteristics which are used to classify hand poses into one of six categories of grasps. Since a linear logistic regression classifier is sufficient to predict grasps, one goal of the paper is to reduce the number of markers while retaining recognition rates.
To bypass considering the exponential number of possible feature sets, two greedy sequential wrapper methods were used to evaluate the addition or removal of single features. The wrapper methods' goal was to achieve local optimization by considering which single feature to add or remove while retaining the highest recognition rate. To achieve hand orientation independence, three of the 3D markers were attached to a rigid part of the back of the hand and used to orient all the other marker positions.
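A generic sketch of the greedy forward-selection wrapper. The scoring function is left abstract (in the paper it would be the accuracy of the linear logistic regression classifier on the candidate marker set), and the toy score here is invented.

```python
def forward_select(all_features, score, max_features=5):
    """Greedily add the single feature that most improves the wrapper score
    (e.g. cross-validated accuracy of a classifier restricted to the chosen set)."""
    chosen = []
    while len(chosen) < max_features:
        candidates = [f for f in all_features if f not in chosen]
        best = max(candidates, key=lambda f: score(chosen + [f]))
        if score(chosen + [best]) <= score(chosen):
            break                       # no single feature helps any more
        chosen.append(best)
    return chosen

# Toy score: pretend markers 3, 7 and 12 carry most of the grasp information.
useful = {3: 0.4, 7: 0.3, 12: 0.2}
score = lambda subset: sum(useful.get(f, 0.01) for f in subset)
print(forward_select(list(range(30)), score))   # picks 3, 7, 12 first
```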
A data set pairing grasp type with marker positions for grasps of 46 objects was collected from a total of three subjects. Using all 30 markers resulted in an accuracy of 91.5%, while using only five resulted in an accuracy of 86%. The set of markers selected by the backward selection differed from the set of markers selected by forward selection, and had a higher recognition rate.

Discussion:
I thought using the ratio of the highest class probability to the second highest class probability as a measure of confidence was a good idea.
I think the choice to initially use 30 markers was artificially high. The experimenters probably expected a significant decrease in marker count without a great loss in grasp recognition accuracy, simply because they chose a high number of markers.

Wednesday, April 16, 2008

Glove-TalkII - A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls (Fels 1998)

Summary:
The Glove-TalkII system maps hand gestures to control parameters of a formant speech synthesizer using an adaptive interface based on neural networks. The right hand's position, orientation, and finger joints are sampled frequently to control the synthesizer parameters. When producing vowels, hand position determines what type of vowel sound to make. When producing consonants, the hand posture determines what consonant sound to make. The left hand wears a ContactGlove which measures nine points of contact and is used to signal stop consonants. A foot pedal controls overall volume of the system while hand height controls speech pitch.
Three neural networks were established to determine what sound to produce. The Vowel/Consonant network is a 10-5-1 feedforward network with sigmoid activations which determines whether the sound to be emitted should be a vowel or a consonant. The Vowel network is a 2-11-8 feedforward network which determines what vowel sound to produce, while the Consonant network is a 10-14-9 feedforward network that determines what consonant sound to produce. The 11 and 14 hidden units of these networks are normalized radial basis functions that correspond to one of 11 cardinal vowels and 14 static consonant phonemes, respectively. The 8 and 9 output units determine parameters to the formant synthesizer. The additional output of the consonant network determines the voicing parameter.
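A sketch of the smallest of the three networks, the 10-5-1 Vowel/Consonant decision network with sigmoid units. The weights are random stand-ins (the real network is trained), and reading the single output as the probability of a vowel is my interpretation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VowelConsonantNet:
    """10-5-1 feedforward network with sigmoid units: ten glove inputs in, one
    output read here as the probability that the current sound is a vowel."""
    def __init__(self, rng):
        self.w1 = rng.normal(scale=0.5, size=(10, 5))
        self.b1 = np.zeros(5)
        self.w2 = rng.normal(scale=0.5, size=(5, 1))
        self.b2 = np.zeros(1)

    def forward(self, glove_inputs):
        hidden = sigmoid(glove_inputs @ self.w1 + self.b1)   # 5 hidden sigmoid units
        output = sigmoid(hidden @ self.w2 + self.b2)         # single output unit
        return output[0]                                     # > 0.5 -> route to the Vowel net

net = VowelConsonantNet(np.random.default_rng(0))
print(net.forward(np.random.default_rng(1).normal(size=10)))
```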

Discussion:
Only one person has been trained to use the system well enough to speak intelligibly. The subject was an accomplished piano player and trained for over 100 hours. Even so, the speed of speech is about 1.5 to 3 times slower than average. With such a time commitment and skill level as a barrier to using the system, I don't think it is likely to ever become practical.
The concept of stop consonants seemed to be introduced to help resolve the problem of sounds whose dynamics were too fast to be controlled by the user. This reminded me of a problem I am having with my final project, which uses a Wii remote and a sensor bar to emulate a violin. Creating a smooth-sounding transition between notes is not straightforward. I could try to introduce a special transition sound sample to be played in the midst of transitions that would create a more realistic sound.

Activity Recognition using Visual Tracking and RFID (Krahnstoever 2005)

Summary:
The paper combines vision based tracking with RFID sensors to gain more information about tasks being performed. A minimum of two color cameras track the position of the head and hands of a subject by using skin color matching techniques. The subject's state is modeled by three ellipsoids whose centers are represented in spherical coordinates. The volume in which tracking takes place is divided into three parts, one for the head and one for each hand, to reduce the number of particles needed for tracking. The presence and orientation of key objects are tracked using RFID tags. To fully determine an object's orientation, three antennas must be attached to an RFID tag.
The information from visual tracking and RFID tags works together to identify activities of the subject. For instance, the presence of a user's hand in the emitter field of an object and the movement of an RFID-tagged object can imply that the user has picked up and is holding that object.

Discussion:
RFID-tagged objects are much easier to recognize than they would be using visual appearance alone. The system is able to track objects that would be occluded using visual information alone. One of the interesting applications suggested involved multiple RFID tags per item. If one were placed on the lid and one on the container, the action of opening the container could be detected.
When computing the likelihood of a pixel belonging either to a body part in the foreground of an image or to the background, each pixel in the projection of the union of the ellipsoids tracking the head and hands must be evaluated. Instead of performing collision detection on ellipsoidal volumes, rectangular bounding box volumes are used since their collisions are simpler to compute. This technique of reducing the computational complexity by simplifying collision detection boundaries is frequently used in the field of physically based modeling, and quite appropriate to use in this case.
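The simplification mentioned here boils down to replacing an ellipsoid-ellipsoid test with an axis-aligned box overlap test, which is just a few comparisons per axis; a generic sketch follows.

```python
def boxes_overlap(box_a, box_b):
    """Axis-aligned bounding-box overlap test: each box is ((xmin, ymin, zmin),
    (xmax, ymax, zmax)); boxes overlap iff their extents overlap on every axis."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    return all(a_min[i] <= b_max[i] and b_min[i] <= a_max[i] for i in range(3))

head_box = ((-0.1, -0.1, 1.5), (0.1, 0.1, 1.8))
hand_box = ((0.05, -0.2, 1.4), (0.3, 0.0, 1.6))
print(boxes_overlap(head_box, hand_box))   # True: cheap pre-check before finer tests
```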

Tuesday, April 15, 2008

Online, Interactive Learning of Gestures for Human-Robot Interfaces (Lee 1996)

Summary:
This paper's goal was online teaching of new gestures with few examples, utilizing discrete HMMs to model each gesture. The system uses a CyberGlove to recognize letters from the sign language alphabet. Input data is sampled from 18 sensors at a rate of 10 Hz and then reduced to a one-dimensional sequence of symbols using vector quantization. The vectors are encoded as symbols from a codebook, which is a set of vectors representative of the domain of vectors to be encoded. The encoded vectors are sent to Bakis HMMs, and the gesture whose probability is above a certain threshold, and not too near that of other gestures, is chosen as the one recognized. Segmentation is handled simply, by requiring the operator to hold the hand still for a short time between gestures. Only 14 letters were used, and the letters that were chosen are relatively easy to distinguish from each other compared to all possible sets of 14 letters. After two training examples, two experiments yielded 1% and 2.4% classification errors.
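A minimal sketch of the vector-quantization step: each 18-sensor reading is replaced by the index of its nearest codebook vector, producing the one-dimensional symbol sequence the HMMs consume. The codebook here is random; in the real system it is built offline.

```python
import numpy as np

def quantize(samples, codebook):
    """Map each sensor vector to the index of its nearest codebook entry,
    turning a multi-dimensional glove stream into a 1-D symbol sequence."""
    dists = np.linalg.norm(samples[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 18))        # 32 representative 18-sensor vectors
glove_stream = rng.normal(size=(100, 18))   # 10 Hz CyberGlove samples (stand-in)
symbols = quantize(glove_stream, codebook)  # symbol sequence fed to the HMMs
print(symbols[:10])
```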

Discussion:
In the section describing the interactive training, the condition of the system being "certain" about a classification of a gesture is described as being handled by performing the associated action. The specifics of being "certain" are not given, but could have been reported as a threshold probability of observed data matching the sequence of observations describing a gesture.
At the end of the interactive training section, the scenario of specifying a new "halt" action is outlined. The details of how the user specifies the label and action associated with a gesture recognition are not given. This guidance required by the user seemed to raise the difficulty level for interacting with the system.
The choice of Bakis HMMs does work for the gesture set of basically static ASL letters, but would be poor for classifying repetitive motions such as waving. This system could not be generalized to recognize all types of gestures since the states of Bakis HMMs are restricted to moving forward.
Since the codebook is generated offline before recognition begins, is the correlation of the codebook vectors to observed vectors decreased when new gestures are being learned? Teaching new gestures will probably change the domain of vectors to be encoded. This should probably alter the codebook, but it does not.

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas (Kim 2007)

Summary:
The goal of the paper is to enable robots to track a target in a real environment containing obstacles without a predefined map. A direction finding RFID reader uses a dual-direction antenna which determines orientation to an RF emitter by comparing the strengths of RF signals on two orthogonal spiral antennas. The strength of the RF signal is proportional to the antenna's orientation to, and the distance from, the transponder. The RFID system is based on commercial active sensor nodes produced by Ymatic Limited and attached to a Pioneer-3DX mobile robot.
A couple of different experiments were run, mapping the ratio of the strength of the RF signals to relative positions of the RF reader and transponder for different paths with different obstacles in the environment. The general trend of the signal ratio strength could be predicted using equations from classical electromagnetic theory, but the graphs were fairly noisy in practice. A different set of experiments involved having the robot with the dual sensing antennas follow a target robot which could move about. The pictures from these experiments show that the tracking works, although the path of the following robot does not adhere closely to the path of the leading robot.

Discussion:
One major advantage of the RFID technology is its lack of dependence on line of sight. This attribute is important for gesture recognition as well as robot tracking. Supposing accurate location information with sufficient granularity could be obtained from RFID sensors, I think they could be used in collecting gesture data. The small size of RFID tags makes them easier to attach and more convenient to wear in a glove than ultrasonic sensors.
The fact that the directional antenna was rotated independently from the robot suggests that the accuracy of measurements obtained from the RFID sensor would not be great enough for accurate gesture data collection. Since the magnitude of the ratio of the RFID signals varies quite a bit with environmental conditions, it would be difficult to get the kind of quality data needed for gesture recognition.

A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models (Qing 2005)

Summary:
This paper points out that dynamic gestures are stochastic: while repeating the same gesture will result in different measurements per trial, there are statistical properties which can describe the motion. The paper suggests using HMMs to extract defining characteristics from gesture data in order to classify the gestures. An Expectation-Maximization algorithm (Baum-Welch) is used to train the HMMs. The characteristics examined by their HMM algorithm are the standard deviations of the angle variation of each finger.
An experiment examined the recognition of three gestures, named "Great", "Quote", and "Trigger", which were used to control the rotation of a cube about three axes. The GLUT framework was used to create a simple colored cube which could be rendered and rotated using the OpenGL library. Gesture data was sampled at 10 Hz from a CyberGlove and then normalized. The data are quantized and their standard deviation is calculated before being sent to the HMM to be recognized. Each gesture HMM was trained with 10 data sets. An observed gesture sequence is compared to each of the trained gestures, and the one with the highest probability is reported as the most likely match.
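As I read it, the feature is the standard deviation of each finger's frame-to-frame angle variation; here is a short sketch of that computation, with the normalization and quantization steps left out.

```python
import numpy as np

def finger_variation_features(joint_angles):
    """joint_angles: (T, fingers) CyberGlove samples for one gesture.
    Returns, per finger, the standard deviation of the frame-to-frame angle
    variation - the characteristic described above."""
    variation = np.diff(joint_angles, axis=0)       # angle change per frame
    return variation.std(axis=0)

rng = np.random.default_rng(0)
static_posture = rng.normal(0, 0.5, size=(30, 5))                # jittery but static hand
waving_fingers = np.cumsum(rng.normal(0, 2.0, size=(30, 5)), axis=0)
print(finger_variation_features(static_posture))
print(finger_variation_features(waving_fingers))                 # larger deviations
```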

Discussion:
The authors use several abbreviations throughout the paper, and their overuse just adds another step required for translating the meaning of the paper. The overview of HMMs is a bit brief, but does refer the reader to a paper by Rabiner covering a tutorial of HMMs. The choice to use HMMs was supposedly made to be able to recognize dynamic gestures instead of static postures, but the three gestures used in the study can be distinguished without the use of HMMs. The motion of the hands was unnecessary to distinguish the gestures since the position of the hands is different enough to separate them. Also, there was no user study or reporting of recognition accuracy rates.

Saturday, April 5, 2008

Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control (Sawada 1997)

Summary:
The motivation for this paper comes from the idea that gestures performed by humans can be transferred to an electronic instrument to produce an emotional musical performance without incurring the difficulty of the human learning the technique for controlling a specific instrument. The assumption that emotion is better portrayed by the force applied to an object than by the position of the hand on an object is also crucial to the work, which uses three-dimensional acceleration sensors to capture gesture motion. The acceleration sensor data has poor quantitative reproducibility, so instead of using simple pattern matching, global features of the motions are extracted. In the study, the magnitude of the acceleration change in eight principal directions, the rotational direction, the intensity of motion, and the z direction of motion are used as characteristics for gesture recognition.
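One plausible reading of the first feature is accumulating the magnitude of acceleration change into eight 45-degree direction bins; the sketch below does exactly that in the plane, and everything about the binning is my assumption.

```python
import numpy as np

def directional_change_profile(accel_xy):
    """accel_xy: (T, 2) planar acceleration samples.
    Accumulates the magnitude of acceleration change into eight principal
    directions (45-degree bins) - one plausible reading of the feature above."""
    change = np.diff(accel_xy, axis=0)
    magnitude = np.linalg.norm(change, axis=1)
    angle = np.arctan2(change[:, 1], change[:, 0])
    bins = np.round(angle / (np.pi / 4)).astype(int) % 8
    profile = np.zeros(8)
    np.add.at(profile, bins, magnitude)    # sum change magnitude per direction
    return profile

rng = np.random.default_rng(0)
swing = np.column_stack([np.linspace(-1, 1, 40), 0.01 * rng.normal(size=40)])
print(directional_change_profile(swing))   # dominated by the "rightward" bin
```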
A user must go through training the system, which builds up a standard pattern of data, before recognition takes place so thresholds can be set on an individual level. Gesture segmentation is triggered by the intensity of acceleration. Averages and standard deviations of acceleration data are calculated and used as a basis for comparison during gesture recognition. Ten types of gestures are recognized and used to control the performance of MIDI music. The user whose gesture data trained the system was able to perform gestures with a 100% recognition rate, while some of another user's gestures were misclassified.
A tempo prediction model based on the previous two tempos is used to allow for smoother performance of the music than could be achieved by the system waiting for the recognition of a human determined tempo.

Discussion:
The tempo recognition based on change in acceleration magnitude is more accurate than image based methods. This reduction of phase delay in the detection of tempo is important when the control issue is specifically related to timing as it is in music control. Some of the gestures used in the study did not seem naturally related to music composition, such as the star, triangle, and heart shaped motions. The paper also did not describe how gestures were mapped to effects on the music, besides the downward motion being mapped to tempo.

Wednesday, April 2, 2008

Enabling Fast and Effortless Customisation in Accelerometer Based Gesture Interaction (Mantyjarvi 2004)

Summary:
The motivation for this paper is that gesture recognizers should be customizable and easy and quick to train. In order to keep the number of training examples low, original training gestures are augmented with noise-distorted training gestures. A SoapBox (Sensing, Operating, and Activating Peripheral Box) sensing device with a three-axis accelerometer is used for collecting motion data. In the preprocessing phase, gesture data is normalized to equal length and amplitude. The normalized data is converted to one-dimensional prototype vectors using the k-means algorithm with a codebook size of eight. An ergodic HMM was used in lieu of a left-to-right model, despite the left-to-right model's better fit for sequential time-series data, because both models were reported to give similar results according to Hoffman. Also, the choice of five states for each HMM was made to reflect the work done by Hoffman. New training data was generated from original gesture data with random noise added according to either a uniform or Gaussian distribution. The beginning and end of each gesture was signaled with a button press and release, alleviating the problem of segmentation.
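A sketch of generating noise-distorted copies of one recorded gesture at a chosen signal-to-noise ratio. Whether the paper defines SNR on power or amplitude is not stated above, so the power-based definition here is an assumption.

```python
import numpy as np

def noisy_copies(gesture, n_copies, snr, distribution="gaussian", rng=None):
    """Create extra training examples by adding noise to one recorded gesture.
    Noise power is set from the gesture's power and the requested SNR
    (power-based SNR is an assumption). gesture: (T, 3) accelerometer samples."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(gesture ** 2)
    noise_power = signal_power / snr
    copies = []
    for _ in range(n_copies):
        if distribution == "gaussian":
            noise = rng.normal(0.0, np.sqrt(noise_power), size=gesture.shape)
        else:  # uniform with matching power: variance of U(-a, a) is a**2 / 3
            a = np.sqrt(3.0 * noise_power)
            noise = rng.uniform(-a, a, size=gesture.shape)
        copies.append(gesture + noise)
    return copies

original = np.sin(np.linspace(0, 2 * np.pi, 40))[:, None] * np.array([1.0, 0.5, 0.2])
extra = noisy_copies(original, n_copies=3, snr=3, rng=np.random.default_rng(0))
print(len(extra), extra[0].shape)
```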
A single subject performed eight gestures for controlling a DVD player thirty times each. As expected, increasing the number of training vectors increased recognition accuracy, but four training vectors were determined to be sufficient, since performing six repetitions of eight gestures was considered a nuisance to train. In an experiment to test the effects of the signal-to-noise ratio of the random distributions on recognition accuracies, the best accuracy was achieved with Gaussian distributed noise at an SNR of 3, while the best accuracy with uniformly distributed noise was achieved at an SNR of 5. For either type of noise distribution, recognition accuracy rates of over 96% were obtained with just one original and three distorted training examples. The best performance (nearly 98%) requiring fewer than three original vectors came from using two original vectors and two or four Gaussian-distributed noise-distorted vectors.

Discussion:
The results comparing uniform and Gaussian noise are very close, and I do not see a distinct winner. There does not seem to be a clear pattern to which noise distribution performs better, but I do think the addition of noisy copies of training vectors help recognition slightly. Although the paper consistently refers to recognizing 3D gestures, all the example gestures in the study for controlling a DVD player could adequately be defined in two dimensions. In the preprocessing stage, the gesture data is linearly extrapolated or interpolated if the data sequence is too long or short. A different type of extrapolation or interpolation technique besides linear could be used to approximate a better fit to the data. I am not sure that the reliance on Hoffman for the type of HMM and number of states was well founded. The paper states that adding noise increases detectability in decision making under certain conditions, and cites a paper by Kay for justification. It might be interesting to see what those conditions are, and if this paper meets the criteria set by those conditions.