Wednesday, April 2, 2008

Enabling Fast and Effortless Customisation in Accelerometer Based Gesture Interaction (Mantyjarvi 2004)

Summary:
The motivation for this paper is that gesture recognizers should be customizable and quick and easy to train. To keep the number of training examples low, the original training gestures are augmented with noise-distorted copies. A SoapBox (Sensing, Operating, and Activating Peripheral Box) sensing device with a three-axis accelerometer is used to collect motion data. In the preprocessing phase, gesture data is normalized to equal length and amplitude. The normalized data is converted to one-dimensional sequences of prototype vectors using the k-means algorithm with a codebook size of eight. An ergodic HMM was used in lieu of a left-to-right model, even though left-to-right models are tailored to sequential time-series data, because Hoffman reported that both models give similar results. The choice of five states for each HMM also follows Hoffman's work. New training data was generated by adding random noise, drawn from either a uniform or a Gaussian distribution, to the original gesture data. The beginning and end of each gesture were signaled with a button press and release, which sidesteps the problem of segmentation.
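The preprocessing steps above can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the function names, the target length, the peak-based amplitude scaling, and the hand-written codebook are all my own assumptions (in the paper the codebook of eight prototype vectors comes from k-means, which is not reproduced here).

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]  # one (x, y, z) accelerometer reading


def resample_linear(samples: Sequence[Sample], target_len: int) -> List[Sample]:
    """Stretch or shrink a gesture to target_len samples by linear interpolation."""
    if len(samples) < 2 or target_len < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)  # index of the left neighbour
        frac = pos - lo                       # distance from the left neighbour
        a, b = samples[lo], samples[lo + 1]
        out.append(tuple(a[k] + frac * (b[k] - a[k]) for k in range(3)))
    return out


def normalize_amplitude(samples: Sequence[Sample]) -> List[Sample]:
    """Scale so the largest absolute value is 1 (an assumed normalization)."""
    peak = max(abs(v) for s in samples for v in s)
    if peak == 0.0:
        return list(samples)
    return [tuple(v / peak for v in s) for s in samples]


def quantize(samples: Sequence[Sample], codebook: Sequence[Sample]) -> List[int]:
    """Replace each 3-D sample by the index of its nearest codebook vector."""
    def dist2(s: Sample, c: Sample) -> float:
        return sum((s[k] - c[k]) ** 2 for k in range(3))

    return [min(range(len(codebook)), key=lambda j: dist2(s, codebook[j]))
            for s in samples]
```

Calling quantize(normalize_amplitude(resample_linear(gesture, 40)), codebook) would turn a raw gesture into the one-dimensional symbol sequence that a discrete HMM consumes; the length 40 is an arbitrary placeholder.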
A single subject performed eight gestures for controlling a DVD player thirty times each. As expected, increasing the number of training vectors increased recognition accuracy, but four training vectors were deemed sufficient, since performing six repetitions of eight gestures was considered a nuisance for the user. In an experiment testing the effect of the signal-to-noise ratio (SNR) of the added noise on recognition accuracy, the best accuracy with Gaussian distributed noise was achieved at an SNR of 3, while the best accuracy with uniformly distributed noise was achieved at an SNR of 5. For either type of noise distribution, recognition accuracies of over 96% were obtained with just one original and three distorted training examples. The best performance (nearly 98%) requiring fewer than three original vectors came from using two original vectors and two or four vectors distorted with Gaussian distributed noise.
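To make the noise-distortion idea concrete, here is a minimal sketch of generating distorted copies at a chosen SNR. I am assuming that SNR means the ratio of mean signal power to noise variance on a linear scale, which the summary above does not confirm; the function names and the seed are likewise hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, float, float]


def signal_power(samples: Sequence[Sample]) -> float:
    """Mean squared value over all axes and time steps."""
    values = [v for s in samples for v in s]
    return sum(v * v for v in values) / len(values)


def distort(samples: Sequence[Sample], snr: float, kind: str = "gaussian",
            rng: Optional[random.Random] = None) -> List[Sample]:
    """Return a noisy copy whose noise variance is signal_power / snr (assumed definition)."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    rng = rng or random.Random()
    noise_var = signal_power(samples) / snr
    if kind == "gaussian":
        sigma = math.sqrt(noise_var)
        draw = lambda: rng.gauss(0.0, sigma)
    elif kind == "uniform":
        # A uniform distribution on [-a, a] has variance a**2 / 3.
        half_width = math.sqrt(3.0 * noise_var)
        draw = lambda: rng.uniform(-half_width, half_width)
    else:
        raise ValueError("kind must be 'gaussian' or 'uniform'")
    return [tuple(v + draw() for v in s) for s in samples]


def augment(original: Sequence[Sample], n_copies: int, snr: float,
            kind: str = "gaussian", seed: int = 0) -> List[List[Sample]]:
    """One original gesture plus n_copies noise-distorted versions of it."""
    rng = random.Random(seed)
    return [list(original)] + [distort(original, snr, kind, rng)
                               for _ in range(n_copies)]
```

Under these assumptions, augment(gesture, 3, snr=3.0) would give the "one original and three distorted" training set mentioned above, and kind="uniform" with snr=5.0 would correspond to the best uniform setting.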

Discussion:
The results comparing uniform and Gaussian noise are very close, and I do not see a distinct winner. There does not seem to be a clear pattern as to which noise distribution performs better, but I do think the addition of noisy copies of training vectors helps recognition slightly. Although the paper consistently refers to recognizing 3D gestures, all the example gestures in the study for controlling a DVD player could adequately be defined in two dimensions. In the preprocessing stage, the gesture data is linearly extrapolated or interpolated if the data sequence is too long or short. A technique other than linear extrapolation or interpolation might approximate the data better. I am not sure that the reliance on Hoffman for the type of HMM and number of states was well founded. The paper states that adding noise increases detectability in decision making under certain conditions, and cites a paper by Kay for justification. It would be interesting to see what those conditions are, and whether this paper meets them.
