Gesture Recognition - Jace Miller: Glove-TalkII - A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls (Fels 1998)

Summary:
The Glove-TalkII system maps hand gestures to control parameters of a formant speech synthesizer using an adaptive interface based on neural networks. The right hand's position, orientation, and finger joints are sampled frequently to control the synthesizer parameters. When producing vowels, hand position determines what type of vowel sound to make. When producing consonants, the hand posture determines what consonant sound to make. The left hand wears a ContactGlove which measures nine points of contact and is used to signal stop consonants. A foot pedal controls overall volume of the system while hand height controls speech pitch.
Three neural networks determine what sound to produce. The Vowel/Consonant network is a 10-5-1 feedforward network with sigmoid activations that decides whether the sound to be emitted should be a vowel or a consonant. The Vowel network is a 2-11-8 feedforward network that determines which vowel sound to produce, while the Consonant network is a 10-14-9 feedforward network that determines which consonant sound to produce. The hidden units of these two networks are normalized radial basis functions: each of the 11 hidden units in the Vowel network corresponds to one of 11 cardinal vowels, and each of the 14 hidden units in the Consonant network corresponds to one of 14 static consonant phonemes. The 8 and 9 output units set the parameters of the formant synthesizer; the additional output of the Consonant network sets the voicing parameter.
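To make the three network shapes concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the weights and RBF centers are random placeholders rather than trained values, the RBF width of 1.0 and the linear output layers are my own assumptions, and the helper names (matvec, normalized_rbf, vc_net, vowel_net, consonant_net) are made up for illustration. It only shows the layer sizes described above and what "normalized" means for the hidden units: their activations are divided by their sum, so they always add up to 1.

```python
import math
import random

random.seed(0)  # placeholder weights only; nothing here is trained


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_matrix(rows, cols):
    # Hypothetical stand-in for learned weights or RBF centers.
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, inputs):
    # One output per weight row: the dot product of that row with the inputs.
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def normalized_rbf(inputs, centers, width=1.0):
    # Gaussian activation per center, then divide by the sum so they total 1.
    acts = []
    for center in centers:
        dist2 = sum((x - c) ** 2 for x, c in zip(inputs, center))
        acts.append(math.exp(-dist2 / (2.0 * width ** 2)))
    total = sum(acts)
    return [a / total for a in acts]


# Vowel/Consonant network: 10 inputs -> 5 sigmoid hidden units -> 1 sigmoid output.
vc_hidden = random_matrix(5, 10)
vc_output = random_matrix(1, 5)

# Vowel network: 2 inputs -> 11 normalized RBF units -> 8 outputs.
vowel_centers = random_matrix(11, 2)
vowel_output = random_matrix(8, 11)

# Consonant network: 10 inputs -> 14 normalized RBF units -> 9 outputs.
consonant_centers = random_matrix(14, 10)
consonant_output = random_matrix(9, 14)


def vc_net(hand):
    hidden = [sigmoid(z) for z in matvec(vc_hidden, hand)]
    return sigmoid(matvec(vc_output, hidden)[0])


def vowel_net(position):
    # Linear output layer is an assumption made for this sketch.
    return matvec(vowel_output, normalized_rbf(position, vowel_centers))


def consonant_net(hand):
    # Linear output layer is an assumption made for this sketch.
    return matvec(consonant_output, normalized_rbf(hand, consonant_centers))


if __name__ == "__main__":
    hand = [0.0] * 10      # made-up 10-value hand input
    position = [0.0, 0.0]  # made-up 2-value hand position
    print(vc_net(hand))
    print(len(vowel_net(position)), len(consonant_net(hand)))
```

Because the hidden activations are normalized, each output is a weighted average of the per-unit output weights, so a hand position between two centers gives parameters between the two corresponding sounds.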

Discussion:
Only one person has been trained to use the system well enough to speak intelligibly. The subject was an accomplished piano player and trained for over 100 hours. Even so, the resulting speech is about 1.5 to 3 times slower than normal speech. With that much training time and skill required just to use the system, I don't think it is likely to ever become practical.
The concept of stop consonants seems to have been introduced to handle sounds whose dynamics are too fast for the user to control. This reminded me of a problem I am having with my final project, which uses a Wii remote and a sensor bar to emulate a violin. Creating a smooth-sounding transition between notes is not straightforward. I could try introducing a special transition sound sample, played in the middle of each transition, to create a more realistic sound.

Wednesday, April 16, 2008
