That voice in your GPS navigator, the virtual assistant in your smartphone, and the automated responses you get when you dial a company helpline are not real voices. In other words, there is no big database of spoken words that the computer picks from and strings together to create a sentence. They are generated on the fly by the computer, yet they sound so natural, so human, that more often than not they are indistinguishable from the voice of a real person.
Human voices are much more complicated, acoustically, than, say, the barking of a dog or the crash of a cymbal. The variety itself is staggering. Like human faces, no two voices sound exactly alike. Add to that the various inflections and emotions, the stresses on the syllables, the accents. Replicating the nuances of speech is anything but easy. It is quite an achievement that we are able to synthesize the human voice at all, and with such precision.
One of the earliest efforts to produce synthetic speech was made over two hundred years ago, in 1779, by the German-born professor Christian Kratzenstein. Kratzenstein built an apparatus consisting of a number of resonators that were acoustically similar to the human vocal tract and were set in motion by vibrating reeds. His device could produce the five long vowels artificially.
Twelve years later, in 1791, an inventor in Vienna named Wolfgang von Kempelen built a more detailed machine modeled after the various human organs that make speech possible. The machine had a pair of bellows to simulate the lungs, a vibrating reed to act as vocal cords, a leather tube for the vocal tract, two nostrils, and leather tongues and lips. By manipulating the shape of the leather tube and the position of the tongues and lips, von Kempelen was able to produce consonants as well as vowels. Nearly half a century later, Charles Wheatstone constructed an improved version of von Kempelen’s speaking machine that could pronounce most of the consonant sounds and even a couple of full words.
The first device to be considered a true speech synthesizer was the VODER (Voice Operating Demonstrator), developed by Homer Dudley of Bell Labs in the 1930s. It was a rather complicated machine with fourteen piano-like keys, a bar controlled by the wrist, and a foot pedal, all of which the operator manipulated to make the machine speak. It sounded very robotic, like “an alien speaking under water,” as Lisa Guernsey of The New York Times described it.
In fact, the “robotic voice” that we often hear in old science fiction movies and television dramas possibly originated with the VODER. “Once the true voice of the machine had entered the public consciousness, its place and form in fictional portrayal would never be the same,” writes Ben Fino-Radin of Rhizome. “After that day in 1939, we knew specifically how inhuman machined speech should sound.”
The website whatisthevoder.com describes how the VODER worked:
An operator would select one of two basic sounds by using the wrist bar: a buzz tone and a hissing sound. The buzz tone was the building block for vowel sounds and nasal type sounds. The hissing sound was the building block for those sounds associated with consonants.
These sounds were then passed through a bank of filters that were selected by the user by selecting the appropriate keys on the keyboard. These sounds were combined and sent through a loudspeaker. For sounds not replicable by the buzzing or hissing noises, such as “p”, “d”, “j”, and “ch”, additional filters were selectable.
Different words could be combined into different sentences based on the manipulation of keys and sounds. You could even add in different expressions and pitches (controlled by the foot pedal) based on the type of question that is being asked.
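The scheme described above, a buzz or hiss source shaped by a bank of key-controlled filters, can be loosely imitated in a few lines of Python. This is only a rough digital sketch, not a model of the actual analog hardware: the sample rate, the five band centers, the filter sharpness, and the function names (`buzz`, `hiss`, `bandpass`, `voder_frame`) are all assumptions made for illustration, and the band-pass filter is a standard digital biquad swapped in for the VODER's electrical filters. The choice between `buzz` and `hiss` stands in for the wrist bar, `pitch_hz` for the foot pedal, and `key_gains` for how hard each key is pressed.

```python
import math
import random

RATE = 16000  # samples per second (arbitrary choice for this sketch)

# Hypothetical band centers; the real VODER's filter bands are not given here.
BAND_CENTERS_HZ = [250, 500, 1000, 2000, 4000]


def buzz(n, pitch_hz, rate=RATE):
    """Pulse train with one impulse per pitch period (the 'buzz tone')."""
    period = max(1, round(rate / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def hiss(n, seed=0):
    """Uniform white noise in [-1, 1] (the 'hissing sound')."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def bandpass(signal, center_hz, rate=RATE, q=5.0):
    """Second-order (biquad) band-pass filter with unity gain at center_hz."""
    w0 = 2.0 * math.pi * center_hz / rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def voder_frame(source, key_gains, centers=BAND_CENTERS_HZ):
    """Mix the source through each band, scaled by that band's key gain."""
    mixed = [0.0] * len(source)
    for gain, center in zip(key_gains, centers):
        if gain == 0.0:
            continue  # key not pressed: this band contributes nothing
        band = bandpass(source, center)
        for i, sample in enumerate(band):
            mixed[i] += gain * sample
    return mixed


# A tenth of a second of a vowel-like and a consonant-like sound.
vowel_like = voder_frame(buzz(1600, pitch_hz=120), [0.2, 1.0, 0.8, 0.1, 0.0])
fricative_like = voder_frame(hiss(1600), [0.0, 0.0, 0.1, 0.6, 1.0])
```

Even in this toy form, one can see why the machine was hard to play: every sound is a particular combination of source, pitch, and key pressures, and speech requires changing all of them from moment to moment.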
Mrs. Helen Harper, who was the central operator of the VODER during its demonstration at the 1939 New York World’s Fair, gives us an idea of how difficult it was to master the beast.
“For example,” we hear Mrs. Harper speaking in a video, “in producing the word ‘concentration’ on the VODER, I have to form thirteen different sounds in succession and make five up and down movements of the wrist bar and vary the position of the foot pedal from three to five times according to what expression I want the VODER to give the word. And of course, all this must be done with exactly the correct timing.”
It took Harper a year of constant practice before she learned to operate the machine with precision. As many as three hundred girls underwent training to become operators, but fewer than thirty got the skills right.
A skilled operator such as Mrs. Harper could make the VODER speak any language, moo like a cow, or grunt like a pig. She could even make it sing, as demonstrated in the following video.