Proposals for a speech-recognition-based input and control system

As a former RSI sufferer, I used Dragon Systems' DragonDictate to control computers, using Exceed or telnet to connect to the Unix systems I manage and use. Experience has shown that this combination has a number of deficiencies which any free software implementation would do well to address; this document is an attempt to describe the features I personally would like to see in such a system.

Speech Recognition

DragonDictate is a discrete speech recognition system, meaning that one must pause between words when dictating. I find this fine for dictating Unix commands, which are by their nature discrete units, but for dictating English text a continuous speech recognition system (where one can speak naturally without pauses) is better. The ideal speech recognition system would be able to cope with both styles of dictation.

The recognition system should adapt to the user's voice over time as they make corrections to any misrecognitions, but one should not have to train the computer in advance to recognise every word.
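
As a rough sketch of what this kind of adaptation implies (all names here are hypothetical, and a real engine would re-estimate acoustic and language models rather than keep a simple table), the system might accumulate corrections and bias later recognition towards them:

    from collections import Counter

    class SpeakerModel:
        def __init__(self):
            # Misheard word -> how often each intended word was meant.
            self.corrections = {}

        def record_correction(self, heard, intended):
            # Called whenever the user corrects a misrecognition.
            self.corrections.setdefault(heard, Counter())[intended] += 1

        def rescore(self, heard, candidates):
            # Prefer candidates this user has corrected "heard" to before.
            seen = self.corrections.get(heard, Counter())
            return sorted(candidates, key=lambda w: seen[w], reverse=True)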

The recognition system should have parameters to account for different sound cards and microphones; it should be possible to use the system's model of the user's voice on more than one computer or set of hardware.
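
One way to arrange this, purely by way of illustration, is a profile split along these lines, with the speaker model kept portable and the calibration data tied to the machine:

    ; Hypothetical profile layout: the speaker model travels with the
    ; user, while calibration data stays with the machine's hardware.
    [speaker]
    model = ~/.speech/speaker-model

    [audio]
    soundcard = sb16
    microphone = headset
    calibration = ~/.speech/calibration/desktop.cal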

Vocabulary

The system must have a large and flexible set of vocabularies, and it should be easy to switch vocabularies in and out of the active set to expand or restrict the possible interpretations of any utterance.
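
A minimal sketch of such an active set, with all names invented for illustration: the recogniser considers only words from vocabularies currently switched in, which keeps the search space small:

    class VocabularyManager:
        def __init__(self):
            self.loaded = {}     # vocabulary name -> set of words
            self.active = set()  # names of currently active vocabularies

        def load(self, name, words):
            self.loaded[name] = set(words)

        def activate(self, name):
            self.active.add(name)

        def deactivate(self, name):
            self.active.discard(name)

        def active_words(self):
            # The union of the active vocabularies is the only set of
            # utterances the recogniser should try to match against.
            words = set()
            for name in self.active:
                words |= self.loaded[name]
            return words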

It should be possible for vocabularies to be speaker-independent, so that users can share and publish lists of words for use in particular subject areas.
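
A published vocabulary could be as simple as a plain text file; the format below is invented purely for illustration, with one entry per line giving the spoken form and the text it produces:

    # unix-commands: a subject-area word list anyone could publish.
    grep            grep
    change-mode     chmod
    pipe            |
    word-count      wc -l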

Interaction With Other Applications

Speech recognition systems seem to behave better when a limited vocabulary is active at any one time. A good default is for each application to have its own vocabulary, activated when that application is started or switched to; the user can then add application-specific items to it.
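
Building on the vocabulary manager sketched earlier (the application names and the focus hook are invented), the switching might be driven by window focus events:

    app_vocabularies = {
        "xterm": "unix-commands",
        "netscape": "web-browsing",
    }
    current = None

    def on_focus_change(manager, app_name):
        # Swap the old application's vocabulary out and the new one in.
        global current
        if current is not None:
            manager.deactivate(current)
        current = app_vocabularies.get(app_name, "general")
        manager.activate(current)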

Graphical applications should be controllable by voice. Menu items, buttons, and the like should be operable by speaking their names; perhaps the application's window hierarchy could be scanned for such components when the window is mapped. Dragon calls this behaviour "tracking". Applications that expose their workings via CORBA should be controllable through that interface.
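
A sketch of what such a scan might look like; the Widget class stands in for whatever the toolkit really provides, and everything here is hypothetical:

    class Widget:
        # Stand-in for a real toolkit component.
        def __init__(self, label=None, children=(), activate=lambda: None):
            self.label = label
            self.children = list(children)
            self.activate = activate

    def speakable_commands(root):
        # Walk the widget tree and map each component's label to the
        # action that operates it, so saying the label triggers it.
        commands = {}
        stack = [root]
        while stack:
            widget = stack.pop()
            if widget.label:
                commands[widget.label.lower()] = widget.activate
            stack.extend(widget.children)
        return commands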

It is tedious for the user to have to switch vocabularies manually when entering text into, say, the address box of a web browser. A dictation vocabulary should be able to be activated while the user is "in" a particular text entry component.
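
In terms of the earlier sketches, this could be a pair of focus callbacks (hypothetical, as before) that switch a dictation vocabulary in and out:

    def on_text_entry_focus_in(manager):
        # The user is "in" a text field: allow free dictation.
        manager.activate("dictation")

    def on_text_entry_focus_out(manager):
        manager.deactivate("dictation")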

Applications should be able to communicate information about their state to the recognition system so that it can activate vocabularies appropriate to that state. Since applications to be controlled by voice may be running on different machines (either via terminal sessions or remote X connections), this mechanism needs to work across the network.
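
By way of illustration only, the mechanism could be as simple as a line of JSON sent over a TCP connection; the message format and port number here are invented:

    import json
    import socket

    def report_state(app, state, host="localhost", port=7771):
        # Tell the recogniser (possibly on another machine) what state
        # this application is now in; the port number is made up.
        message = json.dumps({"app": app, "state": state}) + "\n"
        with socket.create_connection((host, port)) as sock:
            sock.sendall(message.encode("utf-8"))

    # e.g. an editor announcing that its minibuffer has focus:
    # report_state("emacs", "minibuffer")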

No application should be expected to know about the speech recognition system, nor should the recognition system be expected to know about all applications. A system of hooks, or some similar mechanism, is needed so that the two can be glued together without either depending directly on the other.
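
A minimal sketch of such a hook table, with all names invented: adapters register callbacks for the events they care about, and neither side needs to know about the other directly:

    hooks = {}

    def add_hook(event, callback):
        hooks.setdefault(event, []).append(callback)

    def run_hooks(event, *args):
        # The recognition system fires events; whoever registered an
        # interest gets called, and neither side names the other.
        for callback in hooks.get(event, []):
            callback(*args)

    # An adapter for one application type might register:
    # add_hook("window-mapped", scan_widget_tree)
    # add_hook("focus-changed", on_focus_change)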

Scripting and Macros

The speech recognition system should at least have a flexible and extensible macro language, and preferably be controllable from the programming language of the user's choice.
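
For instance, if the macro language were an ordinary programming language, a user's configuration might bind spoken phrases to arbitrary code along these lines (the decorator and dispatch function are hypothetical):

    import subprocess

    macros = {}

    def voice_macro(phrase):
        # Decorator binding a spoken phrase to arbitrary user code.
        def register(func):
            macros[phrase] = func
            return func
        return register

    @voice_macro("show disk space")
    def show_disk_space():
        subprocess.run(["df", "-h"])

    def on_utterance(phrase):
        # Called by the recogniser when a whole phrase is recognised.
        if phrase in macros:
            macros[phrase]()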