On LinkedIn, someone posted a link to an example of Unity’s speech recognition API for Windows 10. It sounded simple to set up, so I decided to try it out. I’d done some basic research a while back to see if speech input might be a decent alternative to the text parser input in Cascade Quest.
With Unity 5.5 and Windows 10, it was just a matter of minutes before I had something very basic working.
With the Unity APIs, there are basically three ways to configure speech input:
- A dictation recognizer, which can recognize general phrases. This requires internet connectivity though, so it is probably not suitable for the quick responses needed for a game.
- A keyword recognizer, which simply recognizes single words (or series of words) from a fixed list. This is also not very useful for Cascade Quest, since I need to be able to recognize natural language, like the text parser can.
- A grammar recognizer, which is provided a list of grammar rules to base its output on. This is really the only viable solution for me.
Unfortunately the grammar needs to be in the form of an SRGS XML file. This isn’t ideal, since I’m constructing the grammar (including the words to be used) from data available at runtime.
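To give an idea of what generating the file from runtime data involves, here is a minimal sketch in Python (not the game's actual code). The rule names `command`, `verb` and `noun`, and the verb-plus-optional-noun shape, are assumptions for illustration only; the real text parser grammar is more involved.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

def word_rule(grammar, rule_id, words):
    """Add a rule that matches any one of the given words."""
    rule = ET.SubElement(grammar, "rule", {"id": rule_id})
    one_of = ET.SubElement(rule, "one-of")
    for word in words:
        ET.SubElement(one_of, "item").text = word
    return rule

def build_command_grammar(verbs, nouns, lang="en-US"):
    """Build SRGS XML for a hypothetical 'verb [noun]' command grammar."""
    grammar = ET.Element("grammar", {
        "xmlns": SRGS_NS,
        "version": "1.0",
        "xml:lang": lang,
        "root": "command",
    })
    # Root rule: one verb, optionally followed by one noun.
    command = ET.SubElement(grammar, "rule", {"id": "command", "scope": "public"})
    ET.SubElement(command, "ruleref", {"uri": "#verb"})
    optional = ET.SubElement(command, "item", {"repeat": "0-1"})
    ET.SubElement(optional, "ruleref", {"uri": "#noun"})
    # Word lists come from whatever data is available at runtime.
    word_rule(grammar, "verb", verbs)
    word_rule(grammar, "noun", nouns)
    body = ET.tostring(grammar, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

if __name__ == "__main__":
    print(build_command_grammar(["look", "take"], ["wolf", "key"]))
```

The resulting string would be written to disk and the file path handed to the recognizer.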
There were two main challenges to overcome:
- Turning the actual grammar rules into SRGS.
- “Cleaning” the word list so it can be used in the grammar.
The text parser grammar was in a form that was challenging to convert to SRGS. To add to the problem, there is no way to debug issues with a broken grammar.
The API takes an XML file, loads it and then either works or doesn’t. It gives no indication if the SRGS is invalid, or if something just isn’t recognized. There are no errors or status (other than “everything is ok, I’m running”). On top of that, if an invalid grammar is loaded, all future attempts with a correctly-functioning grammar will also fail to work (until Unity is closed and restarted). I don’t know if this is a problem with the underlying Windows APIs or how Unity uses them. Either way, it made debugging extremely tedious and (until I figured out what was going on) confusing.
In the end, I had to hand-code a reasonable grammar, increasing the complexity bit by bit and testing at each step that it still worked.
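Since the recognizer reports nothing, one thing that could catch the silliest mistakes before loading a grammar is a small check outside Unity. The sketch below is an assumption about what such a check might look like, not something the Unity API provides: it only verifies that the XML is well-formed and that every local rule reference points at a rule that exists. It cannot tell whether the recognizer will actually accept the grammar.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2001/06/grammar}"

def check_grammar(xml_text):
    """Return a list of problems found; an empty list means none were detected."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    if root.tag != NS + "grammar":
        return [f"unexpected root element: {root.tag}"]

    problems = []
    rule_ids = {rule.get("id") for rule in root.iter(NS + "rule")}

    # The grammar's root attribute, if present, must name a defined rule.
    root_rule = root.get("root")
    if root_rule is not None and root_rule not in rule_ids:
        problems.append(f"root rule '{root_rule}' is not defined")

    # Every local reference (#id) must resolve to a rule in this file.
    for ref in root.iter(NS + "ruleref"):
        uri = ref.get("uri", "")
        if uri.startswith("#") and uri[1:] not in rule_ids:
            problems.append(f"ruleref '{uri}' points at a missing rule")
    return problems
```

Running this on each generated file would at least separate "the XML is broken" from "the recognizer just didn't hear me".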
The next issue is the words, or more specifically the transformations of the words (pluralizing, adding -ing or -ed, and so on). My game has rules for modifying the suffixes of words, but they work from completed text input back to a known word. For example, postmen is converted to postman, which is a known word. Likewise, wolves would be converted to wolf. We check the suffix of the word, and if it is a known plural suffix, we convert it to the singular form and validate that it is a known word.
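Roughly, the existing direction (typed word back to known word) looks like the sketch below. The suffix table is a made-up, much smaller stand-in for the game's real rules, and `to_known_word` is a hypothetical name.

```python
# Hypothetical suffix rules: (inflected suffix, replacement).
# Earlier entries are tried first.
SUFFIX_RULES = [
    ("ves", "f"),    # wolves  -> wolf
    ("men", "man"),  # postmen -> postman
    ("es", ""),
    ("s", ""),
    ("ing", ""),
    ("ed", ""),
]

def to_known_word(word, known_words):
    """Reduce an inflected word to a known word, or return None."""
    if word in known_words:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Only accept the result if the game actually knows the word.
            if candidate in known_words:
                return candidate
    return None
```

Because the result is always checked against the known word list, a wrong guess by a suffix rule is simply discarded.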
The speech recognizer grammar needs to include all variations of a word, but that means going in the other direction: from the singular to the plural (or root to transformed). For “wolf”, we would generate a number of options: wolves, wolfs, wolfes. All these are valid transformations based on our suffix rules. But not all of them are real words. So to validate these transformations, we need to check against a master English word list. I found one online that contained about 500,000 words (including plurals, verb tenses, and such).
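Going the other way can be sketched like this. Again, the rule table and function names are assumptions for illustration, and `english_words` stands in for the master word list loaded into a set.

```python
# Hypothetical forward rules: (base suffix, inflected suffix).
# An empty base suffix means "just append".
PLURAL_RULES = [
    ("f", "ves"),    # wolf    -> wolves
    ("man", "men"),  # postman -> postmen
    ("", "s"),
    ("", "es"),
]

def plural_candidates(word):
    """Apply every matching rule, whether or not the result is a real word."""
    candidates = []
    for suffix, replacement in PLURAL_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            candidates.append(stem + replacement)
    return candidates

def valid_variants(word, english_words):
    """Keep only the candidates that appear in the master word list."""
    return [c for c in plural_candidates(word) if c in english_words]
```

For "wolf" the candidates are wolves, wolfs and wolfes; which of them survive depends entirely on what the master list contains.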
The end result is something that works “ok”. I’m sure it would be a lot more accurate if I could limit the grammar to only the words used in the currently loaded room (or at least weight those words more highly). I could do that, but I would need a different SRGS XML file for each room in the game. Even that isn’t a great solution though, because other pieces of logic in the game (which respond to text parser input) might be loaded or unloaded dynamically.
Anyway, speech recognition as an input method for Cascade Quest is something I’ll keep in my back pocket as a possibility.
How does it actually feel to play? “Ok”. I found it got tiring speaking to the game for very long. Would anyone actually use it? If I get it into a more reliable state, maybe I’ll do some playtesting to find out.