Speech recognition: a fruit by “any other” name

This case study illustrates one of the many reasons why speech recognition is difficult, both in design and implementation.

The following behavior is very difficult to deliver:

  • “Tell me the name of a fruit that you like.”
  • If we hear Pineapple, Banana, Grape, Raspberry, or Orange, do X
  • If we hear any other fruit, do Y
  • If we hear anything that doesn’t sound enough like the name of a fruit, do Z for Failure

Basket X, by itself, is not very problematic. Each item is given a line for the recognizer to match, explicitly. One can add in alternate spellings or pronunciations to help it interpret what it “hears”, like this:

  • pineapple,pyneappul (do X1)
  • banana (do X2)
  • grape,grapes (do X3)
  • raspberry,razzberry (do X4)
  • orange,orrinj (do X5)

The recognizer receives an utterance (a series of sounds, considered together) from the human, and it scores that sequence against the list of fruit-name pronunciations that it “knows”. It assigns a Confidence Level, 0% up to 100%, for the one or several items on the list that appear to be the best match.

If the Confidence Level is above some assigned threshold, e.g. 45%, the recognizer returns that result as the best guess at what was said. But, if no item has received a score as high as the threshold, the returned value is No Match: the recognizer is not confident enough that the utterance matched anything on the list it is listening for. No Match? Do Z for Failure.
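The thresholding step can be sketched in a few lines of Python. This is only a toy under stated assumptions: a real recognizer scores acoustic features, whereas here `difflib.SequenceMatcher` text similarity stands in for the Confidence Level, and the names `GRAMMAR_X`, `score`, and `recognize` are invented for illustration. The 45% threshold is the example value from the text.

```python
from difflib import SequenceMatcher

# Basket X: every spelling/pronunciation variant maps to its behavior.
GRAMMAR_X = {
    "pineapple": "X1", "pyneappul": "X1",
    "banana": "X2",
    "grape": "X3", "grapes": "X3",
    "raspberry": "X4", "razzberry": "X4",
    "orange": "X5", "orrinj": "X5",
}

THRESHOLD = 0.45  # example threshold from the text


def score(heard: str, entry: str) -> float:
    """Toy Confidence Level in [0, 1]. Assumption: text similarity, not acoustics."""
    return SequenceMatcher(None, heard, entry).ratio()


def recognize(heard, grammar, threshold=THRESHOLD):
    """Return (behavior, matched_entry, confidence); behavior is "Z" on No Match."""
    best = max(grammar, key=lambda entry: score(heard, entry))
    confidence = score(heard, best)
    if confidence < threshold:
        return "Z", None, confidence  # nothing on the list scored high enough
    return grammar[best], best, confidence


print(recognize("razzberry", GRAMMAR_X))  # ('X4', 'razzberry', 1.0)
print(recognize("qqqq", GRAMMAR_X))       # ('Z', None, 0.0)
```

Note that the function can only ever answer with an entry from the list or with Z. Nothing in it knows what a fruit is.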

So far, none of this is difficult. We always get back X1, X2, X3, X4, X5, or Z (No Match).

The system is ruined by the addition of the “any other fruit” basket, for behavior Y. The recognizer cannot know how to judge Y apart from Failure Z, unless it is given a comprehensive list of all the elements of Y that it might positively match. We must add all of these items:

  • kiwi (do Y1)
  • kumquat (do Y2)
  • watermelon (do Y3)
  • strawberry (do Y4)
  • tangelo,tanjelo (do Y5)
  • mango (do Y6)
  • black raspberry (do Y7)
  • apricot (do Y8)
  • peach (do Y9)
  • apple (do Y10)
  • …, ad infinitum, (do Yn)

Suddenly, our list has grown from five easily-distinguishable items into a much longer list, merely by trying to add one “all other fruits” basket.

  • Speech recognition works only on positive matches of known values. The recognizer cannot judge by any external attributes whether it sounded like the name of some fruit, through the meaning of the word. It does not even hear words. It hears only a sequence of meaningless sounds that it tries to match to a known list.
  • The recognizer has no way to know whether “watermelon” or “water moccasin” is a fruit, nor any way to distinguish the two from one another, unless one or both of them are on the list of sounds it is trying to match.
  • Again, we can’t build basket Y unless we have a complete list of all the individual fruits that should be in it.

With the addition of basket Y, the whole system begins to deliver results that humans would consider unacceptable failures:

  • The more one tries to “teach” all possible fruits to the recognizer, the worse it gets at distinguishing any of them.
  • It becomes more difficult to say anything that will reliably return the No Match, Z, because we might accidentally hit something on the long list. If the human really said “beach”, a non-fruit, it is clearly not like anything in basket X; but, it hits too confidently on “peach” in basket Y, so we go to Y instead of Z.
  • The longer the list is, {X1, X2, X3, …, Xm, Y1, Y2, Y3, …, Yn}, the greater the chance that the recognizer will occasionally mis-recognize some ordinary things, because there are too many similar-sounding items to judge.
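The “beach” versus “peach” effect can be reproduced with a toy matcher. As a loud assumption, `difflib.SequenceMatcher` text similarity stands in for acoustic scoring here, and `GRAMMAR_X`, `GRAMMAR_Y`, and `recognize` are hypothetical names. Only the ten Y fruits actually listed in the article are included.

```python
from difflib import SequenceMatcher

GRAMMAR_X = {"pineapple": "X1", "banana": "X2", "grape": "X3",
             "raspberry": "X4", "orange": "X5"}
GRAMMAR_Y = {"kiwi": "Y1", "kumquat": "Y2", "watermelon": "Y3",
             "strawberry": "Y4", "tangelo": "Y5", "mango": "Y6",
             "black raspberry": "Y7", "apricot": "Y8", "peach": "Y9",
             "apple": "Y10"}


def recognize(heard, grammar, threshold=0.45):
    """Return the behavior of the best-scoring entry, or "Z" below the threshold."""
    # Assumption: string similarity as a stand-in for a real Confidence Level.
    scores = {entry: SequenceMatcher(None, heard, entry).ratio()
              for entry in grammar}
    best = max(scores, key=scores.get)
    return grammar[best] if scores[best] >= threshold else "Z"


# With basket X alone, the non-fruit "beach" is correctly rejected.
print(recognize("beach", GRAMMAR_X))                    # Z
# Once basket Y is on the list, "beach" lands on "peach".
print(recognize("beach", {**GRAMMAR_X, **GRAMMAR_Y}))   # Y9
```

The matcher itself did not change between the two calls. Only the list grew, and with it the number of things a stray utterance can collide with.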

Proper testing of the system becomes forbiddingly difficult, as well:

  • To ensure the system’s accuracy, all of that list for behavior Y needs to be tested individually (kiwi, kumquat, watermelon, strawberry, tangelo, mango, black raspberry, apricot, peach, apple, … [hundreds of them]), to be sure that each known but unwanted fruit hits basket Y instead of the general Failure, Z.
  • The fruits that we really care about most, in basket X, lose some of their valid hits whenever the recognizer “hears” something in basket Y that scores higher than the item in basket X that the human really said. Perhaps the human said “pineapple”, and the system should have returned basket X, but part of the sound got cut off in transmission. Unknown to the human, the system heard only “-apple”, it found a match for “apple” in basket Y (with higher Confidence Level than “pineapple” in basket X), and returned an unexpected behavior. Stupid computer! I said “pineapple”, which doesn’t resemble an apple in any way! “I’m sorry, we don’t have that fruit today.” Huh? When did they run out of pineapple? (It’s not telling me that it really heard “apple”….)
  • We must also ensure that an utterance intended for basket Y does not generate mistaken hits into basket X! Let’s see: I really said “black raspberry”, but the system heard only the “raspberry” part, and it acted accordingly. Meanwhile, I as an intelligent human am absolutely certain that I actually said “black raspberry”. Furthermore, I am certain that I intended to say “black raspberry”, and that I really mean “black raspberry”, not “raspberry”. How could the computer not recognize my intentions or my meaning? It seemed human enough, in the other interactions we have had during this session…. My mental model of the intelligent computer crashes, suddenly. Why did the computer suddenly become incompetent at understanding me?
  • An attempt to improve the sensitivity of “pineapple” vs “apple” (or “black raspberry” vs “raspberry”) might not work, because we cannot predict or reproduce the transmission dropouts or noise that affected only that single experimental trial. The “pineapple” and “black raspberry” test cases certainly made the system seem broken, yes, but that was only one hit, each. It just happened to be on the tester’s first and only trial, setting the (perhaps mistaken) expectation that the whole system is not yet adequately accurate. Who wants to test the system 1000 times, with a representative set of humans and environmental conditions all properly controlled, just to be able to determine the proper experimental percentage of accuracy?
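One cheap check that can be run before any live testing is to list the cross-basket pairs where one entry is the tail of the other, since those are the pairs a clipped utterance can fall into. The sketch below is an assumption-laden simplification: it compares spellings rather than pronunciations, and `truncation_risks` is a made-up helper name.

```python
from itertools import product

BASKET_X = ["pineapple", "banana", "grape", "raspberry", "orange"]
BASKET_Y = ["kiwi", "kumquat", "watermelon", "strawberry", "tangelo",
            "mango", "black raspberry", "apricot", "peach", "apple"]


def truncation_risks(basket_x, basket_y):
    """Return (x, y) pairs where one entry ends with the other.

    If the start of the longer entry is dropped in transmission, the
    recognizer may hear exactly the shorter one.
    """
    return [(x, y) for x, y in product(basket_x, basket_y)
            if x != y and (x.endswith(y) or y.endswith(x))]


print(truncation_risks(BASKET_X, BASKET_Y))
# [('pineapple', 'apple'), ('raspberry', 'black raspberry')]
```

Both of the failures described above show up in this list, and neither pair would exist if basket Y had never been added.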


I would treat this no differently from any other speech recognition interaction. A speech application has to deal with what users actually say, so the grammar should be built so that it covers the most likely user responses (including X and Y fruits, as well as other likely responses), with the possible addition of grammar weights to favor the most likely fruits.

As far as testing goes, I see no point in trying to individually test every single possible fruit in every possible condition since this may not be representative of what actually happens in practice. I would treat this as a tuning problem, where our “test” is based on a representative sample of actual user utterances (a few thousand, say) collected from the real application. This ensures that we optimize for what users actually say and what actually happens, not some artificially fabricated test.
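To put rough numbers on why sample size matters, here is a sketch of the Wilson score interval for an accuracy estimate. The function name `wilson_interval` and the counts in the usage lines are hypothetical, chosen only to contrast a single trial with a sample of a few thousand utterances.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


print(wilson_interval(0, 1))        # a single failed trial
print(wilson_interval(2700, 3000))  # hypothetical counts from logged calls
```

Under these assumptions, a single failed trial is still consistent with a true accuracy anywhere up to about 79%, while 2700 correct out of 3000 narrows the estimate to roughly 0.89 to 0.91.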

You are both really off on the wrong track. A simple qualification tie-in would overcome any glitch that either of you comes up with!

If there is an interruption, then before the system accepts “apple” rather than “pineapple”, you could run a qualification loop: “I’m sorry, did you say apple or pineapple?” If this still fails, you can do the “now I am lost” routine: “Press 1 for apple and 2 for pineapple.” Problem solved.
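A minimal sketch of such a loop, assuming the recognizer can hand back its top candidates with scores. The `nbest` shape, the `margin` value, and the `next_step` name are all hypothetical, not any particular platform's API.

```python
def next_step(nbest, margin=0.15, failed_confirmations=0):
    """Decide what to do with a list of (entry, confidence) candidates.

    Accept the top candidate only when it clearly beats the runner-up;
    otherwise ask, then fall back to keypad input, then to a person.
    """
    if not nbest:
        return "TRANSFER to live agent"
    ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
    top = ranked[0][0]
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return f"ACCEPT {top}"
    runner_up = ranked[1][0]
    if failed_confirmations == 0:
        return f"I'm sorry, did you say {top} or {runner_up}?"
    if failed_confirmations == 1:
        return f"Press 1 for {top} and 2 for {runner_up}."
    return "TRANSFER to live agent"


close_call = [("apple", 0.62), ("pineapple", 0.58)]
print(next_step(close_call))
print(next_step(close_call, failed_confirmations=1))
print(next_step(close_call, failed_confirmations=2))
```

The loop only helps when both candidates are on the list and score close together. It does nothing for an utterance that matched one wrong entry with high confidence.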

If you use such a loop, the customer doesn’t mind, as he is being served.

Ultimately, if nothing is understood, you can transfer to a live agent! They still exist, I hope you both know! ;)

Biggest point: speech IVR is not the solution to every call center or automated voice prompt scenario!

KISSes

Certainly, there could be some improvements:

- When heading to each of the five “good” fruits, send the caller first to a question such as “Yes or no, was that pineapple?”, and if you get a No, go try again.

- In the confidence scores of the main question, make a middle range where we’re not high enough to go directly to a known fruit, but not low enough to go to No Match. If we hit this, play some prompt such as “Please say that again”, giving the recognizer a second chance to hear what the caller wants. It’s short enough that the caller won’t barge into it.
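Those two ideas combine into a three-band router, sketched here. The lower bound reuses the article's 45% example; the upper bound `ACCEPT_AT` and the function name `route` are assumptions picked purely for illustration.

```python
NO_MATCH_BELOW = 0.45  # example threshold from the article
ACCEPT_AT = 0.70       # hypothetical upper bound of the middle range


def route(fruit: str, confidence: float) -> str:
    """Map a recognition result to the next prompt or to Z (No Match)."""
    if confidence < NO_MATCH_BELOW:
        return "Z"
    if confidence < ACCEPT_AT:
        # Middle range: not sure enough to act, not bad enough to give up.
        return "Please say that again."
    # High range: still confirm before committing to the fruit.
    return f"Yes or no, was that {fruit}?"


print(route("pineapple", 0.90))  # Yes or no, was that pineapple?
print(route("pineapple", 0.50))  # Please say that again.
print(route("pineapple", 0.20))  # Z
```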

But again, the main point of this article was something else. It’s not terribly difficult to do a small handful of X1/X2/Xn choices (plus a No Match Z) well — with or without confirmation re-prompts. The point was: if we try to add a Y category of “all other” known but unwanted fruits, it will likely ruin the performance of both the X items and the Z No Match. It waters down the several good Xn choices we really want, and it creates ambiguities that wouldn’t be there without the Y category.
