Posted on November 9, 2018 at 21:54
After the multiple issues we encountered during Robocup@Home, I think we definitely learned some lessons on how to properly approach speech for robotics applications. Noise, voice activity detection, and beamforming are all active areas of research; however, they've recently matured into usable technologies. Based on my experience with speech IRL, here's how the framework should be designed - TL;DR use an API :)
Since 2016, when we first started prepping for Robocup, we tried about three different frameworks: Facebook's Wit.ai, CMU's PocketSphinx, and ROSpeex. I think ROSpeex was the best of those, not just because it was ROS compatible, but because it had VAD built in. Its main drawbacks were a difficult interface and API, and a browser-based interface that didn't always work reliably. PocketSphinx was excellent for offline speech recognition, with the drawback of requiring you to specify your own corpus/dictionary. That involved a lot of manual work, so we avoided it; however, I think the next step would be to use PocketSphinx in the event we don't have a network connection.
There will be noise, and lots of it I may add. There will be noise from you and noise from sounds you never even thought to consider. It's important to somehow filter out these signals prior to processing, and the way we do so is via VAD (voice activity detection). I won't dive into the details since I don't understand them much myself, but it works through some feature extraction on the signal-to-noise ratio (and the approaches differ from there). Not all speech-to-text frameworks perform this, and the one I found to do it best was Google's API. This was one of the main reasons we chose it.
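To give a feel for the idea (this is a toy sketch, nothing like what Google's API actually does), an energy-based VAD simply flags audio frames whose energy sits well above an estimated noise floor:

```python
import numpy as np

def simple_vad(signal, frame_len=160, threshold_ratio=3.0):
    """Toy energy-based VAD: flag frames whose energy exceeds a
    multiple of the quietest frame's energy (assumed to be noise)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1)
    noise_floor = energies.min() + 1e-12
    return energies > threshold_ratio * noise_floor

# Synthetic example: quiet noise with a louder tone burst in the middle
rng = np.random.default_rng(0)
sig = 0.01 * rng.standard_normal(1600)
sig[640:960] += 0.5 * np.sin(2 * np.pi * 440 * np.arange(320) / 16000)
flags = simple_vad(sig)  # True only for the frames containing the burst
```

Real VADs use richer features (spectral shape, pitch, learned models) and smoothing over time, but the core concept is the same: decide per frame whether speech is present before handing anything to the recognizer.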
After we blocked out all the noise, we could get some legible sentences to parse. Unfortunately, sometimes even those sentences can be considered noise. If I'm talking and someone in the background decides to rudely interrupt me, my brain can obviously filter out the interruption as noise - the algorithm does not differentiate. This was a critical issue we failed to address: in the first Robocup, our robot would say something and then process its own speech as a sentence to parse, resulting in an infinite loop of "Sorry, I couldn't get that" and "Say it again please?" My naive approach was the following:
```python
# Naive approach: split off the first word and look for a hotword
sentence = sentence.split(' ', 1)
if any(hotword in sentence for hotword in hotword_list):
    parse_sentence(sentence)
```
Foolproof! This led to many issues, such as needing to handle stray whitespace (sometimes spaces came as the first characters) and people pronouncing our hotword differently, so the STT API recognized it with slightly different characters. We eventually shifted over to Baidu's Snowboy. This was much cleaner, since we could simply execute a callback whenever the hotword was detected.
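For illustration, the callback-driven pattern looks roughly like this in plain Python (the hotwords and function names here are made up; Snowboy itself works on the raw audio stream rather than on STT text, so treat this as the shape of the pattern, not its implementation):

```python
HOTWORDS = {"robot", "jarvis"}  # hypothetical hotwords

def handle_transcript(transcript, on_hotword):
    """Fire the callback only when a hotword appears in the transcript.
    Normalizing whitespace and case avoids the brittleness of exact
    string matching against STT output."""
    words = transcript.strip().lower().split()
    if any(word in HOTWORDS for word in words):
        on_hotword(transcript.strip())
        return True
    return False

heard = []
handle_transcript("  Robot bring me water", heard.append)   # fires callback
handle_transcript("someone chatting nearby", heard.append)  # ignored
```

The win over the naive version is that everything downstream only ever runs inside the callback, so background chatter (including the robot's own voice) never reaches the parser.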
You've finally cleaned up your sentence and extracted only the signal you want. Now what do you do with the text? NLP, of course. Good thing Dialogflow took care of this, providing us with the user's intent and the action we need to execute. But how exactly do we go about doing so? I think this is platform-dependent, but we approached it the ROS way: we basically had a dictionary of functions and services to call as needed, based on the action provided by Dialogflow. Parameters were passed accordingly as function arguments or ROS messages. While this is a rather simple approach, I don't think it scales appropriately. Once you have 10 different modules maintained by 10 different people, it's difficult to maintain all their dependencies in one package. But that's for another discussion on overall system architecture design.
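The dispatch dictionary idea can be sketched like this (action names and handlers are hypothetical stand-ins for our ROS services):

```python
# Hypothetical handlers that would wrap ROS service calls in practice
def go_to(location):
    return f"navigating to {location}"

def grasp(obj):
    return f"grasping {obj}"

# Map each Dialogflow action string to the function that executes it
ACTION_HANDLERS = {
    "navigate": go_to,
    "pick_up": grasp,
}

def dispatch(action, **params):
    """Look up the handler for an action and forward its parameters."""
    handler = ACTION_HANDLERS.get(action)
    if handler is None:
        return "Sorry, I couldn't get that"
    return handler(**params)

result = dispatch("navigate", location="kitchen")
```

This keeps the mapping in one place, which is exactly why it stops scaling: every new module means another entry and another dependency in this single package.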
It's one thing to simply respond to a user's request and another to respond intelligently and with the right context. This is an important feature to consider when using Dialogflow, since you can pass parameters from previous contexts and recall them later in your backend. It also plays a role in providing socially, culturally, and age-appropriate responses - a topic that is actively being researched and is a business strategy employed by some. DialogflowROS actually takes this into consideration, providing wrappers for the contexts and their parameters.
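As a rough sketch of the recall step, a webhook backend can walk the output contexts in the request and pull a parameter saved earlier in the conversation (the payload below mimics the shape of a Dialogflow v2 webhook request, but the project/session names and parameters are invented for the example):

```python
def get_context_param(webhook_request, context_suffix, param):
    """Find a context by the trailing part of its name and return one
    of its parameters, or None if the context/parameter is absent."""
    for ctx in webhook_request.get("queryResult", {}).get("outputContexts", []):
        if ctx.get("name", "").endswith(context_suffix):
            return ctx.get("parameters", {}).get(param)
    return None

# Minimal mock of an incoming webhook request carrying an earlier context
request = {
    "queryResult": {
        "outputContexts": [
            {
                "name": "projects/p/agent/sessions/s/contexts/order",
                "parameters": {"drink": "coffee"},
            }
        ]
    }
}

drink = get_context_param(request, "order", "drink")
```

With something like this, a follow-up utterance such as "make it a large one" can be resolved against the drink the user already asked for, instead of forcing them to repeat themselves.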
Leave me your thoughts and suggestions!