Posted on July 5, 2018 at 12:49
Tag: Open Source
Whenever a robot shares an environment with humans, there has to be some interface the human can use to give the robot a task or a query. For Robocup@Home, the primary interface is speech: the operator instructs our Toyota Human Support Robot (HSR) with spoken commands. Our team decided that the Google Speech-to-Text API would be the optimal solution to transcribe an operator’s input and extract the relevant parameters and intents.
Our first design challenge was conducting all the computation on an external computer while using the microphone available on the robot, since the robot has limited computational power. Because the external computer and the robot were always on the same network, I wrote a simple Python socket server that serves audio from the microphone to a requesting client (the computer). Audio was buffered in a queue so that no data was dropped, and chunks were removed once they had been consumed. A Python generator continuously yielded that data to Google’s servers using the Speech API’s gRPC streaming protocol.
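The queue-plus-generator pattern at the heart of this might look something like the sketch below (class and method names are illustrative, not our actual code): one thread puts raw audio chunks into the queue as they arrive from the microphone socket, while the generator pops them off and yields them to the streaming gRPC call.

```python
import queue


class AudioBuffer:
    """Queue-backed audio buffer: a producer thread puts raw chunks in,
    and the streaming generator removes them as they are consumed."""

    def __init__(self):
        self._q = queue.Queue()

    def put(self, chunk: bytes):
        # Called by the thread reading from the microphone socket.
        self._q.put(chunk)

    def close(self):
        # Sentinel marks the end of the utterance/stream.
        self._q.put(None)

    def generator(self):
        # Yields chunks to the gRPC streaming request until closed,
        # removing each chunk from the queue as it goes.
        while True:
            chunk = self._q.get()
            if chunk is None:
                return
            yield chunk
```

Passing `buffer.generator()` as the request iterator is the usual way to feed a gRPC streaming call: the client library pulls chunks on demand, so the queue naturally rate-limits the upload to what the microphone produces.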
Figure 1: Flowchart of the audio pipeline
A nice feature of the Speech API was the ability to send metadata describing where and how the audio was recorded. This gave us much more accurate transcriptions and removed most of the preprocessing we had been doing on our end to detect noise or to decide when an utterance started or ended. One thing we did keep was a hotword that triggers the audio server to begin recording: an operator had to say “HSR, bring me the cup” or “Frasier, where is the waving woman?” in order to get a response from the robot.
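A minimal sketch of that hotword check (the function name and the exact normalization are assumptions, not our actual code): if the transcript starts with a known hotword, return the rest as the command; otherwise ignore the utterance.

```python
# Hotwords that address the robot; anything else is ignored.
HOTWORDS = ("hsr", "frasier")


def extract_command(transcript: str):
    """Return the command text if the transcript begins with a
    hotword, or None if the utterance was not addressed to the robot."""
    first, _, rest = transcript.strip().partition(" ")
    if first.lower().rstrip(",") in HOTWORDS:
        return rest.lstrip()
    return None
```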
Now that we have the text, we can run Natural Language Processing (NLP) on it and extract the intentions, parameters, and actions associated with the operator’s request. This was done by passing the text received from the Speech API to Google’s NLP platform Dialogflow. Dialogflow gives us an action/function to perform and the associated parameters/arguments extracted from the sentence. Since we were using this for the Robocup@Home competition, the majority of the questions and their grammars were known beforehand, including how some of the answers should be phrased. This made it rather simple to define the various intentions of each request and the associated actions. On the back end, when we received our response from Dialogflow, we executed the requested function and passed any parameters as arguments. Responses that involved extracting information from the robot’s knowledge base were processed in the back end as well, and if the robot did not know how to respond, it used the default response received from Dialogflow.
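The back-end dispatch can be sketched as a table mapping Dialogflow action names to handler functions, with Dialogflow’s default reply as the fallback (the handlers and action names here are hypothetical examples, not our real set):

```python
def bring_object(obj):
    # Hypothetical handler: would trigger the robot's fetch behavior.
    return f"Bringing the {obj}"


# Maps Dialogflow action names to back-end functions.
HANDLERS = {"bring": bring_object}


def handle_response(action, params, fulfillment_text):
    """Execute the requested action with its extracted parameters,
    or fall back to Dialogflow's default response text."""
    handler = HANDLERS.get(action)
    if handler is not None:
        return handler(**params)
    # Unknown action: robot doesn't know how to respond, so use
    # the default response that came back from Dialogflow.
    return fulfillment_text
```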
This approach, however, was very deterministic and did not account for the complex tasks we would come across in the General Purpose Service Robot challenge at Robocup@Home. Thus, we have transitioned to a more robust approach using Google’s NLP API that allows us to construct sequences of actions and to fall back gracefully on requests the robot does not understand.
All of this was of course done in the ROS environment, since that’s how we communicate with our robot! Although the audio server that reads in audio data was not a ROS node, the audio client that sends the data to Google’s servers also published the utterances received from the Speech API onto a ROS topic. This topic was consumed by the Dialogflow ROS node, which took in the text data and published a custom ROS message carrying Dialogflow’s response: the action, parameters, fallback response, and context. Our new approach with Google’s NLP API now publishes a sequence of tasks with the associated parameters and functions as needed.
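The contents of that custom message, and the mapping from a Dialogflow response onto it, could be sketched in plain Python as below. The input field names follow the Dialogflow v2 REST response (`queryResult.action`, `parameters`, `fulfillmentText`, `outputContexts`); the message layout itself is illustrative, not our actual `.msg` definition.

```python
from dataclasses import dataclass, field


@dataclass
class DialogflowResult:
    """Mirrors the fields of a custom ROS message for Dialogflow output."""
    action: str = ""
    parameters: dict = field(default_factory=dict)
    fallback: str = ""
    contexts: list = field(default_factory=list)


def to_msg(df_response: dict) -> DialogflowResult:
    """Map a Dialogflow v2 query result (as a dict) onto the message.
    A real ROS node would fill in an instance of the generated message
    class instead and publish it on the Dialogflow results topic."""
    result = df_response.get("queryResult", df_response)
    return DialogflowResult(
        action=result.get("action", ""),
        parameters=dict(result.get("parameters", {})),
        fallback=result.get("fulfillmentText", ""),
        contexts=[c.get("name", "") for c in result.get("outputContexts", [])],
    )
```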