Access it on SmilingRobo here
Petoi’s Bittle is a palm-sized, open-source, programmable robot dog for STEM and fun. Bittle can connect to a Raspberry Pi and is easy to extend. This project was done during my internship at Petoi. The goal was to develop a real-time voice-control module for Bittle, so that it can be commanded to perform actions by voice.
The final approach combines VAD (Voice Activity Detection), DTW (Dynamic Time Warping), and Vosk to achieve real-time voice control.
PyAudio was used at the beginning, but it is an aging library, so sounddevice and soundfile were used instead.
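For reference, a minimal recording sketch with the replacement libraries. The function name, file path, and duration here are placeholders, and sounddevice needs a working input device, so this is only illustrative:

```python
def record_clip(path, seconds=2, samplerate=16000):
    """Record a short mono clip and save it as a 16-bit PCM WAV file."""
    import sounddevice as sd  # pip install sounddevice
    import soundfile as sf    # pip install soundfile

    # rec() starts a non-blocking recording; wait() blocks until it finishes.
    audio = sd.rec(int(seconds * samplerate), samplerate=samplerate,
                   channels=1, dtype="int16")
    sd.wait()
    sf.write(path, audio, samplerate, subtype="PCM_16")
```

16 kHz mono 16-bit is a deliberate choice: it is the format most offline recognizers (including those tried below) expect.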
From a functional point of view, the methods can be divided into two categories: (1) training a model to classify commands, and (2) matching audio against recorded templates.
DTW belongs to the second category and is essentially template matching. DTW computes the cost of aligning a piece of audio with each template recording, and we pick the template with the lowest cost. This method requires no training and still works when new commands are added. The downside is that the computation is time-consuming, but command audio is short, and we can eliminate silence and extract MFCC (Mel-Frequency Cepstral Coefficient) features to keep the sequences small.
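A minimal sketch of the matching step. This assumes each command template is already an MFCC frame sequence (frames × coefficients); the function names are illustrative, and in practice a library such as librosa could extract the MFCCs:

```python
import numpy as np

def dtw_cost(a, b):
    """Accumulated DTW alignment cost between two feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between frames, plus the cheapest
            # of the three allowed predecessor cells.
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def match_command(query, templates):
    """Return the command whose template aligns with the lowest cost."""
    costs = {name: dtw_cost(query, t) for name, t in templates.items()}
    return min(costs, key=costs.get)
```

Because every new command only needs one template recording, adding commands is just a matter of adding entries to the `templates` dictionary — no retraining.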
The first category is demonstrated in “Speech Command Recognition with torchaudio” from the official PyTorch tutorials. But the model must be re-trained whenever new commands come in.
This was inspired by the blog “Audio Handling Basics: Process Audio Files In Command-Line or Python” on Hacker Noon. The blog shows that the silent parts of a recording can be eliminated based on the short-term energy of the audio data. The Python library librosa provides functions for doing this.
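A hand-rolled sketch of the short-term-energy idea (the frame size, hop, and threshold ratio below are arbitrary illustrative choices; `librosa.effects.split` offers a ready-made, dB-based version of the same thing):

```python
import numpy as np

def trim_silence(y, frame_len=512, hop=256, thresh_ratio=0.05):
    """Trim leading/trailing silence using per-frame short-term energy."""
    # Mean squared amplitude of each overlapping frame.
    energy = np.array([np.mean(y[i:i + frame_len] ** 2)
                       for i in range(0, len(y) - frame_len + 1, hop)])
    threshold = thresh_ratio * energy.max()
    voiced = np.flatnonzero(energy > threshold)
    if voiced.size == 0:          # nothing above threshold: leave as-is
        return y
    start = voiced[0] * hop
    end = voiced[-1] * hop + frame_len
    return y[start:end]
```

Stripping silence this way both shortens the sequences DTW has to align and removes the dead air that can confuse a recognizer.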
Several open-source methods were tried:
Offline recognition; provides lightweight TFLite models for low-resource devices. It requires 16-bit, 16 kHz, mono audio. A newer version supports Chinese. It was tested with both non-stripped and silence-stripped audio, using both the large and the small models, but it did not perform well. For example:
Chinese (command → misrecognition):
起立 (stand up) → 嘶力 / 成立
向前跑 (run forward) → 睡前跑
向前走 (walk forward) → 当前走

English commands tested: “Hey Bittle”, “Stand up”, “Walk forward”, “Run forward”.
16 recordings have been used so far. An empty result is returned when it encounters OOV (out-of-vocabulary) words, and “Bittle” is recognized as “be to”. After silence elimination, some results changed from wrong to correct, and some from correct to wrong (possibly because the silence between the words was also reduced).
SeanNaren/deepspeech.pytorch: it offers no lightweight models; the models are nearly 900 MB, which is too big for a Raspberry Pi.
Uberi/speech_recognition: it provides multiple backends, such as the Google and Microsoft APIs. The only offline-recognition method is no longer maintained.
alphacep/vosk (used): Vosk provides offline recognition and lightweight models for both Chinese and English. However, the documentation is incomplete.
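A minimal sketch of offline recognition with Vosk on a short command clip. The model directory path is a placeholder (a downloaded Vosk model must be unpacked there), and the optional `commands` list is an assumption worth noting: passing a JSON list of phrases to `KaldiRecognizer` restricts the vocabulary, which helps considerably with a small fixed command set:

```python
import json
import wave

def recognize_command(wav_path, model_dir, commands=None):
    """Transcribe a 16-bit mono PCM WAV clip with Vosk (offline)."""
    from vosk import Model, KaldiRecognizer  # pip install vosk

    wf = wave.open(wav_path, "rb")   # Vosk expects 16-bit mono PCM
    model = Model(model_dir)
    if commands:
        # Optional grammar: limit recognition to the known commands.
        rec = KaldiRecognizer(model, wf.getframerate(),
                              json.dumps(commands, ensure_ascii=False))
    else:
        rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    return json.loads(rec.FinalResult()).get("text", "")
```

Usage would look like `recognize_command("clip.wav", "model", ["stand up", "walk forward"])`, where `"model"` is the unpacked model directory.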
A test result for the Chinese model:

Non-stripped correct: 16/21
Stripped correct: 16/21
Total correct: 32/42