What if you want your voice assistant to start listening when you say "Sussi, play 'We Are the Champions'"? Or imagine you need to analyze an audio file and count how many times Trump says “billions”.
You need hot word detection! In the above examples, "Sussi" and “billions” are hot words.
Okay, great! How do you build something like this? Full-blown speech recognition? Nah, that’s too compute-intensive. Welcome, PocketSphinx!
PocketSphinx
PocketSphinx is a lightweight speech recognition engine developed at CMU. It offers a wide range of functionality; here we concentrate on hot word detection.
PocketSphinx can detect multiple hot words. You specify the hot words you are interested in, along with the phonemes and a sensitivity threshold for each one. It can also detect hot words that are not in the English dictionary, as long as you provide a phonetic description of the word.
# hot word /threshold/
Alex /1e-40/
Sussi /1e-30/
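A list like this is usually kept in a plain text file, one phrase per line. Here is a minimal sketch, in pure Python, that writes and re-reads such a file. The helper names (`write_keyphrase_list`, `parse_keyphrase_list`) and the file name `keyphrase.list` are made up for illustration. PocketSphinx can be pointed at a file like this through its `-kws` option instead of `-keyphrase`, but check the documentation of your installed version before relying on that.

```python
import os
import re
import tempfile

# Matches lines such as "Sussi /1e-30/": a phrase, then a threshold between slashes.
LINE_RE = re.compile(r"^(?P<phrase>.+?)\s*/(?P<threshold>[^/]+)/\s*$")


def write_keyphrase_list(path, thresholds):
    """Write a {phrase: threshold} mapping as 'phrase /threshold/' lines."""
    with open(path, "w") as f:
        for phrase, threshold in thresholds.items():
            f.write("%s /%g/\n" % (phrase, threshold))


def parse_keyphrase_list(path):
    """Read 'phrase /threshold/' lines back into a {phrase: threshold} dict."""
    result = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skipping blank and '#' lines is a leniency of this sketch only.
            if not line or line.startswith("#"):
                continue
            match = LINE_RE.match(line)
            if match is None:
                raise ValueError("Malformed line: %r" % line)
            result[match.group("phrase")] = float(match.group("threshold"))
    return result


# Hypothetical file name, written to a temporary directory for the demo.
demo_path = os.path.join(tempfile.mkdtemp(), "keyphrase.list")
write_keyphrase_list(demo_path, {"Alex": 1e-40, "Sussi": 1e-30})
print(parse_keyphrase_list(demo_path))
```

Keeping one threshold per phrase in a file makes it easy to tune each hot word separately without touching the code.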
Installation
Detailed instructions for installation (https://github.com/cmusphinx/pocketsphinx-python)
PocketSphinx includes Python support; however, its build is based on Automake and is not well supported on Windows.
pip install pocketsphinx
If you are using a Raspberry Pi or another ARM-based board, you may also need to install swig.
sudo apt install swig
This tutorial uses PyAudio to open the microphone and read the audio stream that is fed to PocketSphinx.
pip install PyAudio
Code Snippets
Create a hotword dictionary file - hotwords.dict
shop SH AA P
assist AH S IH S T
You can use any key phrase you like, but make sure every word in it has a proper phoneme entry in the dictionary. You can consider generating the phonemes using G2PModel.
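Since a typo in a phoneme is easy to miss, a quick sanity check of the dictionary entries can help. The sketch below assumes the acoustic model uses the 39-symbol, stress-free ARPAbet set found in CMUdict-style en-us dictionaries. If your model uses a different phone set, replace `ARPABET` accordingly. The function name `invalid_phonemes` is made up for this example.

```python
# Assumed phone set: the 39 stress-free ARPAbet symbols used by CMUdict-style dictionaries.
ARPABET = set(
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH".split()
)


def invalid_phonemes(dict_lines):
    """Return {word: [unknown symbols]} for dictionary lines that look wrong."""
    problems = {}
    for line in dict_lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        word, phones = parts[0], parts[1:]
        if not phones:
            problems[word] = ["<no phonemes>"]
            continue
        bad = [p for p in phones if p not in ARPABET]
        if bad:
            problems[word] = bad
    return problems


entries = ["shop SH AA P", "assist AH S IH S T"]
print(invalid_phonemes(entries))  # an empty dict means every symbol was recognized
```

A check like this only confirms that the symbols exist in the assumed phone set. It does not tell you whether the pronunciation is a good match for how the word is actually spoken.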
Imports
import os
from datetime import datetime

from pocketsphinx import Decoder
import pyaudio
Start a pyaudio stream
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=20480)
stream.start_stream()
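To get a feel for what these stream settings mean in time and memory, here is a small arithmetic sketch. It assumes 16-bit mono samples at 16 kHz, matching the `paInt16`, `channels=1` and `rate=16000` arguments above. The helper names are made up for illustration.

```python
RATE = 16000            # samples per second, as passed to p.open()
SAMPLE_WIDTH_BYTES = 2  # paInt16 means 16 bits = 2 bytes per sample
CHANNELS = 1            # mono


def chunk_duration_ms(frames, rate=RATE):
    """Duration in milliseconds of a chunk holding `frames` audio frames."""
    return 1000.0 * frames / rate


def chunk_size_bytes(frames, width=SAMPLE_WIDTH_BYTES, channels=CHANNELS):
    """Size in bytes of a chunk holding `frames` audio frames."""
    return frames * width * channels


# The stream buffer above holds 20480 frames; a read of 1024 frames is much shorter.
print(chunk_duration_ms(20480), "ms in the stream buffer")
print(chunk_duration_ms(1024), "ms per 1024-frame read")
print(chunk_size_bytes(1024), "bytes per 1024-frame read")
```

Smaller reads give the decoder audio more often, which lowers the delay between speaking the hot word and reacting to it.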
Configuration
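# pocketsphinx_dir must point to your PocketSphinx installation directory
model_dir = os.path.join(pocketsphinx_dir, 'model')
ps_config = Decoder.default_config()
ps_config.set_string('-hmm', os.path.join(model_dir, 'en-us/en-us'))  # acoustic model
ps_config.set_string('-dict', os.path.join(model_dir, 'en-us/hotwords.dict'))  # phoneme dictionary
ps_config.set_string('-keyphrase', 'shop assist')  # key phrase to detect
ps_config.set_float('-kws_threshold', 1e-30)  # detection threshold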
The decoder needs to be configured with an acoustic model, a dictionary containing the phonemes for each word of the key phrase, and the key phrase itself along with its threshold.
The threshold plays an important role in detection. It ranges from 1e-1 to 1e-50. For shorter key phrases you can use larger thresholds such as 1e-1; for longer key phrases the threshold must be smaller, down to 1e-50. Tune the threshold to balance false alarms (false positives) against missed detections (false negatives): a threshold that is too high leads to false negatives, and one that is too low leads to false positives.
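One simple way to tune the threshold is to sweep over candidate values and keep the one with the fewest total errors. The sketch below shows only the selection logic. In practice, `evaluate` would run the decoder over recordings where you know when the hot word was spoken. Here `fake_evaluate` is a made-up stand-in with invented error counts, and all function names are hypothetical.

```python
import math


def candidate_thresholds():
    """Thresholds 1e-1, 1e-2, ..., 1e-50, from least to most sensitive."""
    return [float("1e-%d" % exponent) for exponent in range(1, 51)]


def pick_threshold(thresholds, evaluate):
    """Return the threshold with the lowest (false alarms + misses).

    `evaluate(threshold)` must return a (false_alarms, misses) tuple.
    On ties, the first (highest) threshold in the list wins.
    """
    best_cost, best_threshold = None, None
    for threshold in thresholds:
        false_alarms, misses = evaluate(threshold)
        cost = false_alarms + misses
        if best_cost is None or cost < best_cost:
            best_cost, best_threshold = cost, threshold
    return best_threshold


def fake_evaluate(threshold):
    """Invented numbers: high thresholds miss, low thresholds raise false alarms."""
    exponent = int(round(-math.log10(threshold)))
    misses = max(0, 20 - exponent)
    false_alarms = max(0, exponent - 30)
    return false_alarms, misses


print(pick_threshold(candidate_thresholds(), fake_evaluate))
```

Summing false alarms and misses treats both kinds of error as equally bad. If a false wake-up is more annoying than a missed one in your application, weight the two terms differently.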
Start decoder
decoder = Decoder(ps_config)
decoder.start_utt()
In an endless loop, read the input stream using PyAudio and process the raw data in chunks of 1024 frames. Check for the hot word after each chunk; if it is detected, proceed to perform the cool tasks :)
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        print([(seg.word, seg.prob, seg.start_frame, seg.end_frame) for seg in decoder.seg()])
        print("Detected %s at %s" % (decoder.hyp().hypstr, str(datetime.now().time())))
        decoder.end_utt()
        # PERFORM COOL TASK
        # Start a new utterance so that the next hot word can be detected
        decoder.start_utt()
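The same loop structure can be used to count hot word occurrences in a recording, as in the “billions” example from the introduction. Below is a minimal sketch that takes the decoder as an argument and only relies on the `start_utt()`, `process_raw()`, `hyp()` and `end_utt()` calls used in the loop above. The names `count_detections`, `read_chunks` and `FakeDecoder` are made up for illustration, and `FakeDecoder` is a stand-in so the example runs without PocketSphinx installed. For a real run, the audio is assumed to be raw 16 kHz, 16-bit mono PCM, matching the stream settings.

```python
import io


def read_chunks(fileobj, chunk_bytes=2048):
    """Yield fixed-size chunks from a binary file object until it is exhausted."""
    return iter(lambda: fileobj.read(chunk_bytes), b"")


def count_detections(decoder, chunks):
    """Feed raw audio chunks to a keyword-spotting decoder and count detections."""
    count = 0
    decoder.start_utt()
    for buf in chunks:
        decoder.process_raw(buf, False, False)
        if decoder.hyp() is not None:
            count += 1
            # Restart the utterance so the next occurrence can be detected.
            decoder.end_utt()
            decoder.start_utt()
    decoder.end_utt()
    return count


class FakeDecoder:
    """Stand-in decoder: 'detects' whenever a chunk contains a marker byte."""

    def __init__(self, marker=b"\x01"):
        self.marker = marker
        self._hit = False

    def start_utt(self):
        self._hit = False

    def end_utt(self):
        self._hit = False

    def process_raw(self, buf, no_search, full_utt):
        if self.marker in buf:
            self._hit = True

    def hyp(self):
        return "hot word" if self._hit else None


# Three 2048-byte chunks of fake audio; only the second contains the marker.
audio = io.BytesIO(b"\x00" * 2048 + b"\x01" + b"\x00" * 2047 + b"\x00" * 2048)
print(count_detections(FakeDecoder(), read_chunks(audio)))
```

With a real decoder you would pass the configured `Decoder` instance and an audio file opened in binary mode instead of the fake objects.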
Adjust the threshold value and the phonemes if you observe false positives or false negatives.
Upcoming Blog
Detection of words in audio files by auto generating phonemes and setting the threshold by fine tuning
Thanks to our Cranky Coder Balasubramanyam S for the contribution :)