Hot word detection using PocketSphinx

Sunil D Shashidhara

Jan 5, 2019

Updated: Feb 22, 2019

What if you want your voice assistant to start listening when someone says "Sussi, play 'We Are the Champions'"? Or imagine you need to analyze an audio file and count how many times Trump says “billions”.

You need hot word detection! In the above examples, "Sussi" and “billions” are hot words.

Okay, great! How do you build something like this? Full-blown speech recognition? Nah, that’s too compute-intensive. Welcome, PocketSphinx!

PocketSphinx

PocketSphinx is a lightweight speech recognition engine developed at CMU. It offers a wide range of functionality; here we concentrate on hot word detection.

PocketSphinx can detect multiple hot words. You specify the hot words you are interested in, along with the phonemes and a sensitivity threshold for each one. It can also detect hot words that are not part of the English dictionary, as long as you provide a phonetic description of the word.

# hot word /threshold/

Alex /1e-40/

Sussi /1e-30/
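
The listing above pairs each hot word with its own threshold, written between slashes. As a minimal sketch, a file in that shape can be written and read back with plain Python. The helper names and the file name keyphrases.list are illustrative assumptions, not part of PocketSphinx. PocketSphinx can load such a keyphrase list through its -kws option, used in place of -keyphrase.

```python
import os
import tempfile


def write_keyphrase_list(path, thresholds):
    """Write one 'phrase /threshold/' line per hot word."""
    with open(path, "w") as f:
        for phrase, threshold in thresholds.items():
            f.write("%s /%g/\n" % (phrase, threshold))


def read_keyphrase_list(path):
    """Parse 'phrase /threshold/' lines back into a dict."""
    result = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Everything before the first slash is the phrase,
            # the number between the slashes is the threshold.
            phrase, _, rest = line.partition("/")
            result[phrase.strip()] = float(rest.rstrip("/"))
    return result


# Hot words and thresholds taken from the listing above.
keyphrases = {"Alex": 1e-40, "Sussi": 1e-30}
list_path = os.path.join(tempfile.mkdtemp(), "keyphrases.list")  # hypothetical file name
write_keyphrase_list(list_path, keyphrases)
parsed = read_keyphrase_list(list_path)
```

Keeping the thresholds in a dict makes it easy to tune one hot word without touching the others, then regenerate the file.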

Installation

Detailed installation instructions: https://github.com/cmusphinx/pocketsphinx-python

PocketSphinx includes Python support; however, its build is based on Automake and is not well supported on Windows.

pip install pocketsphinx

If you are using a Raspberry Pi or another ARM-based board, you may also need to install swig.

sudo apt install swig

This example uses PyAudio to open the microphone and listen to the audio stream.

pip install PyAudio

Code Snippets

Create a hotword dictionary file - hotwords.dict

shop SH AA P

assist AH S IH S T

You can specify any key phrase you like, but make sure every word in it has proper phonemes in the dictionary. You can consider generating the phonemes using G2PModel.
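
A quick sanity check on the dictionary can catch typos before the decoder ever runs. The sketch below assumes the dictionary uses the 39-phone ARPAbet set of the CMU Pronouncing Dictionary without stress markers; parse_dict and check_keyphrase are hypothetical helpers written for this post, not PocketSphinx functions.

```python
# The 39 ARPAbet phones of the CMU Pronouncing Dictionary (stress markers omitted).
CMU_PHONES = set(
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH".split()
)


def parse_dict(text):
    """Turn 'word PH1 PH2 ...' lines into {word: [phones]}."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            entries[parts[0]] = parts[1:]
    return entries


def check_keyphrase(keyphrase, entries):
    """Return a list of problems; an empty list means the phrase looks usable."""
    problems = []
    for word in keyphrase.split():
        if word not in entries:
            problems.append("missing word: " + word)
            continue
        for phone in entries[word]:
            if phone not in CMU_PHONES:
                problems.append("unknown phone %s in %s" % (phone, word))
    return problems


# The two entries from hotwords.dict above.
hotwords = parse_dict("shop SH AA P\nassist AH S IH S T\n")
problems = check_keyphrase("shop assist", hotwords)
```

An empty problems list only means the entries are well formed; whether the phonemes actually match how the word is spoken still has to be checked by ear and by testing.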

Imports

import os
from datetime import datetime

import pyaudio
from pocketsphinx import Decoder

Start a pyaudio stream

p = pyaudio.PyAudio()

stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=20480)

stream.start_stream()

Configuration

# pocketsphinx_dir must point to your PocketSphinx directory (set it before running this snippet)
model_dir = os.path.join(pocketsphinx_dir, 'model')
ps_config = Decoder.default_config()
# Acoustic model
ps_config.set_string('-hmm', os.path.join(model_dir, 'en-us/en-us'))
# Dictionary with the phonemes of the hot words
ps_config.set_string('-dict', os.path.join(model_dir, 'en-us/hotwords.dict'))
# Key phrase to spot and its detection threshold
ps_config.set_string('-keyphrase', 'shop assist')
ps_config.set_float('-kws_threshold', 1e-30)

The decoder needs to be configured with an acoustic model (-hmm), a dictionary containing the phonemes for the key words (-dict), and the key phrase along with its threshold.

The threshold plays an important role in detection. It typically ranges from 1e-1 to 1e-50. For shorter key phrases you can use larger thresholds like 1e-1; for longer key phrases the threshold must be smaller, down to 1e-50. The threshold must be tuned to balance false alarms (false positives) against missed detections (false negatives). A high threshold can lead to false negatives and a low threshold can lead to false positives.
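
One simple way to approach the tuning is to try a grid of thresholds on recordings where you already know the answer, count the errors for each, and keep the best one. The sketch below shows only the bookkeeping; candidate_thresholds and pick_threshold are hypothetical helpers, and the error counts are assumed to come from your own test runs (they are not produced here).

```python
def candidate_thresholds(min_exp=1, max_exp=50, step=5):
    """Return thresholds 1e-<min_exp>, ..., spaced by `step` exponents."""
    return [float("1e-%d" % e) for e in range(min_exp, max_exp + 1, step)]


def pick_threshold(counts):
    """Pick the threshold with the fewest total errors.

    `counts` maps threshold -> (false_alarms, misses), measured by you
    on labeled recordings. Ties go to the larger (stricter) threshold.
    """
    return min(counts, key=lambda t: (sum(counts[t]), -t))


grid = candidate_thresholds(1, 50, 10)
```

Summing false alarms and misses treats both errors as equally bad. If a false wake-up is more annoying than a missed one in your application, weight the two counts differently.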

Start decoder

decoder = Decoder(ps_config)

decoder.start_utt()

In an endless loop, read the input stream with PyAudio in chunks of 1024 frames and pass the raw data to the decoder. Check for the hot word; if it is detected, go on to perform the cool tasks :)

while True:
    # Read 1024 frames from the microphone
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        print([(seg.word, seg.prob, seg.start_frame, seg.end_frame) for seg in decoder.seg()])
        print("Detected %s at %s" % (decoder.hyp().hypstr, str(datetime.now().time())))
        decoder.end_utt()
        # PERFORM COOL TASK
        # Start a new utterance so the decoder can keep listening
        decoder.start_utt()
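
The same loop can be wrapped in a function that counts detections over any sequence of audio chunks, which is handy for the "count how many times a word is said" use case. This is a sketch under assumptions: count_detections and StubDecoder are hypothetical names, and the stub only imitates the four decoder calls used in this post (start_utt, process_raw, hyp, end_utt) so the example runs without a microphone or PocketSphinx installed.

```python
def count_detections(decoder, chunks):
    """Feed raw audio chunks to a decoder and count key phrase hits."""
    hits = 0
    decoder.start_utt()
    for buf in chunks:
        decoder.process_raw(buf, False, False)
        if decoder.hyp() is not None:
            hits += 1
            # Restart the utterance so the previous hit is cleared.
            decoder.end_utt()
            decoder.start_utt()
    decoder.end_utt()
    return hits


class StubDecoder:
    """Stand-in for pocketsphinx.Decoder, for demonstration only.

    It "detects" the key phrase whenever a chunk contains b"HIT".
    """

    def __init__(self):
        self._hyp = None

    def start_utt(self):
        self._hyp = None

    def end_utt(self):
        self._hyp = None

    def process_raw(self, buf, no_search, full_utt):
        if b"HIT" in buf:
            self._hyp = "shop assist"

    def hyp(self):
        return self._hyp


# Four fake chunks, two of which contain the marker.
chunks = [b"\x00" * 2048, b"HIT", b"\x00" * 2048, b"HIT"]
total = count_detections(StubDecoder(), chunks)
```

With a real decoder you would pass the configured Decoder object and an iterable of 16-bit, 16 kHz mono chunks, for example the buffers read from the PyAudio stream.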

Adjust the threshold value and the phonemes if you observe false positives or false negatives.

References

https://cmusphinx.github.io/wiki/tutoriallm/

https://pypi.org/project/pocketsphinx/

https://github.com/cmusphinx/pocketsphinx-python

Upcoming Blog

Detecting words in audio files by auto-generating phonemes and fine-tuning the threshold

Thanks to our Cranky Coder Balasubramanyam S for the contribution :)