DigiKey-eMag-Embedded and MCUs-Vol 16

How to implement a voice user interface on resource-constrained MCUs

DSpotter asks for each command word or phrase, which the tool breaks down into phonemes. The command set and supporting data for the VUI are then built into a binary file that the developer includes in the project along with the Cyberon library. The library and the binary file are used together on the MCU to support the recognition of the desired speech commands.

The DSpotter tool creates ‘CommandSets’ that can be logically connected by the developer’s program to create a VUI with different levels. This allows for multi-level commands such as ‘I’d like the lightbulb set to high, please’, where the command words are ‘lightbulb’, followed by ‘set’ and ‘high’. Each command in a group has its own index, as does each command within a level (Figure 3).

The DSpotter library processes incoming sound and searches for phonemes that match the commands in the database. When it finds a match, it returns the index and group numbers. This arrangement allows the main application code to use a hierarchical switch statement to process the command words/phrases as they arrive.

The resulting library can be small enough to fit on an MCU with just 256 kilobytes (Kbytes) of flash memory and 32 Kbytes of SRAM. The CommandSet can grow if more memory is available.
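As a rough illustration of the hierarchical switch described above, the sketch below branches first on the group number and then on the command index. It is written in Python for readability; on the MCU the same structure would typically be nested C switch statements. The group and index values, the constant names, and `handle_match` are assumptions made for this example and are not part of the Cyberon DSpotter API.

```python
# Hypothetical group numbers, one per level of the VUI.
GROUP_DEVICE = 0   # e.g. 'lightbulb'
GROUP_ACTION = 1   # e.g. 'set'
GROUP_VALUE = 2    # e.g. 'low', 'high'


def handle_match(group, index):
    """Map a (group, index) recognition result to an application action."""
    if group == GROUP_DEVICE:          # outer branch: which level matched
        if index == 0:                 # inner branch: which command in it
            return "device: lightbulb"
    elif group == GROUP_ACTION:
        if index == 0:
            return "action: set"
    elif group == GROUP_VALUE:
        if index == 0:
            return "value: low"
        if index == 1:
            return "value: high"
    return "ignored"                   # unknown group or index
```

Calling `handle_match(GROUP_VALUE, 1)` returns `"value: high"`, while any pair that is not listed falls through to `"ignored"`. Keeping the outer branch on the group makes it easy to add commands to one level without touching the others.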

Figure 3: The DSpotter tool allows the creation of ‘CommandSets’ that can be logically connected by the developer’s program to create a VUI with different levels. Image source: Renesas

has just 14. If a VUI uses an English command set of 200 words, each word could be broken down into its associated phonemes from the set of 44. Within VUI software, each phoneme could then be identified by a numeric code (or a ‘token’), with the various tokens forming the language. Storing words as sounds requires extensive computational resources and takes up far more memory space than phonemes stored as tokens. Processing phoneme tokens (and thus command words) in an expected order further simplifies computation and makes it possible to run VUI software locally on a modest MCU (Figure 2). This means that the software efficiencies achieved by using phonemes allow the processing to run locally. Removing the need for Cloud processing means there is no requirement for continuous internet connectivity that introduces user privacy and data security concerns.

Renesas has shown a commercial VUI software package based on the phoneme principle as part of its ecosystem. The software, called Cyberon DSpotter, creates a VUI algorithm that is streamlined enough to run on Renesas RA series MCUs featuring Arm Cortex-M4 and M33 cores.

Developing with Cyberon DSpotter

Cyberon DSpotter is built on a library of phonemes and phoneme combinations. This is an alternative approach to the traditional and computing-heavy training of algorithms to recognize specific words. To break down words into phonemes and then represent them as tokens, the developer can use the DSpotter Modeling Tool. DSpotter is embedded (non-Cloud) software that works as a local voice trigger and command-recognition solution with robust noise reduction. It consumes minimal resources and is highly accurate. Depending on the selected MCU, secure data transfer can also be implemented.

It is important for the developer to appreciate that there are limitations to the phoneme method for a VUI. The relatively limited resources of the MCU dictate that Cyberon DSpotter is speech recognition rather than voice recognition. This means the software cannot perform natural language processing. Hence, if the command words don’t follow a logical sequence (for example, ‘high’, ‘lightbulb’, ‘set’ instead of ‘lightbulb’, ‘set’, ‘high’), the system won’t recognize the command and will reset back to the top level.

One design suggestion is to add a visual indicator to the VUI (for example, an LED) to indicate when the processor assumes it is at the top level of the CommandSet, prompting the user to reissue the command in the logical sequence (Figure 4).

Figure 4: The streamlined nature of Cyberon DSpotter requires that commands follow a logical sequence, or they won’t be recognized. Image source: Renesas

Running a non-Cloud VUI with restricted resources

The efficiency of Cyberon DSpotter allows it to run on Renesas’ RA2, RA4, and RA6 families of Arm Cortex-M MCUs. These are popular

Figure 5: The R7FA4W1AD2CNG MCU provides ample resources to build a non-Cloud VUI for applications like a smart light switch. Image source: Renesas
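The sequence rule described above, where an out-of-order command word resets the VUI to the top level and an LED shows that state, can be sketched as a small state tracker. This is a minimal Python illustration under stated assumptions, not Cyberon's implementation: the class name `SequenceTracker`, the `led_on` property, and the three-word command are all hypothetical, and recognized words are passed in as plain strings rather than as index and group numbers.

```python
# Hypothetical three-level command, one word per level.
EXPECTED_SEQUENCE = ["lightbulb", "set", "high"]


class SequenceTracker:
    """Tracks progress through a fixed command sequence."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.level = 0  # 0 means the top level of the CommandSet

    @property
    def led_on(self):
        """The indicator LED is lit while waiting at the top level."""
        return self.level == 0

    def feed(self, word):
        """Advance on the expected word, otherwise reset to the top level.

        Returns True only when the last word of the sequence arrives
        in order, i.e. when the whole command has been recognized.
        """
        if word == self.sequence[self.level]:
            self.level += 1
            if self.level == len(self.sequence):
                self.level = 0      # command complete, back to the top
                return True
            return False
        self.level = 0              # out of order: reset, LED comes on
        return False
```

Feeding ‘lightbulb’, ‘set’, ‘high’ in that order makes the final `feed` call return `True`. Feeding ‘lightbulb’ followed directly by ‘high’ returns `False` and leaves `led_on` set, which is the cue for the user to start the command again. For simplicity, the sketch discards the word that caused the reset rather than re-checking it against the top level.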
