How to Implement a Voice User Interface on Resource-Constrained MCUs

Contributed By DigiKey's North American Editors

Smart speakers and other connected hubs form the heart of the smart home, allowing users to control devices and access the Internet. Two trends are apparent as these devices proliferate: users prefer voice control over button presses or complicated menu systems, and there is increasing discomfort with continuous cloud connectivity because of privacy concerns.

However, a robust and secure voice user interface (VUI) typically demands powerful hardware and complex software for voice recognition. Anything less will likely result in poor performance and unsatisfactory user experiences. Also, many smart speakers and hubs are battery powered, so a VUI must be achieved within a tight power budget. Such an ambitious project can be daunting for a developer lacking experience with voice interfaces.

Chip makers are responding by introducing a technique based on phonemes that significantly reduces the processing requirements. The result is highly accurate and efficient VUI software that can run on familiar 32-bit microcontrollers (MCUs) and is supported by easy-to-use design tools.

This article describes VUI challenges and use cases. It then introduces commercial, easy-to-use MCU application software and local phoneme-based VUI software for connected home applications. The article concludes by showing developers how to get started on VUI projects using Renesas MCUs, VUI software, and evaluation kits.

The challenges of building a VUI

A VUI is speech recognition technology that enables interaction with a computer, smartphone, home automation system, or other device using voice commands. After early engineering challenges, the technology has matured into a reliable control interface and is now widely used in smart speakers and other smart home devices. The key benefit of a VUI is its convenience: instant control from anywhere within voice range with no need to use a keyboard, mouse, buttons, menus, or other interfaces to input commands (Figure 1).

Figure 1: VUI technology has been widely adopted in homes and smart buildings because it is convenient and flexible. (Image source: Renesas)

The downside of a VUI is its complexity. Conventional technology is based on the lengthy training of a model to recognize specific words or phrases, and natural language processing is word-order independent, which demands considerable development work and significant computing power to run in real time. This has slowed the broader adoption of VUIs.

Now, a new technique simplifies VUI software to the extent that it can run on small, efficient MCUs such as Arm® Cortex®-M devices. This technique relies on the fact that all words in each spoken language are made up of linguistic sounds called phonemes. There are far fewer phonemes than words; English has 44, Italian has 32, and the traditional Hawaiian language has just 14. If a VUI uses an English command set of 200 words, each word can be broken down into its associated phonemes from the set of 44.

Within VUI software, each phoneme can then be identified by a numeric code (or "token"), with sequences of tokens representing the words of the language. Storing words as recorded sounds requires extensive computational resources and far more memory than storing their phonemes as tokens. Processing phoneme tokens (and thus command words) in an expected order further simplifies computation and makes it possible to run VUI software locally on a modest MCU (Figure 2).

Figure 2: Representing words using phonemes demands fewer microcontroller resources. (Image source: Renesas)
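To make the memory savings concrete, consider a single one-second command word. The following C sketch assumes 16 kilohertz (kHz), 16-bit mono audio and purely illustrative token values; the real token assignments are made by the modeling tool described below, not by application code.

#include <stdint.h>

/* One second of raw 16-bit mono audio at 16 kHz occupies 32,000 bytes. */
#define SAMPLE_RATE_HZ 16000u
static int16_t raw_word_audio[SAMPLE_RATE_HZ];   /* ~32 Kbytes just for one word */

/* The same word ("lightbulb") reduced to phoneme tokens. English needs only
 * 44 phonemes, so each token fits comfortably in a single byte. The values
 * below are made up for illustration. */
static const uint8_t lightbulb_tokens[] = { 27, 5, 40, 12, 38, 21, 9 };  /* 7 bytes */

Matching a handful of byte-sized tokens against incoming phonemes is far cheaper than correlating full audio waveforms, which is what allows the recognizer and its command data to fit within an MCU's on-chip memory.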

The software efficiencies achieved by using phonemes allow the processing to run locally. Removing the need for cloud processing eliminates the requirement for continuous internet connectivity, and with it the user privacy and data security concerns that such connectivity introduces.

Renesas has introduced a commercial VUI software package based on the phoneme principle as part of its ecosystem. The software, called Cyberon DSpotter, creates a VUI algorithm that is streamlined enough to run on Renesas RA series MCUs featuring Arm Cortex-M4 and Cortex-M33 cores.

Developing with Cyberon DSpotter

Cyberon DSpotter is built on a library of phonemes and phoneme combinations, an alternative to the traditional, compute-heavy approach of training algorithms to recognize specific words. To break words down into phonemes and then represent them as tokens, the developer uses the DSpotter Modeling Tool.

DSpotter is embedded (non-cloud) software that works as a local voice trigger and command-recognition solution with robust noise reduction. It consumes minimal resources and is highly accurate. Depending on the selected MCU, secure data transfer can also be implemented.

The DSpotter Modeling Tool asks the developer for each command word or phrase, which it breaks down into phonemes. The command set and supporting data for the VUI are then built into a binary file that the developer includes in the project along with the Cyberon library. The library and the binary file are used together on the MCU to support recognition of the desired speech commands.
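As a rough sketch of how these pieces fit together at build time, the generated model can be linked into flash as a constant array and handed to the recognition library once at startup. The symbol and function names below (g_command_model, vui_recognizer_init) are illustrative placeholders rather than the actual Cyberon API; the e2 studio example project shows the real calls.

#include <stdint.h>
#include <stddef.h>

/* Command model produced by the DSpotter Modeling Tool, converted to a
 * const array so the linker places it in flash alongside the application. */
static const uint8_t g_command_model[] = {
    0x00, /* ...contents of the generated binary file... */
};

/* Illustrative placeholder for the library's initialization call. */
static int vui_recognizer_init(const uint8_t *model, size_t model_len)
{
    (void)model;
    (void)model_len;
    return 0; /* the real library builds its phoneme search structures here */
}

int vui_setup(void)
{
    /* Hand the flash-resident model to the recognition library once at boot. */
    return vui_recognizer_init(g_command_model, sizeof(g_command_model));
}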

The DSpotter tool creates “CommandSets” that can be logically connected by the developer’s program to create a VUI with different levels. This allows multi-level commands such as “I’d like the lightbulb set to high, please,” where the command words are “lightbulb,” followed by “set,” and then “high.” Each group of commands has its own number, and each command within a group has its own index (Figure 3).

Figure 3: The DSpotter tool allows the creation of “CommandSets” that can be logically connected by the developer’s program to create a VUI with different levels. (Image source: Renesas)

The DSpotter library processes incoming sound and searches for phonemes that match the commands in the database. When it finds a match, it returns the group number and the command index within that group. This arrangement allows the main application code to use a hierarchical switch statement to process command words and phrases as they arrive, as shown in the sketch below. The resulting library can be small enough to fit on an MCU with just 256 kilobytes (Kbytes) of flash memory and 32 Kbytes of SRAM, and the CommandSet can grow if more memory is available.
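A minimal sketch of that hierarchical dispatch is shown below. The group numbers, command indices, and actions are assumptions chosen to match the lightbulb example; the actual values are assigned when the CommandSets are built.

#include <stdio.h>

/* Hypothetical group numbers for a multi-level command structure. */
enum { GROUP_DEVICE = 1, GROUP_LEVEL = 2 };

/* Called by the application each time the recognizer reports a match. */
void on_command(int group, int index)
{
    switch (group) {
    case GROUP_DEVICE:                       /* first level: which device */
        switch (index) {
        case 0: printf("target: lightbulb\n"); break;
        case 1: printf("target: thermostat\n"); break;
        default: break;
        }
        break;
    case GROUP_LEVEL:                        /* next level: desired setting */
        switch (index) {
        case 0: printf("set level: high\n"); break;
        case 1: printf("set level: low\n");  break;
        default: break;
        }
        break;
    default:                                 /* wake word or unrecognized group */
        break;
    }
}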

It is important for the developer to appreciate the limitations of the phoneme method for a VUI. The relatively limited resources of the MCU dictate that Cyberon DSpotter performs speech recognition rather than voice recognition, which means the software cannot perform natural language processing. Hence, if the command words don’t follow the expected sequence (for example, “high,” “lightbulb,” “set” instead of “lightbulb,” “set,” “high”), the system won’t recognize the command and will reset back to the top level.

One design suggestion is to add a visual indicator to the VUI (for example, an LED) to indicate when the processor assumes it is at the top level of the CommandSet, prompting the user to reissue the command in the logical sequence (Figure 4).

Figure 4: The streamlined nature of Cyberon DSpotter requires that commands follow a logical sequence, or they won’t be recognized. (Image source: Renesas)
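A minimal sketch of the suggested indicator follows, assuming a hypothetical board-specific LED helper and a simple level counter maintained by the application.

#include <stdbool.h>

#define LEVEL_TOP 0

static int s_current_level = LEVEL_TOP;

/* Placeholder for the board-specific GPIO call that drives the indicator LED. */
static void led_set(bool on) { (void)on; /* write the LED pin here */ }

/* Call after every recognition attempt: advance on a match, reset on failure. */
void update_vui_level(bool command_recognized)
{
    if (command_recognized) {
        s_current_level++;                    /* descend into the next CommandSet */
    } else {
        s_current_level = LEVEL_TOP;          /* out-of-sequence word: start over */
    }
    led_set(s_current_level == LEVEL_TOP);    /* LED on = ready for a new command */
}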

Running a non-cloud VUI with restricted resources

The efficiency of Cyberon DSpotter allows it to run on Renesas’ RA2, RA4, and RA6 families of Arm Cortex-M MCUs. These are popular across a wide range of consumer, industrial, and IoT applications. They are supported by easy-to-use design tools, making it relatively straightforward to build a simple VUI without extensive coding experience or in-house expertise.

The choice of a particular RA family MCU primarily comes down to the complexity of commands and the Cyberon library's size. A smart light switch, which requires a modest command set and limited computing power to operate effectively, could be based on the R7FA4W1AD2CNG from the RA4 family. This MCU has a battery-friendly 48 megahertz (MHz) Arm Cortex-M4 core supported by 512 Kbytes of flash memory and 96 Kbytes of SRAM. It features a segment LCD controller, a capacitive touch sensing unit, Bluetooth Low Energy (Bluetooth LE) wireless connectivity, USB 2.0 Full-Speed, a 14-bit analog-to-digital converter (ADC), a 12-bit digital-to-analog converter (DAC), plus security and safety features (Figure 5).

Figure 5: The R7FA4W1AD2CNG MCU provides ample resources to build a non-cloud VUI for applications like a smart light switch. (Image source: Renesas)

A more extensive Cyberon DSpotter library and a more powerful core are needed for an application such as a smart speaker. A suitable candidate is the R7FA6M4AF3CFM. This MCU from the RA6 family features the more powerful 200 MHz Arm Cortex-M33 core supported by 1 megabyte (Mbyte) of flash memory and 256 Kbytes of SRAM. It has a CAN bus, Ethernet, I²C, LIN bus, a capacitive touch sensing unit, and many other interfaces and peripherals.

The RA4 and RA6 families are supported by evaluation boards, the RTK7EKA4W1S00000BJ and the RTK7EKA6M4S00001BE, respectively, to allow a developer to exercise the MCUs’ capabilities. Each evaluation board has the target MCU and an onboard debugger.

Renesas also offers a VUI solution kit to accelerate development. The kit is similar to the evaluation boards in that it incorporates the target device and an onboard debugger. The board also features several I/O interfaces and has four microphones: two analog and two digital.

Access to the software needed for development with the VUI solution kit is available on Cyberon’s website. This includes complimentary Cyberon DSpotter Modeling Tool access and features an e2 studio project with a working voice CommandSet (e2 studio is an Eclipse-based integrated development environment (IDE) for Renesas MCUs). The example CommandSet can be used as a template for developing custom voice command sequences. The system’s reactions can then be monitored using a terminal window. It generally takes about 15 minutes to create the VUI structure shown in Figure 4.

More sophisticated application software design around the Cyberon package is supported by the Renesas Flexible Software Package (FSP) for embedded system designs using the RA families. The FSP is based on an open software ecosystem and includes Azure RTOS or FreeRTOS, legacy code, and third-party ecosystems. It can run in several IDEs, including e2 studio.
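To illustrate how the VUI might sit inside an FSP project that uses FreeRTOS, the sketch below wraps the recognizer in its own task that blocks on microphone frames. The queue, frame size, stack depth, and the vui_process_frame helper are assumptions for illustration; the FSP configurator generates the actual audio driver and RTOS plumbing.

#include <stdint.h>
#include <stddef.h>
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

#define FRAME_SAMPLES 160u   /* 10 ms of 16 kHz audio per frame (assumption) */

static QueueHandle_t s_audio_queue;   /* filled by the microphone driver callback */

/* Hypothetical helper that would feed PCM into the recognizer and dispatch
 * results; stubbed out here to keep the sketch self-contained. */
static void vui_process_frame(const int16_t *pcm, size_t samples)
{
    (void)pcm;
    (void)samples;
}

static void vui_task(void *arg)
{
    static int16_t frame[FRAME_SAMPLES];
    (void)arg;

    for (;;) {
        /* Block until the audio driver delivers the next PCM frame. */
        if (xQueueReceive(s_audio_queue, frame, portMAX_DELAY) == pdPASS) {
            vui_process_frame(frame, FRAME_SAMPLES);
        }
    }
}

void vui_start(void)
{
    s_audio_queue = xQueueCreate(4, sizeof(int16_t) * FRAME_SAMPLES);
    (void)xTaskCreate(vui_task, "vui", 1024, NULL, tskIDLE_PRIORITY + 2, NULL);
}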

How well does the VUI perform?

It is one thing for a VUI to perform well in a quiet laboratory, but quite another for it to work accurately with significant background noise. A typical operating environment for a smart speaker could include a TV or radio, conversation, other music sources, and the general hubbub of a household or a social gathering. Moreover, the VUI will have to contend with dialects and less-than-perfect diction. Despite these challenges, users expect almost flawless performance.

To improve performance in a difficult listening environment, Cyberon DSpotter software running on the Renesas RA family of MCUs includes noise immunity features that require minimal processor resources. To demonstrate their efficacy, tests were done with a Cyberon DSpotter VUI listening to commands while subjected to various background noise sources at 1.5 and 3 meter (m) distances, with signal-to-noise ratios (SNRs) of 5 and 10 decibels (dB) as well as in a quiet (clean) environment. In all cases, the VUI outperformed the Amazon Alexa benchmark (Table 1).

SNR | Background noise | Distance | Hit rate | Alexa requirement
(Clean) | none | 1.5 m | 100.00% | 90%
(Clean) | none | 3 m | 100.00% | 90%
10 dB | Babble | 1.5 m | 98.55% | 80%
10 dB | Babble | 3 m | 98.84% | 80%
10 dB | Music | 1.5 m | 98.26% | 80%
10 dB | Music | 3 m | 98.55% | 80%
10 dB | TV | 1.5 m | 98.84% | 80%
10 dB | TV | 3 m | 98.55% | 80%
5 dB | Babble | 1.5 m | 98.84% | 80%
5 dB | Babble | 3 m | 96.24% | 80%
5 dB | Music | 1.5 m | 98.84% | 80%
5 dB | Music | 3 m | 97.08% | 80%
5 dB | TV | 1.5 m | 93.37% | 80%
5 dB | TV | 3 m | 90.72% | 80%

Table 1: Command success test results for a Cyberon-powered VUI with various sources of background noise. In all cases, the VUI outperformed the Amazon Alexa benchmark. (Image source: Renesas)

Conclusion

VUIs are rapidly becoming the preferred consumer control interface for smart products. A speech control approach using phonemes as the basis of commands and a strict command structure can dramatically reduce memory and computing requirements, allowing the technology to run locally on small, resource-constrained MCUs.


