Voice Initiated: Processing Voice Signals for Intelligent Applications


Improving the clarity of speech signals for hands-free voice-assist applications is critical to today’s voice-initiated intelligent applications. As Internet of Things (IoT) technologies penetrate our daily lives and homes, intelligent voice has become a critical component of these technologies.

Parks Associates (www.parksassociates.com) announced voice control as the No. 1 consumer IoT trend at CES® 2017, stating that “Voice control is vying to become the primary user interface for the smart home and connected lifestyle” (Parks Associates announces “Top 10 Consumer IoT Trends in 2017” and players to watch in the new year, 2016).

Voice-assist applications have expanded well beyond the smartphone into diverse markets vying for inclusion in the home and business. User experience differentiates and defines the better product. Consumers' expectation of reliable and accurate voice control is driving development.

Intelligent voice-assist applications, both currently on the market and in development, include but are not limited to AI voice assistants, doorbells/intercoms, smart-home devices, security systems, smoke detectors/alarms, medical information relay systems, baby monitors, remote classrooms, smart appliances, and home security.

In a voice-assist application, the speech generally originates at a distance from the voice-assist microphone. This is referred to as far-field speech. In many voice-assist applications, a keyword, sometimes referred to as a wake-word, must be recognized by the application.


Speech recognition performance degrades drastically in noisy and reverberant environments. In any home, office, or outdoor setting, sound is all around us. The greater the distance between the speaker and the microphone, the greater the distortion from added ambient noise. Background noises, such as a running dishwasher, a television set, children playing, or dogs barking, need to be removed from the sound stream so that the application can distinguish the keyword from other speech signals.

When background noise and reverberation mix with the speech of interest, speech recognition accuracy declines dramatically. This is especially true when the background noise is itself speech. The effect worsens as the distance between the talker and the microphone increases.
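The effect of distance can be illustrated with a rough calculation. The sketch below is not taken from Adaptive Digital's software; it assumes idealized free-field spreading (the direct speech level falls by 20·log10 of the distance ratio) and a constant ambient noise level, and it ignores reverberation, which makes real rooms behave differently. The function name and the 20 dB starting value are hypothetical.

```python
import math


def snr_at_distance_db(snr_at_ref_db, distance_m, ref_distance_m=1.0):
    """Estimate SNR after the talker moves from ref_distance_m to distance_m.

    Assumptions (illustrative only): free-field spreading of the direct
    speech path and an ambient noise level that does not change with
    the talker's position.
    """
    if distance_m <= 0 or ref_distance_m <= 0:
        raise ValueError("distances must be positive")
    return snr_at_ref_db - 20.0 * math.log10(distance_m / ref_distance_m)


if __name__ == "__main__":
    # Hypothetical starting point: 20 dB SNR with the talker at 1 m.
    for d in (0.5, 1.0, 2.0, 4.0):
        print(f"{d:>4} m: {snr_at_distance_db(20.0, d):5.1f} dB")
```

Under these assumptions, each doubling of the talker's distance costs about 6 dB of signal-to-noise ratio, which is one reason far-field capture needs more processing than a close-talking microphone.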

Adaptive Digital uses algorithms that recognize the dominant voice and suppress background chatter. The Far-field Voice Input Processing software first detects far-field speech, then reduces the clutter so that the voice application can send a clear voice signal or distinguish a wake-word from noise sources.

For certain environments, a microphone array may be employed for voice capture. In a microphone array, a number of microphones are arranged in either a circular or linear pattern and used to pick up speech signals via phase steering. Essentially, the microphones, while not physically pointing in any specific direction, point acoustically in one or many directions. When a voice command emanates from a particular direction, the clutter noise on the periphery of that direction is either reduced or not picked up by the microphone array.

The number of microphones in the array and the distance between them affect the accuracy, frequency range, and direction of the beam.
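One way to see how spacing ties to frequency is the common half-wavelength rule of thumb for a uniformly spaced linear array: to avoid spatial aliasing (ambiguous beam directions), the spacing should not exceed half the shortest wavelength of interest. The sketch below applies that rule; the function names, the 343 m/s speed of sound, and the example values are assumptions for illustration, not parameters from any Adaptive Digital product.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for room-temperature air


def max_unaliased_freq_hz(spacing_m, speed=SPEED_OF_SOUND):
    """Highest frequency a uniform linear array can steer without spatial
    aliasing, using the half-wavelength rule: spacing <= wavelength / 2."""
    if spacing_m <= 0:
        raise ValueError("spacing must be positive")
    return speed / (2.0 * spacing_m)


def max_spacing_m(freq_hz, speed=SPEED_OF_SOUND):
    """Largest microphone spacing that keeps freq_hz free of spatial aliasing."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return speed / (2.0 * freq_hz)


if __name__ == "__main__":
    print(max_unaliased_freq_hz(0.04))  # 4 cm spacing
    print(max_spacing_m(8000.0))        # spacing needed for 8 kHz
```

With these assumed values, a 4 cm spacing stays alias-free up to roughly 4.3 kHz. Wider spacing sharpens the beam at low frequencies but lowers that limit, which is the trade-off behind choosing the number and spacing of microphones.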

[Figure: Far-field voice processing. Acoustic echo: local loopback.]

Beamforming software improves the signal-to-noise ratio of speech signals by exploiting the differences in phase with which the speech arrives at each microphone. The beamforming algorithm makes the microphone array's gain maximum in the direction of the dominant speaker's voice. By increasing the gain in that direction while reducing the gain in the direction of the reflective paths, the signal-to-interferer ratio is increased, which reduces the reverb effect.
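The idea of steering gain through phase can be sketched with a basic delay-and-sum beamformer. This is a textbook technique used here for illustration, not a description of Adaptive Digital's algorithm. The sketch assumes a linear array, a far-field plane wave, a single tone, and angles measured from broadside; the function names, the 4 cm spacing, and the 2 kHz tone are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room-temperature air


def mic_delays(positions_m, angle_deg):
    """Relative arrival delay (seconds) at each microphone of a linear array
    for a far-field plane wave arriving at angle_deg from broadside."""
    s = math.sin(math.radians(angle_deg))
    return [x * s / SPEED_OF_SOUND for x in positions_m]


def array_gain(positions_m, freq_hz, steer_deg, arrival_deg):
    """Normalized magnitude response of a delay-and-sum beamformer.

    The array is phase-steered toward steer_deg; the result is the gain
    seen by a tone at freq_hz arriving from arrival_deg (1.0 = full gain).
    """
    steer = mic_delays(positions_m, steer_deg)
    arrive = mic_delays(positions_m, arrival_deg)
    # Each microphone's steering phase cancels the arrival phase only
    # when the sound comes from the steered direction.
    total = sum(
        cmath.exp(2j * math.pi * freq_hz * (t_steer - t_arrive))
        for t_steer, t_arrive in zip(steer, arrive)
    )
    return abs(total) / len(positions_m)


if __name__ == "__main__":
    positions = [i * 0.04 for i in range(4)]  # 4 mics, 4 cm apart (assumed)
    print("talker direction:", array_gain(positions, 2000.0, 30.0, 30.0))
    print("off-axis source: ", array_gain(positions, 2000.0, 30.0, -40.0))
```

Sound from the steered direction adds coherently across the microphones, while sound from other directions, such as wall reflections, adds with mismatched phases and is attenuated.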

Acoustic beamforming software is attached as a pre-processor, ahead of the rest of the voice processing chain. With the addition of beamforming software, the overall solution becomes more robust.
