Have a VoIP question? Answers to many of your inquiries.
THE FOLLOWING FAQS HAVE BEEN ANSWERED BY ADAPTIVE DIGITAL ENGINEERS, EXPERTS IN THE FIELD OF TELECOMMUNICATIONS
THERE ARE MANY QUESTIONS THAT YOU SHOULD ASK YOURSELF AND MANY ASPECTS TO CONSIDER:
Interoperability: Does your system interoperate with other equipment? If so, you probably need to support at least one vocoder that is supported by the other equipment.
Bit Rate: How many bits per second can you afford in your communication link?
Speech Quality: There is a wide variation in speech quality between the various vocoders. In general, speech quality goes down as bit rate goes down.
Complexity: The amount of Memory and MIPS required by the various vocoders can vary quite a bit. A less complex algorithm can run on a less expensive DSP with less RAM and lower clock speed. Stated differently, you can fit more channels of a less complex algorithm onto a given DSP.
Audio Bandwidth: The traditional telephone system has an audio bandwidth ranging from 300 Hz to 3300 Hz. Most vocoders are therefore designed to operate on this audio band at a sampling rate of 8 kHz. Some algorithms operate at twice the sampling rate (16 kHz), giving an audio bandwidth of approximately 7 kHz. These vocoders are useful when sending music or when higher voice quality is desired. Keep in mind that if your system supports 7 kHz of audio bandwidth but you are connecting your system to a traditional telephone system, the net effect is still the smaller bandwidth. Beyond these audio codecs, there are even wider-band options such as MP3, AAC, and AMR-WB+, all of which operate at sampling rates as high as 48 kHz and beyond. These are typically used for CD-quality music.
Discontinuous Transmission (DTX): DTX is employed in some vocoders to reduce the bit rate during non-voiced periods. In a two-party phone call, only one person typically speaks at any given time. DTX exploits the gaps in voice by transmitting less data during these times. The more sophisticated DTX algorithms are able to maintain the level and spectrum of the original background noise during non-voiced periods.
Packet Loss Concealment (PLC): In VoIP and other systems, data is often lost. Some vocoders include PLC as a method to minimize the loss in voice quality due to lost data.
Voiceband Modems: Vocoders are tuned to pass voice, but not necessarily voiceband modem signals. Low bit rate vocoders have an especially difficult time passing voiceband modem signals.
Tone Passing: Low bit rate vocoders do not pass in-band tonal signaling such as DTMF tones properly. Many vocoders distort these tones beyond the ability of detectors to recognize them reliably. If you choose a vocoder that is not able to pass tones reliably, you can take care of this problem using a tone relay (tone bypass) algorithm.
Frame Size / Delay: Most vocoders operate on a frame-by-frame basis, where a frame is defined to be a set of a given number of samples. In order to begin processing a frame of data, the vocoder must wait for all the samples to be available. This introduces a delay into a communication system. This data buffering delay is incurred at both the transmit side and the receive side. The total delay introduced into a system is therefore usually twice the frame size.
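The buffering delay described above is simple arithmetic, and a short sketch can make it concrete. The function name and the example frame sizes below are illustrative assumptions, not figures for any particular vocoder, and the calculation covers only the frame buffering delay (no algorithmic look-ahead, jitter buffering, or network delay).

```python
def buffering_delay_ms(frame_samples: int, sample_rate_hz: int = 8000) -> float:
    """Estimate frame buffering delay for a frame-based vocoder.

    One frame of delay is incurred while the transmit side collects a full
    frame, and one more at the receive side, so the total is twice the
    frame duration.
    """
    if frame_samples <= 0 or sample_rate_hz <= 0:
        raise ValueError("frame_samples and sample_rate_hz must be positive")
    frame_ms = 1000.0 * frame_samples / sample_rate_hz
    return 2 * frame_ms


# Hypothetical frame sizes at the 8 kHz telephony sampling rate:
# 80 samples  -> 10 ms frame -> 20 ms total buffering delay
# 160 samples -> 20 ms frame -> 40 ms total buffering delay
delay_short = buffering_delay_ms(80)
delay_long = buffering_delay_ms(160)
```

A smaller frame size lowers the delay, which is one reason frame size belongs on the list of selection criteria alongside bit rate and complexity.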
Being involved in the field of real-time embedded software, we are often asked about how much of a CPU’s capability is required to run one or more instances of our algorithms. A CPU’s capability is a function of many things: clock speed, instruction set architecture, internal and external bus width, cache performance, external memory speed and width, etc.
Clock speed is measured in cycles per second. The measurement that we use to characterize the utilization of our algorithms is clock speed, measured in MHz. Given today’s DSP speeds, MHz (1 Million Cycles Per Second) is generally used as the unit of measurement, although some DSPs are also specified in GHz (1 Billion Cycles Per Second). For our purpose, we define this as “Millions of Instruction Cycles Per Second” or MIPS. But many CPUs can perform more than one mathematical operation in a single instruction cycle. For example, the TI C6000 series can perform 8 operations per instruction cycle. So TI lists a 500 MHz C6000 with a rating of 4000 MIPS. The problem with this characterization is that no realistic program can be written to make use of 8 instructions per cycle in every instruction cycle. Furthermore, there is no tool available to characterize an algorithm’s CPU utilization in terms of MIPS the way TI defines MIPS.
Our solution to this is in our definition of MIPS for the purpose of characterizing our algorithms. That is:
MIPS = MHz
If we say that an algorithm requires 10 MIPS (per channel) to operate on a particular C6000 device, that’s the same as saying that it requires 10 MHz. Stated differently, a 500 MHz C6000 can handle 50 channels of that algorithm. For example, if an algorithm requires 10 MIPS per channel and a DSP can run 600 MIPS, it is possible to run 60 simultaneous channels of that algorithm on a single DSP.
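The channel-count arithmetic above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the whole DSP is available to the algorithm, with no allowance for framework or operating system overhead.

```python
def max_channels(dsp_mhz: float, mips_per_channel: float) -> int:
    """Number of whole channels that fit on a DSP under the MIPS = MHz rule."""
    if dsp_mhz <= 0 or mips_per_channel <= 0:
        raise ValueError("dsp_mhz and mips_per_channel must be positive")
    # Only complete channels count, so round down.
    return int(dsp_mhz // mips_per_channel)


# The two cases from the text: 10 MIPS per channel on 500 MHz and 600 MHz parts.
channels_500 = max_channels(500, 10)  # 50
channels_600 = max_channels(600, 10)  # 60
```

In practice some headroom is usually reserved for other processing, so the result is best read as an upper bound.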
Some DSP vendors do not adopt this convention. Instead, if a DSP can perform multiple instructions in a single cycle, they multiply the clock speed by the maximum number of instructions per cycle. For example, if a DSP clock speed is 600 MHz and it can execute up to 8 instructions per clock cycle, the vendor will rate the DSP at 4800 MIPS. We believe this can cause quite a bit of confusion. Although the DSP may be able to execute 8 instructions per clock cycle on occasion, it cannot sustain this rate for a real algorithm. The average number of instructions per clock cycle is far lower, but it is not a fixed number. We therefore believe it is simpler to equate MIPS to MHz.
In conclusion, when you are selecting a DSP and looking at our MIPS specs, remember the golden rule: MIPS = MHz.
Program Memory – also known as text or code.
Data Memory – data memory that is common across multiple channels (tables, .cinit, .const, etc.)
Scratch Data – data memory that does not need to be preserved from one algorithm call to the next. Scratch data can be shared with other algorithms or other instances of the same algorithm in a non-preemptive environment.
Per Channel Data – data memory that needs to be preserved from one algorithm call to the next. Also known as instance data or persistent data.
Heap – Persistent – dynamically allocated memory that must be maintained from one algorithm call to the next. This would typically be allocated at initialization time. (Rarely used by Adaptive Digital algorithms)
Heap – Scratch – dynamically allocated memory that is allocated on the fly and freed before the return from any algorithm functions. (Rarely used by Adaptive Digital algorithms)
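To show how these categories combine into a memory budget, here is a minimal sketch. The class, the field names, and the byte counts are hypothetical. It assumes a non-preemptive environment, so that a single scratch area can be shared by all channels, and it ignores heap usage.

```python
from dataclasses import dataclass


@dataclass
class AlgorithmMemory:
    """Memory figures for one algorithm, in bytes (hypothetical values)."""
    program: int      # program memory (text, code)
    shared_data: int  # data memory common to all channels (tables, constants)
    scratch: int      # scratch data, shared between channels
    per_channel: int  # instance data preserved between calls


def total_bytes(mem: AlgorithmMemory, channels: int) -> int:
    """Total memory for `channels` instances of one algorithm.

    Program, shared data, and scratch are counted once; only the
    per-channel data scales with the number of channels.
    """
    if channels < 0:
        raise ValueError("channels must be non-negative")
    return mem.program + mem.shared_data + mem.scratch + channels * mem.per_channel


# Hypothetical algorithm: 20000 + 4000 + 1000 + 16 * 500 = 33000 bytes
example = AlgorithmMemory(program=20000, shared_data=4000, scratch=1000, per_channel=500)
budget_16_channels = total_bytes(example, 16)
```

In a preemptive environment the scratch area may not be shareable, in which case it would need to be counted once per channel as well.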
MOS is a method used for evaluating speech quality. An algorithm is assigned a score on a scale from 1 to 5 where 5 is best and 1 is worst. MOS scoring is done by recording many speech samples that have been processed through an algorithm, and letting human listeners rate the speech quality. The scores are combined to form the overall MOS score. MOS can be estimated using algorithmic (non-human) analysis. One such analysis technique is named Perceptual Evaluation of Speech Quality (PESQ).
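As a rough illustration of how listener ratings are combined, the sketch below takes the arithmetic mean, as the name Mean Opinion Score suggests. The function name and the ratings are made up for the example, and a real listening test involves many more listeners and controlled conditions.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Combine individual listener ratings (1 = worst, 5 = best) into one score."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1 to 5 scale")
    return mean(ratings)


# Four hypothetical listeners rating the same processed speech sample.
score = mean_opinion_score([4, 3, 5, 4])  # (4 + 3 + 5 + 4) / 4 = 4
```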
Algorithms can be divided into two categories; we’ll call them bit-exact and non-bit-exact for the purpose of this discussion. Bit-exact algorithms are usually specified by a standards organization such as the ITU or ETSI in a bit-exact fashion. In other words, for a given input stream, the output must be identical to that produced by the standard’s reference algorithm. Bit-exact algorithms are usually defined in terms of fixed-point “C” code. The significance of the bit-exact nature of these algorithms is that all implementations will produce identical results.
Vocoders are typically specified in a bit-exact fashion. Because every compliant implementation produces the same output, the difference between competing products lies in processor utilization (MIPS and memory), ease of use, and support. Our algorithms excel in all of these areas.
Non-bit-exact algorithms are either not defined by a standards organization, or are defined by a standards organization using a method that does not guarantee bit-exact performance. In the latter case, a standard may specify minimum measurable requirements and methods for performing tests and making the measurements.
Non-Bit-Exact algorithms can be differentiated in the same ways as Bit-exact algorithms (MIPS, Memory, ease of use, and support) PLUS differences in voice quality and by how much the algorithm EXCEEDS the minimum requirements. Algorithms that are not specified in a bit-exact fashion include echo cancellers, tone detectors and generators, and noise reduction.
When it comes to these algorithms, Adaptive Digital truly excels. We strive for the best possible voice quality as well as the most reliable tone detectors while minimizing DSP resource utilization.
In the DSP software world, an algorithm is a function or set of functions that processes an input and produces an output. A simple example is an FIR filter. Other examples are vocoders, echo cancellers, tone detectors, tone generators, etc.
Adaptive Digital licenses many algorithms, primarily in object code library format. Developers can make use of these libraries by linking the libraries into their own DSP applications.
Some developers prefer not to develop their own DSP application. In such cases, Adaptive Digital offers turnkey DSP solutions. In this case, Adaptive Digital supplies a complete application that can be downloaded to a DSP and executed upon reset. Adaptive Digital supplies an API in “C” source code format that makes it easy to interface with the DSP application – typically through the DSP’s host port interface.
Since these two models are not ideal for all developers, Adaptive Digital offers more options.
Source Code: The developer can license source code if he/she wants to modify the algorithm or turnkey application
G.PAK ™: G.PAK is a turnkey solution that can be custom-built using a Windows GUI. The GUI allows the user to configure the application, including necessary features and excluding unnecessary features, thereby taking a minimum of DSP resources. Using the GUI, the user gets the best of both worlds – flexibility without having to touch the inside of the DSP.
VoIP Engine™: VoIP Engine (VE) is at the core of our Linux and ARM-based VoIP applications. It is a software engine package that handles all the voice processing from PCM to packet and back. Its intended use is in VoIP-enabled handsets or desktop phones.
VoIP Engine is a software framework that takes an audio stream and performs all the processing necessary to hand off an RTP packet to the network stack, and the same in the opposite direction. VoIP Engine includes the all-important acoustic echo cancellation plus speech compression (G.711, G.729AB, G.722), noise reduction, automatic gain control, RTP and Jitter Buffering. Beyond that, we offer sample apps on Linux and Android, in source code format, that can give app developers a jump start, enabling them to focus on the app rather than the voice transport. The engine itself is offered as an object code library.
VoIP Engine is supplied with a sample Java application and a sample native application that in turn interfaces with the VoIP Engine software. The sample Java application interfaces with the sample native application via Java Native Interface (JNI) to set up an RTP/IP to RTP/IP VoIP connection.
G.PAK™ is a VoIP framework that allows us to create custom VoIP software solutions on TI DSPs. G.PAK is packaged in a number of ways.
- G.PAK: When we refer to a G.PAK solution, we are referring to a case in which we use the G.PAK configuration tool to build a custom solution for you.
For example, you may want us to build a solution on a C6416 DSP that includes G.711, G.729AB, DTMF tone detection, and G.168 echo cancellation. It may include 16 channels with both TDM and packet channel interfaces.
We would build the solution for you and deliver a downloadable binary image, ANSI “C” API code that runs on your host controller to facilitate interfacing to the G.PAK software on the DSP, and the associated documentation.
The API software includes functions to download the binary image to the DSP chip, set up the chip, set up and tear down channels, read and write packets, read tone detect information, generate tones, read status information, etc. In short, a G.PAK solution is an SoC, customized to your needs.
- G.PAK demo: The G.PAK demo on our web site is truly just a demonstration. If you have a C6416 DSK board, you can download the demonstration, build a customized G.PAK software image, download it onto the DSK, and run speech signals through G.PAK in real-time.
By going through this process, you will see that G.PAK can be customized using a Windows-based utility that customizes the build under TI’s Code Composer Studio IDE. But, the downloadable G.PAK demonstration is truly for demonstration only. It includes restricted versions of a few of the algorithm (vocoder, etc.) libraries that work only for 15 minutes per channel setup. Furthermore, there is little application source code available for you to modify.
Adaptive Digital offers flexible licensing models. At one end of the spectrum is the one-time model. In a one-time fee model, the customer pays for the license entirely up-front and pays no per-channel or per-chip royalties. At the other end of the spectrum is the royalty-based model. In this case, the customer pays on a per-channel or per-chip basis at the time when the customer manufactures his/her end product. This model is similar to purchasing a chip. Somewhere in between these two models lies a model where a low up-front fee is paid along with a per-channel or per-chip royalty.
Typically, Adaptive Digital’s products come with 90 days of e-mail and phone support to assist customers through the integration process. Beyond the initial 90 days, customers may purchase maintenance agreements on an annual basis.
There are two general types of echo cancellers found in communications systems: Line (or Network) Echo Cancellers and Acoustic Echo Cancellers.
Line echo cancellers cancel echo that is caused by 2-wire to 4-wire hybrid circuits (SLICs, DAAs).
Acoustic echo cancellers cancel acoustic echo that is caused by speaker to microphone feedback in a hands-free telephone or intercom environment.
The line and acoustic echo characteristics are so different that different echo canceller algorithms are warranted. For example, the acoustic echo environment can change rapidly and often. Even if the speaker and microphone are not moving, objects in their vicinity can move, including people. Also, the acoustic environment is very much a function of the surroundings. A large conference room with many windows is a harsher environment than a small office or automobile.
A line echo canceller is used when the echo canceller is electrically close to the hybrid circuit that causes the echo because a line echo canceller is designed to handle echoes that are delayed by a relatively short time period – typically 32 milliseconds.
A network echo canceller need not be located close to the hybrid circuit. It can be located on the opposite side of a packet network, for example, where the round-trip delay is relatively large. Network echo cancellers typically handle echoes that are delayed by up to 128 milliseconds. Furthermore, network echo cancellers are designed to cancel reflections from multiple echo sources at differing delays within the network.
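To put the 32 ms and 128 ms figures in perspective, the sketch below converts an echo delay span into a number of samples at the 8 kHz telephony sampling rate. The function name is hypothetical, and the sketch says nothing about how any particular canceller models the echo internally. It only shows how much signal history the delay span covers.

```python
def tail_length_samples(tail_ms: int, sample_rate_hz: int = 8000) -> int:
    """Number of samples spanned by an echo tail of `tail_ms` milliseconds."""
    if tail_ms < 0 or sample_rate_hz <= 0:
        raise ValueError("tail_ms must be non-negative and sample_rate_hz positive")
    return tail_ms * sample_rate_hz // 1000


line_tail = tail_length_samples(32)      # 256 samples
network_tail = tail_length_samples(128)  # 1024 samples
```

The longer span is four times the shorter one, which gives a sense of why supporting long network delays generally costs more DSP resources.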
You should look for a number of things:
Certification. You want an echo canceller that has been independently certified by a reputable company or lab.
Deployment. You want to select an echo canceller that has been well deployed by many companies in many countries.
G.168 compliance. This seems obvious, but not only should an echo canceller meet (hopefully exceed) the G.168 requirements, it should do so under all test conditions including all the ITU specified hybrid models. Many cancellers are only tested under the least difficult conditions.
Subjective Testing: It is well known that an echo canceller can meet all G.168 requirements and still not sound good. Some cancellers have artifacts that are missed by G.168 but picked up under live conditions with real people talking and listening. It is therefore important that subjective testing be done – ideally a methodical test using a method such as the Mean Opinion Scoring (MOS) method, and ideally by an independent test lab.
When you add many parties to a voice conference, you can run into trouble if you simply sum all the signals together. First of all, you will accumulate the noise from all parties. Second, you will most likely start clipping the signal. (Even if you do a simple sum, you must be sure to subtract each party’s signal from the sum before feeding the signal back to each party.) If you are performing 3-way calling, the summation is probably good enough.
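The simple summation approach, including the subtraction of each party's own signal, can be sketched as follows. This is a minimal illustration, assuming 16-bit linear PCM and one sample per party per sampling instant. The function names are hypothetical, and the sketch is not Adaptive Digital's conferencing algorithm.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def saturate(value: int) -> int:
    """Clamp a sum to the 16-bit PCM range instead of letting it wrap around."""
    return max(INT16_MIN, min(INT16_MAX, value))


def mix_minus(samples):
    """Return, for each party, the sum of all other parties' samples.

    `samples` holds one 16-bit PCM sample per party for the same instant.
    Each party's own sample is subtracted from the total so that nobody
    hears their own voice fed back.
    """
    total = sum(samples)
    return [saturate(total - own) for own in samples]


# Three-way call with quiet signals: no clipping occurs.
quiet = mix_minus([100, 200, -50])        # [150, 50, 300]

# Three loud parties: two of the outputs hit the 16-bit ceiling.
loud = mix_minus([30000, 20000, 10000])   # [30000, 32767, 32767]
```

The second example shows the clipping problem mentioned above: as more parties are summed, the result exceeds the sample range more often, and every party's background noise is included in every output.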
As the number of parties increases, you will most likely want to add more sophisticated features such as voice activity detection, noise suppression, automatic level control, and dominant speaker identification. These are features that you will find in a high quality conferencing algorithm. Adaptive Digital’s conferencing algorithm has been licensed by many equipment manufacturers as well as by one of the largest VoIP semiconductor companies.
Adaptive Digital’s DTMF detector not only requires the lowest MIPS in the industry, it complies fully with applicable standards, it exhibits excellent talkoff performance, and it is very well deployed. (Talkoff is a test that is run on DTMF detectors to ensure that the probability that a DTMF tone will be detected falsely due to speech signals is very small.) Beyond standards compliance, our DTMF detector has seen many real-world conditions where DTMF generators do not comply with specifications and networks are sub-par. Our detector is designed to handle real-world conditions and is configurable to be able to handle sub-par conditions.