Packet Loss
The two major problem areas that affect IP telephony QoS are packet loss and delay. The two QoS impairment factors are sometimes interrelated.
Packet loss causes voice clipping and skips, often resulting in choppy and sometimes unintelligible speech. Voice packets can be dropped if the network quality is poor, the network is congested, or there is too much variable delay in the network. Poor network quality can lead to sessions frequently going out of service due to lost physical or logical connections. To avoid lost or late packets, it is necessary to engineer the IP telephony network to minimize situations that cause the problem, but even the best-engineered system will not stop congestion-induced packet loss and delay. To combat this problem, it is recommended that a buffer be used on the receiving end of a connection. Buffer length must be kept to a minimum because it contributes to end-to-end network delay. Dynamic receive buffers that increase or decrease in size can be used to handle late packets during periods of congestion and avoid unnecessary delays when traffic is light or moderate.
Packet problems that occur at the sending end of a connection can be handled by methods such as interleaving and forward error correction (FEC). Interleaving is the resequencing of speech frames before they are packetized. For example, if each packet has two frames, the first packet contains frames 1 and 3 and the second packet contains frames 2 and 4. If a packet is lost, the missing speech frames will be nonconsecutive and the gaps will be less noticeable to the receiving party. FEC is a method that copies information from one packet to the next packet in the sequence. This allows the copied data to be used in the event a packet is lost or late.
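The two-frame interleaving example can be sketched in a few lines of Python. This is a minimal illustration rather than a codec implementation: the function names `interleave` and `deinterleave`, the `packets_per_block` parameter, and the use of `None` to mark a missing frame are all assumptions made for the example.

```python
def interleave(frames, packets_per_block=2):
    """Spread one block of frames over several packets.

    Packet i carries frames i, i + n, i + 2n, ... (n = packets_per_block),
    so neighbouring frames never travel in the same packet.
    """
    return [frames[i::packets_per_block] for i in range(packets_per_block)]


def deinterleave(packets, lost=frozenset()):
    """Restore the original frame order at the receiver.

    `lost` holds the indexes of packets that never arrived; their frames
    are returned as None so the gaps are visible.
    """
    n = len(packets)
    total = sum(len(p) for p in packets)
    frames = [None] * total
    for i, packet in enumerate(packets):
        if i in lost:
            continue
        for j, frame in enumerate(packet):
            frames[i + j * n] = frame
    return frames


packets = interleave([1, 2, 3, 4])          # [[1, 3], [2, 4]]
received = deinterleave(packets, lost={1})  # [1, None, 3, None]
```

Losing the second packet removes frames 2 and 4, which are not adjacent, so each gap is only one frame long.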
Different methods are used at the receiving end of the connection. Unlike a circuit switched network, a packet switched network breaks communications signals into small samples, or packets of information. Each packet has a unique header that identifies packet destination and provides information on reconstruction when the packet arrives. Packets travel independently across the LAN/WAN and can travel by different routes during a single call. Packets can be lost for two primary reasons: dead-end routes and network congestion. Network congestion can lead to packet drops and variable packet delays. Voice packet drops from network congestion are usually caused by full transmit buffers on the egress interfaces somewhere in the network. The packet is purposely dropped to manage congested links. As links or connections approach 100 percent use, the queues servicing those connections become full. When a queue is full, new packets attempting to enter the queue are discarded. This can occur on an Ethernet switch or IP network router. Network congestion is typically sporadic, and delays from congestion tend to be variable in nature. Egress interface queue wait times or large serialization delays cause variable delays of this type.
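The drop behavior of a full egress queue can be illustrated with a small sketch. The class name `TailDropQueue` and its `capacity` parameter are hypothetical, and real switches and routers use more elaborate queue management; the sketch only shows that a packet arriving at a full queue is discarded.

```python
from collections import deque


class TailDropQueue:
    """A fixed-size transmit queue that discards arrivals when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = deque()
        self.dropped = 0

    def enqueue(self, packet):
        """Queue a packet, or drop it if the queue is already full."""
        if len(self.packets) >= self.capacity:
            self.dropped += 1
            return False
        self.packets.append(packet)
        return True

    def dequeue(self):
        """Remove the oldest packet for transmission (None if empty)."""
        return self.packets.popleft() if self.packets else None


queue = TailDropQueue(capacity=3)
accepted = [queue.enqueue(seq) for seq in range(5)]
# The first three packets are queued; the last two are dropped.
```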
DSP elements in most current voice codecs can correct for up to 30 milliseconds of lost voice. If the voice payload sample is no greater than this loss time, the correction algorithm is effective, provided that only a single packet is lost at any given time. There are several methods that can compensate for lost or long-delayed packets. It is not practical to search for a lost packet to try to retrieve it. A preferred option is to conceal packet loss by replacing lost packets with something similar. One approach is to replay the last correctly received packet in place of the lost one. This is a simple solution that is acceptable for rare packet loss, but a more complex solution is required for situations of frequent packet loss.
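The simplest concealment approach, replaying the previous packet, can be sketched as follows. The function name `conceal_by_repetition` and the convention of marking a lost packet with `None` are assumptions for illustration only.

```python
def conceal_by_repetition(packets):
    """Replace each lost packet (None) with a copy of the last good one.

    If the very first packet is lost there is nothing to replay, so the
    gap is left as None.
    """
    output = []
    last_good = None
    for packet in packets:
        if packet is None:
            output.append(last_good)
        else:
            last_good = packet
            output.append(packet)
    return output


played = conceal_by_repetition(["p1", None, "p3", "p4"])
# played == ["p1", "p1", "p3", "p4"]
```

When several packets in a row are lost, the same packet is repeated again and again, which is why this method suits only rare, isolated losses.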
Several techniques are available for replacing a lost packet. One technique is to estimate the information that would have been in the packet. This concealment method generates synthetic speech to cover missing data. The synthetic speech should have spectral characteristics similar to those of the speaker. This is relatively easy for a CELP-type codec such as G.729A because the speaker’s voice signals are modeled during the encoding process. It is a more difficult process if a waveform codec such as G.711 is used, because the amplitude of the waveform is coded rather than making assumptions about how the sound was produced. G.711 codec packet loss concealment requires more complex processing algorithms and more memory, and it adds to system delay. A waveform codec such as G.711 has a compensating advantage: it can rapidly recover from packet loss because the first speech sample in the first good packet restores the speech to the original, whereas CELP-based codecs require a few frames to catch up.
The concealment process requires the receiver codec to store a copy of recent output speech in a circular history buffer, which is used to calculate the current pitch and waveform characteristics. With the first bad packet, the contents of the buffer are used to generate a synthetic replacement signal for the duration of the concealment. When two consecutive frames are lost, repeating a single pitch can result in harmonic artifacts, or beeps, that are noticeable when the erasure lands on unvoiced speech sounds, such as s or f, or rapid transitions, such as the stops p, k, and d. Concealment algorithms often increase the number of pitch periods used to create replacement signals when multiple packets are lost. This results in a variation of the signal and creates more realistic synthetic speech.
There must be a smooth transition between synthesized and real speech signals. The first good packet after an erasure needs to be merged smoothly into the synthetic signal. This is done by mixing synthesized speech from the buffer with the real signal for a short time after the erasure period.
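The mixing step can be pictured as a short crossfade. The sketch below uses a simple linear ramp, which is an assumption chosen for clarity; actual concealment algorithms may use a different window shape and overlap length. The function name `crossfade` and its arguments are hypothetical.

```python
def crossfade(synthetic, real):
    """Blend synthetic speech into real speech over an overlap region.

    Both inputs cover the same short interval after the erasure. The
    weight of the real signal rises linearly while the weight of the
    synthetic signal falls, so there is no abrupt jump between them.
    """
    if len(synthetic) != len(real):
        raise ValueError("overlap regions must have the same length")
    n = len(synthetic)
    mixed = []
    for i in range(n):
        real_weight = (i + 1) / (n + 1)
        mixed.append((1 - real_weight) * synthetic[i] + real_weight * real[i])
    return mixed


overlap = crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
# overlap == [0.75, 0.5, 0.25]
```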
Packet loss can become noticeable when a few percent of the packets are dropped or delayed, and it begins to seriously affect QoS when the percentage of lost packets exceeds a certain threshold (roughly 5 percent of the packets). Major problems also occur when packet losses are grouped together in large bursts. The methods for dealing with packet loss must be balanced against the delay they add to packet transport between connected parties.
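Both measures mentioned here, the overall loss percentage and the size of loss bursts, can be computed from a simple arrival record. The sketch below assumes a list of booleans, one per expected packet, with `True` meaning the packet arrived in time; the function name `loss_stats` and the sample trace are illustrative.

```python
def loss_stats(received):
    """Return (loss percentage, longest run of consecutive losses)."""
    lost = sum(1 for ok in received if not ok)
    longest_burst = 0
    current_burst = 0
    for ok in received:
        current_burst = 0 if ok else current_burst + 1
        longest_burst = max(longest_burst, current_burst)
    loss_pct = 100 * lost / len(received) if received else 0.0
    return loss_pct, longest_burst


# Hypothetical trace: 20 packets expected, two lost back to back.
trace = [True] * 8 + [False] * 2 + [True] * 10
loss_pct, longest_burst = loss_stats(trace)
# loss_pct == 10.0, longest_burst == 2
```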
Latency
Delay, commonly referred to as latency, is the time delay incurred in speech by the IP telephony system. It is usually measured in milliseconds from the time a station user begins to speak until the listener actually hears speech. One-way latency is known as mouth-to-ear latency. Round-trip latency is the sum of the two one-way latencies comprising a voice call. Round-trip latency in a circuit switched PBX system takes less than a few milliseconds; PSTN round-trip latency is usually tens of milliseconds but almost always less than 150 milliseconds. Based on formal Mean Opinion Score (MOS) tests, latency at or under 150 milliseconds is not noticeable to most people. Latency up to 150 milliseconds receives good to excellent MOS scores ranging between 4 and 5 (1–5 scale) and provides for a satisfactory IP telephony QoS experience. One hundred fifty milliseconds is specified in the ITU-T G.114 recommendation as the maximum desired one-way latency to achieve high-quality voice. Switched network latency above 250 milliseconds, more common for international calls, becomes noticeable and receives fair MOS scores. Latency above 500 milliseconds is annoying and deemed unsatisfactory for conducting an acceptable conversation.
Latency in an IP telephony network is incurred at several nodal points across the voice call path, including the IP telephony gateways at the transmitting and receiving ends of a conversation. Latency is cumulative, and any latency introduced by any component in an IP telephony system will directly affect the total latency experienced by the station users.
The gateway network interface for an IP peripheral voice terminal may be an integrated component of the telephone instrument or an external device, such as a desktop IP adapter module, or embedded on a port circuit interface card housed in a PBX port carrier. The network interface in a gateway includes any hardware or software that connects the gateway to the telephone system or network. The typical network interface frames and converts digitized audio PCM data streams into the internal PCM bus for transport across a DSP. There is usually very little latency induced in this process, with typical maximums well below 1 millisecond. The DSP function is more complex because it involves compression or decompression of speech, tone detection, silence detection, tone generation, echo cancellation, and generation of “comfort” noise. The entire DSP mechanism is known collectively as vocoding.
DSP operations depend on processing entire frames of data at one time. The side effect of processing data in frames is that none of the data can be processed until the frame is completely full. Digitized speech arrives at a fixed rate of 8,000 samples per second, and the size of the frame processing the data will directly affect the amount of latency. A 100-sample frame would take 12.5 milliseconds to fill, and a 1000-sample frame would take 125 milliseconds to fill. Deciding on the frame size is a compromise: the larger the frame, the greater the DSP efficiency, but with that comes greater latency. Each standard voice coding method uses a standard frame size. The maximum latency incurred by the framing process depends directly on the selection of vocoder.
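The frame fill times quoted above follow directly from the sampling rate. A minimal sketch of the arithmetic, with the function name `frame_fill_ms` chosen for illustration:

```python
SAMPLE_RATE_HZ = 8000  # digitized speech rate given in the text


def frame_fill_ms(samples_per_frame, sample_rate_hz=SAMPLE_RATE_HZ):
    """Time in milliseconds needed to collect one full frame of samples."""
    return samples_per_frame * 1000 / sample_rate_hz


small = frame_fill_ms(100)    # 12.5
large = frame_fill_ms(1000)   # 125.0
```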
A G.711 voice codec can be programmed for frame size specifications, and very small frame duration delays can be used. A typical G.711 programmed frame duration is 0.75 milliseconds. A G.723.1 voice codec results in far greater frame delay than a G.729A voice codec, with only a slight comparative bandwidth savings.
After the collection of an entire frame is completed, the DSP algorithms must be run on the newly created frame. The time required to complete the processing varies considerably but never exceeds the frame collection time; otherwise, the DSP would never complete processing one frame before the next frame arrived. A DSP responsible for multiple gateway channels would continually process signals from one channel to the next. The latency incurred due to the DSP process is usually specified as the frame size in milliseconds, although the total latency from framing and processing is somewhere between the frame size and twice the frame size.
There are three other gateway processes that add to latency: buffering, packetization, and jitter buffer. Buffering can occur when the resulting compressed voice data frames are passed to the network. This buffering is done to reduce the number of times the DSP needs to communicate to the gateway main processor. In other situations, it is done to make the result of coding algorithms fit into one common frame duration (not length).
A multichannel gateway might be operating with different voice codecs on different channels. For example, a universal IP port interface card in a converged IP-PBX system may be handling G.729A off-premises calls across several gateway channels and G.711 premises-only calls across other gateway channels. In that case, multiple G.711 frames may be collected for each G.729A frame so that, irrespective of the coding algorithm, one buffer is transferred per fixed period of 10 milliseconds.
As coded voice (compressed or noncompressed) is being prepared for transport across a LAN or WAN, it needs to be assembled into packets. This process is typically done by the TCP/IP protocol stack with UDP and RTP. The selection of these protocols improves timely delivery of the voice data and eliminates the overhead of transmission acknowledgments and retries. Each packet has a 40-byte header (combined IP/UDP/RTP headers) that contains the source and destination IP addresses, the port numbers, the packet sequence number, and other protocol information needed to properly transport the data. After the combined header, one or more frames of coded voice data follow.
An important consideration for voice coder selection is the decision of whether to pack more than one frame of data into a single packet. A G.723.1 voice coder (which produces 24-byte frames every 30 milliseconds) would have 40 bytes of header and 24 bytes of data. That would make the header 167 percent of the voice data payload, and a very inefficient use of bandwidth resources. The most common way to reduce the inefficiency of the IP packet overhead is to put more than one coded voice frame into each IP packet. If two frames are passed per packet, the overhead figure drops to 83 percent, but another frame period is added to the latency total. This is a trade-off dilemma of an IP telephony system. To avoid increased latency but reduce overhead, multiple voice frames across gateway channels can be transported in the same packet. When voice from another channel in the originating gateway is going to the same destination gateway, the data can be combined into a single packet. The standard H.323 protocol does not support this latency saving process, but proprietary solutions can implement it.
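The overhead figures above can be checked with a short calculation. The sketch uses the 40-byte header and the G.723.1 frame parameters given in the text; the function name `packet_stats` is illustrative, and the bit rate it reports counts only the IP/UDP/RTP packet, not any link-layer framing.

```python
HEADER_BYTES = 40  # combined IP/UDP/RTP header, as stated in the text


def packet_stats(frame_bytes, frame_ms, frames_per_packet):
    """Return (header overhead %, packet interval in ms, bit rate in kbit/s)."""
    payload_bytes = frame_bytes * frames_per_packet
    overhead_pct = 100 * HEADER_BYTES / payload_bytes
    packet_ms = frame_ms * frames_per_packet
    # Bits per millisecond is numerically equal to kbit/s.
    bitrate_kbps = (HEADER_BYTES + payload_bytes) * 8 / packet_ms
    return overhead_pct, packet_ms, bitrate_kbps


# G.723.1: 24-byte frames every 30 milliseconds.
one_frame = packet_stats(24, 30, 1)   # overhead about 167 %, 30 ms per packet
two_frames = packet_stats(24, 30, 2)  # overhead about 83 %, 60 ms per packet
```

Doubling the frames per packet halves the relative overhead but also doubles the packet interval, which is the added frame period of latency described above.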
Jitter buffer latency is based on the variability in the arrival rate of data across the network because exact transport times cannot be guaranteed. Network latency affects how much time a voice packet spends in the network, but jitter controls the regularity at which voice packets arrive. Typical voice sources generate voice packets at a constant rate. The matching voice decompression algorithm also expects incoming voice packets to arrive at a constant rate. However, the packet-by-packet delay inflicted by the network may be different for each packet, resulting in irregular packet arrival at the gateway. During the voice decoding process, the system must compensate for jitter and does this by buffering one packet of data from the network before passing it to the destination DSP. Having these “jitter buffers” significantly reduces the occurrence of data starvation and ensures that timing is correct when sending data to the DSP. Without jitter buffers, there is a very good chance that gaps in the data would be heard in the resulting speech. Jitter buffering improves the speech quality heard by the receiving station user but incurs more latency. The larger the jitter buffers, the more tolerant the system is of jitter in the data from the network, but the additional buffering causes more latency. Most systems use a jitter buffer time of no longer than 30 milliseconds, although 20 milliseconds is the most commonly used time. Jitter buffer time is usually programmable by the system administrator. Jitter buffering can be programmed at the desktop gateway or at any network gateway node.
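The trade-off between jitter tolerance and added latency can be seen in a simplified playout model. The sketch assumes that packets are generated at a constant interval, that the first packet arrives, and that playout starts a fixed `buffer_ms` after that first arrival; the function name `arrives_in_time` and the arrival times are hypothetical.

```python
def arrives_in_time(arrivals_ms, frame_ms=20, buffer_ms=20):
    """For each packet, report whether it arrives before its playout time.

    Packet i is due for playout at
        first_arrival + buffer_ms + i * frame_ms.
    A packet that arrives after that instant is too late to be played.
    """
    first_arrival = arrivals_ms[0]
    results = []
    for i, arrival in enumerate(arrivals_ms):
        playout_time = first_arrival + buffer_ms + i * frame_ms
        results.append(arrival <= playout_time)
    return results


# Hypothetical arrival times (ms) for four packets sent 20 ms apart.
arrivals = [50, 75, 115, 112]
with_20ms = arrives_in_time(arrivals, buffer_ms=20)  # third packet is late
with_30ms = arrives_in_time(arrivals, buffer_ms=30)  # all packets in time
```

The larger buffer absorbs the late third packet, at the cost of 10 milliseconds of additional latency on every packet.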
Beyond gateway latency, there is network latency. Network latency can occur at network interface points, router nodes, and firewall/proxy server points. Network interfaces are points at which data is passed between different physical media used to interconnect gateways, routers, and other networking equipment. Examples are the RS-232C modem and T1-interface connections to the PSTN or LAN/WAN links. For a connection to a relatively slow analog transmission circuit via an RS-232C modem, a delay of more than 25 milliseconds would occur; a T1-interface connection might incur a 1-millisecond delay; and a 100-Mbps Ethernet connection might incur a delay of less than 0.01 millisecond based on a 100-byte data packet.
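A large part of these interface delays is serialization time, the time needed to clock the bits of a packet onto the link. The sketch below computes it for a 100-byte packet. The T1 rate of 1.544 Mbps is standard, but the 28.8-kbps modem speed is an assumption, since the text does not state the modem rate; the function name `serialization_delay_ms` is illustrative.

```python
def serialization_delay_ms(packet_bytes, link_bps):
    """Time in milliseconds to place a packet on a link of the given rate."""
    return packet_bytes * 8 * 1000 / link_bps


PACKET_BYTES = 100

modem = serialization_delay_ms(PACKET_BYTES, 28_800)         # assumed modem rate
t1 = serialization_delay_ms(PACKET_BYTES, 1_544_000)         # T1 line rate
ethernet = serialization_delay_ms(PACKET_BYTES, 100_000_000)  # 100-Mbps Ethernet

for name, delay in [("modem", modem), ("T1", t1), ("Ethernet", ethernet)]:
    print(f"{name}: {delay:.3f} ms")
```

Under these assumptions the modem figure exceeds 25 milliseconds, the T1 figure is roughly half a millisecond, and the Ethernet figure is 0.008 millisecond, in line with the values quoted above.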
Routing latency can be incurred because each packet is examined for address destination and overhead headers before directing the packet to the proper route. The queuing logic used by many currently installed routers was designed without considering the needs of IP telephony. There are problems resulting from the real-time requirement of voice communications. Many existing routers use best-effort routing, which is far from ideal for latency-sensitive voice traffic. Current IP routers support priority programming; without it, a router delays all data during congestion, irrespective of the application. For example, routers supporting the IETF’s RSVP allow a gateway-to-gateway connection to establish a guaranteed bandwidth commitment on the intermediate network equipment, which would dramatically reduce the variability in packet delivery and improve the QoS. Multi-Protocol Label Switching (MPLS) is another recent router programming tool that can reduce routing latency.
Network firewalls or proxy servers that provide security between the corporate intranet and Internet must examine every incoming and outgoing IP packet. This process can incur a sizable amount of latency, so their use is almost always avoided in IP telephony applications. Routers with packet filter features can support some network security without significant added latency. Stand-alone firewalls or proxy servers must receive, decode, examine, validate, encode, and send every packet. A proxy server provides a very high level of network security but can incur more than 500 milliseconds of latency. This is not a problem for the Web-browsing applications for which proxy servers were designed, but it is clearly unacceptable for real-time voice communications. This is one reason using the relatively insecure Internet as a voice network is not yet practical.
When all latency elements are added up, one-way latency can seriously affect IP telephony QoS.
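A simple latency budget makes the cumulative effect concrete. The 150-millisecond target is the G.114 figure cited earlier; every component value in the sketch is hypothetical and chosen only to illustrate the addition, and the names `one_way_latency_ms` and `budget` are assumptions.

```python
G114_ONE_WAY_TARGET_MS = 150  # maximum desired one-way latency per ITU-T G.114


def one_way_latency_ms(components):
    """Sum the latency contributed by each element along the call path."""
    return sum(components.values())


# Hypothetical one-way budget in milliseconds; not measured values.
budget = {
    "framing": 30,
    "dsp_processing": 30,
    "jitter_buffer": 20,
    "network": 40,
}

total = one_way_latency_ms(budget)
within_target = total <= G114_ONE_WAY_TARGET_MS
print(f"one-way latency: {total} ms, within target: {within_target}")
```

In this illustration only 30 milliseconds of margin remain, so packing a second 30-millisecond frame into each packet would use up the entire margin.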