Purpose
This white paper describes the protocols
involved in the transmission of voice samples through an IP based network.
This document aims to give the reader the basic grounding that is required
to further investigate the bandwidth requirements of voice over IP.
This paper does not discuss header
compression schemes, and does not discuss layer 2 protocols.
Furthermore, this paper only considers IPv4 and not IPv6.
Layered model
In common with many
communications systems, the protocols involved in Voice over IP (VoIP)
follow a layered hierarchy which can be compared with the theoretical
OSI (Open Systems Interconnection) seven layer model developed by the
International Organization for Standardization (ISO). Breaking a system
into defined layers can make
that system more manageable and flexible. Each layer has its job,
and does not need a detailed understanding of the layers around it.
For example, IP datagrams can be
transported across a variety of link layer systems including serial lines
(using PPP), Ethernet and Token Ring. The link layer protocol is for
the most part irrelevant to IP (unless that protocol limits the size of
its datagrams), and need not be the same for the first link of a Voice
over IP call and the final link of a VoIP call.
As always there are exceptions (such as IP
over ATM), but the simple discrete layered model will be considered in
this document.
The effect of each layer's contribution
to the communication process is an additional header preceding the
information being transmitted. The complete packet which a layer
creates (header and data) becomes the data passed to the next level for
processing. That layer will then add a header portion, and so on...
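The encapsulation process described above can be illustrated with a minimal sketch. The header contents here are placeholder zero octets, and the 160 octet block of voice samples is an assumed value chosen for illustration; only the header lengths (12, 8 and 20 octets, as given later in this document) are meaningful.

```python
def encapsulate(header: bytes, data: bytes) -> bytes:
    """Prepend one layer's header to the data handed down from the layer above."""
    return header + data


# Placeholder headers: only their lengths are realistic.
voice = bytes(160)                                  # assumed block of voice samples
rtp_packet = encapsulate(bytes(12), voice)          # RTP header + samples
udp_datagram = encapsulate(bytes(8), rtp_packet)    # UDP header + RTP packet
ip_datagram = encapsulate(bytes(20), udp_datagram)  # IP header + UDP datagram

print(len(rtp_packet), len(udp_datagram), len(ip_datagram))
```

Each call treats the complete output of the previous layer as opaque data, which is why a layer needs no detailed knowledge of the layers around it.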
Each layer, starting at the Network (or
Internet) Layer, is considered in the sections which follow.
IP (Internet Protocol)
The Internet Protocol is the
lowest level protocol considered in this document. It is responsible
for the delivery of packets (or datagrams) between host
computers. IP is a connectionless protocol, that is, it does not
establish a virtual connection through a network prior to commencing
transmission; this is the job for higher level protocols.
IP makes no guarantees concerning
reliability, flow control, error detection or error correction. The
result is that datagrams could arrive at the destination computer out of
sequence, with errors or not even arrive at all. Nevertheless, IP
succeeds in making the network transparent to the upper layers involved in
voice transmission through an IP based network.
Any Voice over IP transmission must use IP
(by definition). IP is not well suited to voice transmission.
Real time applications such as voice and video require a guaranteed
connection with consistent delay characteristics. Higher layer
protocols address these issues (to a certain extent).
The diagram below shows the header that
precedes the data payload to be transmitted. In its most basic
form, the header comprises 20 octets. There are optional fields
which can be appended to the basic header, but these offer additional
capabilities which are not necessary for VoIP transmission as described in
this document.
The fields shown are briefly described
below:
- Version
-
The version of IP being used. For
this format header, the version would be 4.
- Internet Header Length (IHL)
-
The length of the IP header in units of
four octets (32 bits). For the basic header shown in this
diagram, the value would be 5 (each line in the diagram represents
four octets).
- Type of Service
-
Specifies the quality of service
requested by the host computer sending the datagram. This is not
always effectively supported by routers or Internet Service Providers.
- Total Length
-
The length of the datagram, measured in
octets, including the header and payload.
- Identification
-
As well as handling the addressing of
datagrams between two computers (or hosts), IP needs to
handle the splitting of data payloads into smaller packages.
This process, known as fragmentation, is required because,
although a single IP datagram can have a theoretical maximum length
of 65,535 octets (65,515 octets of payload following a 20 octet
header), lower link layer protocols such as Ethernet cannot
always handle these large packet sizes. This field is a
reference number assigned by the sending host to aid in the reassembly
of a fragmented datagram.
- Flags
-
These flags indicate whether the
datagram may be fragmented, and, if it has been fragmented, whether
further fragments follow this one.
- Fragment Offset
-
This field indicates where in the
datagram this fragment belongs. It is measured in units of 8
octets (64 bits).
- Time to Live (TTL)
-
This field indicates the maximum time
the datagram is permitted to remain in the internet system. This
parameter ensures that a datagram which cannot reach its destination
host is given a finite lifetime.
- Protocol
-
This indicates the higher level
protocol in use for this datagram. Numbers have been assigned
for use with this field to represent such transport layer protocols as
TCP and UDP.
- Header Checksum
-
This is a checksum covering the header
only.
- Source Address
-
The IP address of the host which
generated this datagram. IPv4 addresses are 32 bits in length
and, when written or spoken, a dotted decimal notation is
used (e.g.: 192.168.0.1).
- Destination Address
-
The IP address of the destination host.
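As an illustration of the fields described above, the following sketch packs a basic 20 octet IPv4 header using Python's standard struct module. The function names, the destination address 192.168.0.2, the TTL of 64 and the choice to set the Don't Fragment flag are all assumptions made for this example; protocol number 17 is the value assigned to UDP.

```python
import socket
import struct

# Version/IHL, Type of Service, Total Length, Identification,
# Flags/Fragment Offset, TTL, Protocol, Header Checksum, Source, Destination
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")  # 20 octets, no options


def checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of the 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_ipv4_header(src: str, dst: str, payload_len: int,
                      ident: int = 0, ttl: int = 64, protocol: int = 17) -> bytes:
    """Build a basic IPv4 header (illustrative defaults, protocol 17 = UDP)."""
    version_ihl = (4 << 4) | 5                  # version 4, header of 5 x 4 octets
    total_length = IPV4_HEADER.size + payload_len
    flags_fragment = 0x4000                     # Don't Fragment set, offset 0
    fields = [version_ihl, 0, total_length, ident, flags_fragment,
              ttl, protocol, 0,                 # checksum is zero while computing
              socket.inet_aton(src), socket.inet_aton(dst)]
    fields[7] = checksum(IPV4_HEADER.pack(*fields))
    return IPV4_HEADER.pack(*fields)


header = build_ipv4_header("192.168.0.1", "192.168.0.2", payload_len=180)
print(len(header), hex(header[0]))
```

A receiver can validate the header by running the same checksum over all 20 octets, including the checksum field; a result of zero indicates that no error was detected in the header.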
UDP (User Datagram Protocol)
Generally, there are two
protocols available at the transport layer when transmitting information
through an IP network. These are TCP (Transmission Control
Protocol) and UDP (User Datagram Protocol). Both protocols enable
the transmission of information between the correct processes (or
applications) on host computers. These processes are associated with
unique port numbers (for example, the HTTP application is usually
associated with port 80).
TCP is a connection oriented protocol;
that is, it establishes a communications path prior to transmitting data.
It handles sequencing and error detection, ensuring that a reliable stream
of data is received by the destination application.
Voice is a real-time application, and
mechanisms must be in place which ensure that information is received in
the correct sequence, reliably and with predictable delay characteristics.
Although TCP would address these requirements to a certain extent, its
retransmission of lost data introduces delays which real-time voice
cannot tolerate, and some functions are reserved for the layer above TCP.
Therefore, for the transport layer, TCP is not used, and the alternative
protocol, UDP, is commonly used.
In common with IP, UDP is a connectionless
protocol. UDP routes data to its correct destination port, but
does not attempt to perform any sequencing, or to ensure data reliability.
The fields shown are briefly described
below:
- Source Port
-
Identifies the higher layer process
which originated the data.
- Destination Port
-
Identifies the higher layer process to
which this data is being transmitted.
- Length
-
The length in octets of the UDP header
and payload (minimum 8).
- Checksum
-
Optional field supporting error
detection.
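The four UDP header fields can be sketched in the same way. In this example the port number 5004 and the 172 octet payload are illustrative assumptions, and the checksum is simply left at zero, which in IPv4 indicates that the sender did not compute one.

```python
import struct

# Source Port, Destination Port, Length, Checksum
UDP_HEADER = struct.Struct("!HHHH")  # 8 octets


def build_udp_header(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Build a UDP header for the given payload (checksum not computed)."""
    length = UDP_HEADER.size + len(payload)   # header plus payload, in octets
    return UDP_HEADER.pack(src_port, dst_port, length, 0)


payload = bytes(172)                          # assumed RTP packet of 172 octets
header = build_udp_header(5004, 5004, payload)
src_port, dst_port, length, csum = UDP_HEADER.unpack(header)
print(src_port, dst_port, length, csum)
```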
RTP (Real-time Transport Protocol)
Real time applications require mechanisms
to be in place to ensure that a stream of data can be reconstructed
accurately. Datagrams must be reconstructed in the correct order,
and a means of detecting network delays must be in place.
Jitter is the variation in delay
times experienced by the individual packets making up the data stream.
In order to reduce the effects of jitter, data must be buffered at the
receiving end of the link so that it can be played out at a constant rate.
To support this requirement, two protocols have been developed.
These are RTP (Real-time Transport Protocol) and RTCP (RTP Control
Protocol).
RTCP provides feedback on the quality of
the transmission link. RTP transports the digitised samples of real
time information. RTP and RTCP do not reduce the overall delay of
the real time information. Nor do they make any guarantees
concerning quality of service.
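To make the notion of jitter concrete, the sketch below computes a running jitter estimate from the send and receive times of successive packets. It uses the smoothing formula from the RTP specification (RFC 3550), in which each new delay difference contributes one sixteenth of its magnitude to the estimate. The function name and the sample times (in milliseconds) are assumptions made for this example.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate: J = J + (|D| - J) / 16 for each packet pair."""
    jitter = 0.0
    for i in range(1, len(send_times)):
        # D is the change in transit time between consecutive packets.
        d = ((recv_times[i] - recv_times[i - 1])
             - (send_times[i] - send_times[i - 1]))
        jitter += (abs(d) - jitter) / 16
    return jitter


send = [0, 20, 40, 60]                 # packets sent every 20 ms
steady = [50, 70, 90, 110]             # constant 50 ms network delay
uneven = [50, 75, 90, 112]             # delay varies from packet to packet

print(interarrival_jitter(send, steady))
print(interarrival_jitter(send, uneven))
```

A constant network delay produces no jitter at all, however large that delay is; only the variation between packets contributes to the estimate.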
The RTP header, which precedes the data
payload, is shown in the diagram below:
- Version (V)
-
Identifies the version of RTP
(currently 2).
- Padding (P)
-
A flag which indicates whether the
packet has been appended with padding octets after the payload data.
- Extension (X)
-
Indicates whether an optional
header extension has been added to the RTP header.
- CSRC count (CC)
-
The number of contributing source (CSRC) identifiers which follow
the fixed header. Although not shown on this header
diagram, the 12 octet header can optionally be expanded to include a
list of up to 15 contributing sources. Contributing sources are
added by mixers, and are only relevant for conferencing applications
where elements of the data payload have originated from different
computers. For point to point communications, CSRCs are not
required.
- Marker (M)
-
Allows significant events such as
frame boundaries to be marked in the packet stream.
- Payload type (PT)
-
This field identifies the format of the
RTP payload and determines its interpretation by the application.
- Sequence number
-
A reference number which
increments by one for each RTP packet sent. It allows the
receiver to reconstruct the sender's packet sequence and to detect
lost packets.
- Timestamp
-
The sampling instant of the first octet of the
payload. This field allows the receiver to buffer and
play out the data in a continuous stream.
- Synchronisation source (SSRC) number
-
A randomly chosen number which
identifies the source of the data stream.
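The RTP header fields described above can be sketched as follows. The function names, the SSRC value and the sequence and timestamp values are assumptions made for this example; payload type 0 is the value assigned to G.711 mu-law audio. No CSRC list or header extension is included, so the header is the basic 12 octets.

```python
import struct

# V/P/X/CC, M/PT, Sequence number, Timestamp, SSRC
RTP_HEADER = struct.Struct("!BBHII")  # 12 octets, no CSRC list


def build_rtp_header(payload_type: int, sequence: int, timestamp: int,
                     ssrc: int, marker: bool = False) -> bytes:
    """Build a basic RTP version 2 header without padding, extension or CSRCs."""
    byte0 = 2 << 6                               # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(byte0, byte1, sequence & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def parse_rtp_header(header: bytes) -> dict:
    """Split the first 12 octets of an RTP packet into its fields."""
    byte0, byte1, sequence, timestamp, ssrc = RTP_HEADER.unpack(header[:12])
    return {
        "version": byte0 >> 6,
        "padding": bool(byte0 & 0x20),
        "extension": bool(byte0 & 0x10),
        "csrc_count": byte0 & 0x0F,
        "marker": bool(byte1 & 0x80),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


header = build_rtp_header(payload_type=0, sequence=1, timestamp=160,
                          ssrc=0x12345678)
fields = parse_rtp_header(header)
print(len(header), fields)
```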
The headers of the three payload
carrying protocols discussed are sent sequentially before the digitised
voice or video samples, which form the payload of the RTP packet.
The result is a 40 octet overhead for
every packet of data:
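The 40 octet figure can be checked by adding up the sizes of the three basic headers. The sketch below describes each header as a struct format string (one code per field, in network byte order) and lets Python compute the sizes; the variable names are assumptions made for this example.

```python
import struct

# One format code per header field, with no IP options and no CSRC list.
HEADER_FORMATS = {
    "IP": "!BBHHHBBH4s4s",   # basic IPv4 header
    "UDP": "!HHHH",          # source port, destination port, length, checksum
    "RTP": "!BBHII",         # V/P/X/CC, M/PT, sequence, timestamp, SSRC
}

header_sizes = {name: struct.calcsize(fmt)
                for name, fmt in HEADER_FORMATS.items()}
overhead = sum(header_sizes.values())

print(header_sizes, overhead)
```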
RTP payload
The IP, UDP and RTP headers
are followed by the data payload of the RTP packet. This comprises
digitised samples of voice or video. The length of these samples
can vary, but for voice, samples representing 20 ms are considered the
maximum duration for the payload.
The selection of this payload duration is
a compromise between bandwidth requirements and quality. Smaller
payloads demand higher bandwidth per channel, because the header
length remains at forty octets. However, if payloads are increased,
the overall delay of the system will increase, and the system will be more
susceptible to the loss of individual packets by the network.
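The trade-off can be illustrated with some simple arithmetic. The sketch below assumes a 64 kbit/s voice codec (the rate of G.711, chosen here purely as an example) and counts only the IP, UDP and RTP headers; layer 2 overhead is excluded, in keeping with the scope of this paper. The function name is an assumption made for this example.

```python
HEADER_OCTETS = 40  # IP (20) + UDP (8) + RTP (12)


def ip_bandwidth_bps(codec_bps: float, payload_ms: float) -> float:
    """Bandwidth at the IP layer for one voice channel, in bits per second."""
    payload_octets = codec_bps * payload_ms / 1000 / 8
    packets_per_second = 1000 / payload_ms
    return (payload_octets + HEADER_OCTETS) * 8 * packets_per_second


for payload_ms in (5, 10, 20):
    print(payload_ms, "ms:", ip_bandwidth_bps(64000, payload_ms), "bit/s")
```

Under these assumptions a 20 ms payload carries 160 octets of voice behind 40 octets of header, so the headers account for one fifth of every packet; halving the payload duration doubles the number of headers sent each second.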
This subject is discussed in more detail
in the white paper Bandwidth
requirements for Voice over IP transmission.