RTP protocol ports. The use of Internet protocols in IP telephony

9. List of protocol constants

10. Description of the traffic profile and format

11. RTP profile for audio and video conferencing with minimal control

11.1. RTP and RTCP packet formats and protocol parameters

11.2. Registration of traffic types

11.3. Audio coding

11.4. Video coding

11.5. Port assignment

12. List of terms and abbreviations used

Principles of building the RTP protocol

Work control methods

RTP header format

Resource Reservation Protocol - RSVP

Literature

2.3.3. TCP protocol

2.3.4. UDP protocol

2.3.5. RTP and RTCP protocols

1. Basic concepts of telephony

2. Description of a bundle of SIP / SDP / RTP protocols

3. Transfer of information about pressed buttons

4. Voice and fax transmission

5. Digital signal processing (DSP). Ensuring sound quality in IP telephony, test examples

Basic concepts

Group audio conferencing

Video conferencing

Understanding mixers and translators

RTCP control protocol

RTCP Packet Rate


This section discusses some aspects of transporting RTP packets by network and transport protocols. Unless otherwise specified by other protocol specifications, the following general rules apply when transmitting packets.

RTP relies on underlying protocols to provide separation between RTP data streams and RTCP control information. For UDP and similar protocols, RTP uses an even port number, and the corresponding RTCP stream uses a port number one higher.
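The even/odd port rule above can be sketched as a small helper. The function name `rtcp_port_for` is hypothetical and only illustrates the convention; it is not part of any RTP library.

```python
def rtcp_port_for(rtp_port: int) -> int:
    """Return the RTCP port that pairs with an even RTP port (illustrative helper)."""
    if not 0 < rtp_port < 65535:
        raise ValueError("RTP port must be in the range 1..65534")
    if rtp_port % 2 != 0:
        raise ValueError("RTP port must be even")
    # The RTCP stream uses the next higher (odd) port number.
    return rtp_port + 1


print(rtcp_port_for(5004))
```

Rejecting odd RTP ports outright is one possible policy; an application could instead round down to the nearest even port.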

RTP data packets contain no length field, so RTP relies on the underlying protocol to indicate the packet length. The maximum length of RTP packets is limited only by the underlying protocols.

Multiple RTP packets can be carried in a single lower-layer protocol data unit, such as a UDP packet. This reduces header overhead and can simplify synchronization between different streams.

This section contains a list of constants defined in the RTP protocol specification.

RTP traffic type constants (PT - payload type) are defined in profiles. However, the octet of the RTP header that contains the marker bit(s) and the traffic type field must not take the reserved values 200 and 201 (decimal), so that RTP packets can be distinguished from RTCP SR and RR packets. For the standard format with one marker bit and a seven-bit traffic type field, this restriction means that traffic types 72 and 73 must not be used.
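The arithmetic behind the exclusion of traffic types 72 and 73 can be checked with a short sketch. It assumes the standard layout described above (one marker bit in the most significant position, followed by a seven-bit traffic type); the function name `second_octet` is illustrative.

```python
RTCP_SR = 200  # reserved value: RTCP sender report
RTCP_RR = 201  # reserved value: RTCP receiver report


def second_octet(marker: int, traffic_type: int) -> int:
    """Combine a one-bit marker and a seven-bit traffic type into one octet."""
    if marker not in (0, 1):
        raise ValueError("marker must be 0 or 1")
    if not 0 <= traffic_type <= 127:
        raise ValueError("traffic type must fit in seven bits")
    return (marker << 7) | traffic_type


# Traffic types whose octet collides with SR/RR when the marker bit is set.
colliding = [pt for pt in range(128)
             if second_octet(1, pt) in (RTCP_SR, RTCP_RR)]
print(colliding)
```

With the marker bit set, the octet equals 128 plus the traffic type, so 72 and 73 map exactly onto 200 and 201.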

The RTCP packet type values (see Table 1) are chosen in the range 200 to 204 to improve header validity checking of RTCP packets compared with RTP packets. When the RTCP packet type field is compared with the corresponding octet of the RTP header, this range corresponds to a marker bit of one (which is usually not the case in data packets) and a most significant bit of one in the standard traffic type field (whereas statically assigned traffic types usually have PT values with a zero in the most significant bit). The range was also chosen to lie some distance from the values 0 and 255, since all-zero and all-one fields are common data patterns.
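A receiver could use this range as a simple heuristic for telling the two packet kinds apart. The sketch below is an assumption-laden illustration: it treats only the types 200 to 204 mentioned above as RTCP, checks that the version field (the top two bits of the first octet) equals 2, and uses the hypothetical function name `classify`.

```python
def classify(packet: bytes) -> str:
    """Roughly classify a datagram as 'rtp', 'rtcp' or 'invalid'."""
    if len(packet) < 2:
        return "invalid"
    version = packet[0] >> 6          # top two bits of the first octet
    if version != 2:
        return "invalid"
    if 200 <= packet[1] <= 204:       # RTCP packet type range from the text
        return "rtcp"
    return "rtp"


print(classify(bytes([0x80, 200])))   # version 2, packet type 200
print(classify(bytes([0x80, 0])))     # version 2, marker 0, traffic type 0
```

A real implementation would apply further validity checks (length, padding, known traffic types) before trusting the result.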

Other RTCP packet types are assigned through IANA. Developers can register the values they need for experimental work and cancel the registration once those values are no longer needed.

The permitted item types in the SDES packet are listed in Table 2. Other SDES item types are assigned through IANA. Developers can register the values they need for experimental work and cancel the registration once those values are no longer needed.

As noted above (see Section 2), a complete description of RTP for a specific application requires additional documents of two types: a profile description and a traffic format description.

RTP can be used for many classes of applications with vastly different requirements. The flexibility to adapt to these requirements is provided by the use of different profiles. Typically an application uses only one profile, and there is no explicit indication of which profile is currently in use.

An additional document of the second type, the traffic format specification, defines how a particular type of traffic (e.g., video encoded according to H.261) is to be carried in RTP. The same traffic format can be used with multiple profiles and can therefore be defined independently of any profile. Profile documents are then responsible only for mapping this format to a PT value.

The following items may be defined in the profile description, but this list is not exhaustive.

RTP data packet header. The octet in the RTP data packet header that contains the marker bit and the traffic type field can be redefined by the profile to meet different requirements, for example, to provide more or fewer marker bits (section 3.3).

Traffic types. A profile typically defines a set of traffic formats (e.g., media coding algorithms) and a default static mapping of these formats to PT values. Some of the traffic formats can be identified by reference to separate traffic format descriptions. For each traffic type, the profile must specify the RTP timestamp clock rate to use (section 3.1).

RTP data packet header additions. Additional fields can be appended to the fixed header of the RTP data packet if some additional functionality is required across the profile's application class, regardless of the traffic type.

RTP data packet header extensions. The contents of the first 16 bits of the RTP data packet header extension structure must be specified if the profile allows the use of this mechanism.

RTCP packet types. New application-class-specific RTCP packet types can be defined and registered with IANA.

RTCP reporting interval. The profile should define the values to use in calculating the RTCP reporting interval: the fraction of the RTCP session bandwidth, the minimum reporting interval, and the split of the bandwidth between senders and receivers.

SR/RR packet extensions. If there is additional sender or receiver information that needs to be transmitted regularly, an extension section can be defined for RTCP SR and RR packets.

Using SDES. The profile can define relative priorities for RTCP SDES items to be transmitted or dropped (see section 4.2.2); an alternative syntax or semantics for the CNAME item (section 4.4.1); the format of the LOC item (section 4.4.5); the semantics and use of the NOTE item (section 4.4.7); and new SDES items to be registered with IANA.

Security. The profile can define which security services and algorithms applications should use, and can provide guidance on their use (section 7).

Password-to-key mapping. The profile can determine how a user-entered password is converted to an encryption key.

Lower-layer protocol. Carrying RTP packets may require the use of a specific underlying network or transport layer protocol.

Transport mapping. A mapping of RTP and RTCP to transport layer addresses, such as UDP ports, other than the standard mapping specified in section 8 may be defined.

Encapsulation. An encapsulation of RTP packets can be defined to allow multiple RTP data packets to be carried in one lower-layer protocol data unit (section 8).
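To illustrate the RTCP reporting interval item in the list above, here is a simplified sketch of how the three profile-defined values (the RTCP bandwidth fraction, the minimum interval, and the sender/receiver split) might enter the calculation. The defaults of 5 %, 5 seconds and 25 % are the values recalled from RFC 1889 and should be treated as assumptions; the function name is hypothetical, and the randomization and other refinements of the full algorithm are deliberately omitted.

```python
def rtcp_interval(members: int, senders: int, we_sent: bool,
                  session_bw: float, avg_rtcp_size: float,
                  rtcp_fraction: float = 0.05,
                  sender_fraction: float = 0.25,
                  min_interval_s: float = 5.0) -> float:
    """Deterministic part of the RTCP interval, in seconds.

    session_bw is in octets per second, avg_rtcp_size in octets.
    """
    rtcp_bw = session_bw * rtcp_fraction      # share of bandwidth for RTCP
    n = members
    # When senders are a small minority, they get a dedicated share.
    if 0 < senders < members * sender_fraction:
        if we_sent:
            rtcp_bw *= sender_fraction
            n = senders
        else:
            rtcp_bw *= 1.0 - sender_fraction
            n = members - senders
    return max(avg_rtcp_size * n / rtcp_bw, min_interval_s)


# Small session: the minimum interval dominates.
print(rtcp_interval(2, 1, True, session_bw=8000, avg_rtcp_size=100))
# Large session seen by a receiver: the interval grows with group size.
print(rtcp_interval(1000, 10, False, session_bw=16000, avg_rtcp_size=100))
```

The point of the sketch is only that the interval scales with the number of participants, so total RTCP traffic stays a fixed fraction of the session bandwidth.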

Not every new application should require a new profile. Within one class of applications, it is better to extend an existing profile than to create a new one. This makes it easier for applications to interoperate, since each usually runs under only one profile. Simple extensions, such as defining additional PT values or RTCP packet types, can be made by registering them with IANA and publishing their descriptions in a profile specification or in a traffic format specification.

RFC 1890 describes a profile for using the real-time transport protocol RTP version 2 and its associated control protocol RTCP in multiparticipant audio and video conferences: the RTP Profile for Audio and Video Conferences with Minimal Control. This profile defines aspects of RTP left unspecified in the RTP version 2 protocol specification (RFC 1889). Minimal control means that no support for parameter negotiation or membership control is required (for example, when static mappings of traffic types and the membership indications provided by RTCP are used). The main provisions of this profile are considered below.

This section contains a description of a number of items that can be defined or changed in a profile.

RTP data packet header. The standard format of the fixed RTP data packet header (one marker bit) is used.

Traffic types. The static values of the traffic types are defined in sections 11.3 and 11.4.

RTP data packet header additions. No additional fixed fields are appended to the RTP data packet header.

RTP data packet header extensions. No RTP data packet header extensions are defined, but applications using this profile may use such extensions. That is, applications should not assume that the X bit of the RTP header is always zero and must be prepared to ignore the header extension. If a header extension is defined in the future, the contents of its first 16 bits must be specified so that multiple different extensions can be identified.

RTCP packet types. No additional RTCP packet types are defined in this profile specification.

RTCP reporting interval. The constants proposed in RFC 1889 must be used when calculating the RTCP reporting interval.

SR/RR packet extensions. No extensions are defined for RTCP SR and RR packets.

Using SDES. Applications can use any of the SDES items described. While the canonical name (CNAME) is sent in every reporting interval, other items should be sent only in every fifth reporting interval.

Security. The default RTP security services are also the default under this profile.

Password-to-key mapping. The password entered by the user is hashed with the MD5 algorithm into a 16-octet digest. An N-bit key is obtained from the digest by taking its first N bits. It is assumed that the password contains only ASCII letters, digits, hyphens and spaces, to reduce the likelihood of transcription errors when passwords are conveyed by telephone, fax, telex or e-mail. The password may be preceded by a specification of the encryption algorithm: any characters up to the first forward slash (ASCII code 0x2f) are interpreted as the name of the encryption algorithm. If there is no forward slash, DES-CBC encryption is assumed by default.

Before the MD5 algorithm is applied, the password entered by the user is converted to canonical form. To do this, the password is converted to the ISO 10646 character set using UTF-8 encoding as defined in Annex P to ISO/IEC 10646-1:1993 (ASCII characters require no conversion); spaces at the beginning and end of the password are removed; two or more consecutive spaces are replaced with a single space (ASCII or UTF-8 0x20); and all letters are converted to lowercase.

Lower-layer protocol. The profile defines the use of RTP over UDP in unicast and multicast mode.

Transport mapping. The standard mapping of RTP and RTCP to transport layer addresses is used.

Encapsulation. No encapsulation of RTP packets is defined.
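The password-to-key mapping described in the list above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the key length is restricted to whole octets for simplicity, the algorithm prefix is split off before canonicalization (an assumption about ordering), and `str.split()` collapses any whitespace, not only the space character.

```python
import hashlib

DEFAULT_ALGORITHM = "DES-CBC"  # assumed when the input contains no slash


def canonicalize(password: str) -> bytes:
    """Trim, collapse runs of whitespace, lowercase, and encode as UTF-8."""
    collapsed = " ".join(password.split())
    return collapsed.lower().encode("utf-8")


def derive_key(user_input: str, key_bits: int) -> tuple[str, bytes]:
    """Return (algorithm name, key) taken from the first key_bits of the MD5 digest."""
    if key_bits % 8 != 0 or not 0 < key_bits <= 128:
        raise ValueError("this sketch supports whole-octet keys up to 128 bits")
    algorithm, slash, password = user_input.partition("/")
    if not slash:                      # no algorithm prefix present
        algorithm, password = DEFAULT_ALGORITHM, user_input
    digest = hashlib.md5(canonicalize(password)).digest()  # 16 octets
    return algorithm, digest[: key_bits // 8]


algorithm, key = derive_key("  Hello   World ", 64)
print(algorithm, key.hex())
```

Because of the canonical form, "  Hello   World " and "hello world" produce the same key, which is the purpose of the normalization step.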

This profile defines the standard encoding types used with RTP. Other encoding types must be registered with IANA before use. When registering a new encoding type, the following information must be provided:

the short name of the encoding type and the RTP timestamp clock rate (the name should be three or four characters long to allow a compact representation, if necessary);
an indication of who has change control over the encoding type (for example, ISO, CCITT/ITU, other international standards bodies, a consortium, a specific company or group of companies);
any operating parameters;
references to available descriptions of the encoding algorithm, such as (in order of preference) an RFC, a published article, a patent filing, a technical report, codec source code, or a reference;
for proprietary encoding types, contact information (postal address and e-mail address);
the traffic type value for this profile, if necessary (see below).

Note that not all encoding types to be used with RTP need to be statically assigned. Non-RTP means, which are not covered in this article, can be used to establish a dynamic mapping between a traffic type (PT) value in the range 96 to 127 and an encoding type.

The available value space for traffic types is quite small. New traffic types are assigned statically (permanently) only if the following conditions are met:

the encoding is of great interest to the Internet community;
it offers benefits compared with existing encodings and/or is required for interoperability with existing, widely used conferencing or multimedia systems;
the description is sufficient to build a decoder.

For applications that do not send packets during silence, the first packet of a talkspurt (the first packet after a silence period) is distinguished by the marker bit set to one in the RTP data packet header. Applications without silence suppression set this bit to zero.

The RTP clock used to generate the RTP timestamp is independent of the number of channels and the type of encoding; it is equal to the number of sampling periods per second. For N-channel coding (stereo, quad, etc.), each sampling period (say 1/8000 of a second) generates N samples. The total number of samples generated per second is equal to the sample rate times the number of channels.
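The distinction between the RTP clock and the total sample count can be made concrete with a small calculation. The helper names below are illustrative, and the 20 ms packet duration is just an example value.

```python
def timestamp_increment(sample_rate_hz: int, packet_ms: int) -> int:
    """RTP timestamp advance per packet: sampling periods, independent of channels."""
    return sample_rate_hz * packet_ms // 1000


def samples_per_packet(sample_rate_hz: int, channels: int, packet_ms: int) -> int:
    """Total samples carried in a packet: one per channel per sampling period."""
    return timestamp_increment(sample_rate_hz, packet_ms) * channels


# 8000 Hz audio in 20 ms packets, mono versus stereo.
print(timestamp_increment(8000, 20))       # same for mono and stereo
print(samples_per_packet(8000, 1, 20))
print(samples_per_packet(8000, 2, 20))
```

The timestamp advances by 160 per packet in both cases, while the stereo packet carries twice as many samples.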

When multiple audio channels are used, they are numbered from left to right, starting with one. In RTP audio packets, data from lower-numbered channels precedes data from higher-numbered channels. For more than two channels, the following notation is used:

l - left;
r - right;
c - center;
S - surround;
F - front;
R - rear.

Number of channels	System name	Channel 1	Channel 2	Channel 3	Channel 4	Channel 5	Channel 6
2	stereo	l	r
3		l	r	c
4	quad	Fl	Fr	Rl	Rr
4		l	c	r	S
5		Fl	Fr	Fc	Sl	Sr
6		l	lc	c	r	rc	S

Samples of all channels belonging to the same sampling time must be inside the same packet. The interleaving of samples from different channels depends on the type of coding.

The sampling rate should be chosen from the set 8000, 11025, 16000, 22050, 24000, 32000, 44100 and 48000 Hz. (Apple Macintosh computers have native sampling rates of 22254.54 and 11127.27 Hz, which can be converted to 22050 and 11025 Hz with acceptable quality by dropping four or two samples in a 20 ms frame.) However, most audio coding algorithms are defined for a more limited set of sampling rates. Receivers must be prepared to accept multichannel audio, but may choose to play only a single channel.

For packetized audio, the default packetization interval is 20 ms, unless otherwise specified in the encoding description. The packetization interval determines the minimum end-to-end delay. Longer packets carry relatively less header overhead, but they introduce more delay and make packet loss more noticeable. For non-interactive applications, such as lectures, or for links with severe bandwidth constraints, a longer packetization delay may be acceptable. A receiver should accept packets representing between 0 and 200 ms of audio. This limit allows a reasonable buffer size for the receiver.
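The trade-off between header overhead and packetization interval can be estimated with a rough calculation. The figures below are assumptions for illustration: a 12-octet fixed RTP header with no CSRC entries, an 8-octet UDP header, a 20-octet IPv4 header without options, and an encoding that produces one octet per sample at 8000 Hz.

```python
HEADER_OCTETS = 20 + 8 + 12  # assumed IPv4 + UDP + fixed RTP header sizes


def payload_octets(sample_rate_hz: int, octets_per_sample: int, packet_ms: int) -> int:
    """Audio payload size for one packet of a single-channel stream."""
    return sample_rate_hz * packet_ms // 1000 * octets_per_sample


def header_overhead(sample_rate_hz: int, octets_per_sample: int, packet_ms: int) -> float:
    """Fraction of each packet that is header rather than audio."""
    payload = payload_octets(sample_rate_hz, octets_per_sample, packet_ms)
    return HEADER_OCTETS / (HEADER_OCTETS + payload)


for ms in (10, 20, 100, 200):
    print(ms, "ms:", round(header_overhead(8000, 1, ms) * 100, 1), "% header")
```

At 20 ms the headers take about a fifth of each packet under these assumptions; lengthening the interval reduces that share but adds the corresponding delay.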

In sample-based encodings, each signal sample is represented by a fixed number of bits. Within compressed audio data, individual sample codes may cross octet boundaries. The duration of the signal transmitted in a sound packet is determined by the number of samples in the packet.

For sample-based encoding types that produce one or more octets per sample, samples from different channels taken at the same instant are packed into adjacent octets. For example, for stereo sound the octet sequence is: left channel, first sample; right channel, first sample; left channel, second sample; right channel, second sample; and so on. In multi-octet encodings, the most significant octet is transmitted first. The packing of sample-based encodings that produce less than one octet per sample is determined by the encoding algorithm.
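The interleaving rule can be illustrated for a 16-bit linear encoding. The function name is hypothetical; the sketch assumes signed 16-bit samples and equal-length channels.

```python
import struct


def pack_interleaved_16bit(channels: list[list[int]]) -> bytes:
    """Pack per-channel sample lists into an interleaved, big-endian payload."""
    if not channels or any(len(c) != len(channels[0]) for c in channels):
        raise ValueError("all channels must have the same number of samples")
    payload = bytearray()
    # zip(*channels) yields, per sampling instant, one sample from each channel
    # in channel order (left first for stereo).
    for instant in zip(*channels):
        for sample in instant:
            payload += struct.pack(">h", sample)  # most significant octet first
    return bytes(payload)


left = [1, 2]
right = [-1, -2]
print(pack_interleaved_16bit([left, right]).hex())
```

The output order is left sample 1, right sample 1, left sample 2, right sample 2, each as two octets with the high octet first.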

A frame-based coding algorithm converts a fixed-length audio block into another compressed data block, usually also of fixed length. For frame-based encodings, the sender may combine several such frames into one packet.

For frame-based codecs, the channel order applies to whole blocks. That is, for stereo sound, the samples for the left and right channels are encoded independently, and the encoded frame for the left channel precedes the frame for the right channel.

All frame-oriented audio codecs must be able to encode and decode multiple consecutive frames transmitted within a single packet. Since the frame size for the frame-oriented codecs is specified, there is no need to use a separate notation for the same encoding, but with a different number of frames per packet.
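Because the frame size is fixed, a receiver can infer the number of frames in a packet from the payload length alone. The sketch below uses a 33-octet frame covering 20 ms as example values (these correspond to the GSM encoding as recalled from RFC 1890 and are assumptions here); the function names are illustrative.

```python
def frames_in_payload(payload_len: int, frame_octets: int) -> int:
    """Number of whole codec frames in a payload; rejects partial frames."""
    frames, remainder = divmod(payload_len, frame_octets)
    if frames == 0 or remainder != 0:
        raise ValueError("payload is not a whole number of frames")
    return frames


def payload_duration_ms(payload_len: int, frame_octets: int, frame_ms: int) -> int:
    """Audio duration represented by the payload."""
    return frames_in_payload(payload_len, frame_octets) * frame_ms


# Three 33-octet frames of 20 ms each in one packet.
print(frames_in_payload(99, 33), payload_duration_ms(99, 33, 20))
```

This is why no separate designation is needed for the same encoding with a different number of frames per packet.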

Table 3 shows the traffic type (PT) values defined by this profile for audio signals, their names, and the main characteristics of the coding algorithms.

Table 4 shows the traffic type (PT) values, the names of the coding algorithms and the technical characteristics of the video coding algorithms defined by this profile, as well as the unassigned, reserved and dynamically assigned PT values.

Traffic type values in the range 96 to 127 can be defined dynamically through a conference control protocol, which is not covered in this article. For example, a session directory may specify that for a given session, traffic type 96 denotes two-channel PCMU coding at 8000 Hz. The range of traffic type values marked as "reserved" is not used, so that RTCP and RTP packets can be reliably distinguished.
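One way to model the combination of static and dynamic traffic types is a per-session lookup table. The sketch below is illustrative only: the class and field names are hypothetical, the static table is limited to the two types named in this article (0 for PCMU and 5 for DVI4, both assumed single-channel at 8000 Hz), and the dynamic binding mirrors the session directory example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    name: str
    clock_rate_hz: int
    channels: int


STATIC_TYPES = {
    0: Encoding("PCMU", 8000, 1),
    5: Encoding("DVI4", 8000, 1),
}
DYNAMIC_RANGE = range(96, 128)  # 96..127 inclusive


class SessionTypes:
    """Traffic type table for one RTP session."""

    def __init__(self) -> None:
        self._dynamic: dict[int, Encoding] = {}

    def bind(self, pt: int, encoding: Encoding) -> None:
        """Bind a dynamic traffic type, as a session directory might."""
        if pt not in DYNAMIC_RANGE:
            raise ValueError("only 96..127 can be bound dynamically")
        self._dynamic[pt] = encoding

    def resolve(self, pt: int) -> Encoding:
        """Look up a traffic type; raises KeyError if it is unknown."""
        if pt in DYNAMIC_RANGE:
            return self._dynamic[pt]
        return STATIC_TYPES[pt]


session = SessionTypes()
session.bind(96, Encoding("PCMU", 8000, 2))  # two-channel PCMU at 8000 Hz
print(session.resolve(96))
print(session.resolve(0))
```

Keeping the dynamic bindings per session reflects that the same value may mean different encodings in different sessions.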

An RTP source emits only one type of traffic at any given time; interleaving different traffic types in one RTP session is not allowed. Several RTP sessions can be used in parallel to carry different types of traffic. The traffic types defined in this profile refer to either audio or video, but not both. However, it is allowed to define combined traffic types that carry, for example, both audio and video, with appropriate separation in the traffic format.

Audio applications using this profile must, at a minimum, be able to send and receive traffic types 0 (PCMU) and 5 (DVI4). This allows interoperability without format negotiation.

As defined in the RTP protocol description, RTP data must be sent over an even-numbered UDP port, and corresponding RTCP packets must be sent over an odd-numbered port.

Applications using this profile can use any such UDP port pair. For example, a pair of ports may be randomly assigned by the session management program. A single fixed pair of port numbers cannot be specified, because in some cases several applications using this profile must run correctly on the same host, and some operating systems do not allow multiple processes to use the same UDP port with different multicast addresses.

However, port numbers 5004 and 5005 can be used as defaults. Applications that use multiple profiles can select this port pair as an indicator of this profile. Applications can also require the port pair to be specified explicitly.

ASCII (American Standard Code for Information Interchange) - a 7-bit code for representing text, used with some modifications in most computing systems
CBC (cipher block chaining) - a mode of operation of the DES data encryption standard
CELP (code-excited linear prediction) is a type of audio coding that uses code-excited linear prediction
CNAME (canonical name) - canonical name
CSRC (contributing source) - a source of a stream of RTP packets that has contributed to the combined stream produced by an RTP mixer. The mixer inserts into the RTP packet header a list of the SSRC identifiers of the sources that contributed to that packet. This list is called the CSRC list. Example: a mixer conveys the identifiers of the teleconference participants currently speaking, whose voices were mixed to produce the outgoing packet, thereby indicating the current talker to the receiver even though all audio packets carry the same SSRC identifier (that of the mixer)
DES (Data Encryption Standard) - data encryption standard
IANA (Internet Assigned Numbers Authority) - the body responsible for assigning Internet protocol numbers
IMA (Interactive Multimedia Association) - Interactive Multimedia Association
IP (Internet Protocol) - internetwork protocol, a network layer datagram protocol. Allows packets to cross multiple networks en route to their destination
IPM (IP Multicast) - multicast using the IP protocol
LD-CELP (low-delay code excited linear prediction) - a speech coding algorithm using code excited linear prediction with low delay
LPC (linear predictive coding) - linear predictive coding
NTP (Network Time Protocol) - network time protocol. An NTP timestamp counts time in seconds relative to 0h on January 1, 1900. The full NTP timestamp format is a 64-bit unsigned fixed-point number with the integer part in the first 32 bits and the fractional part in the last 32 bits. In some cases a more compact representation is used, in which only the middle 32 bits of the full format are taken: the lower 16 bits of the integer part and the upper 16 bits of the fractional part.
RPE / LTP (residual pulse excitation / long term prediction) - a speech coding algorithm with residual pulse excitation and long-term prediction
RTCP (RTP Control Protocol) - the control protocol for real-time data transmission over RTP
RTP (Real-Time Transport Protocol) - real time transport protocol
SSRC (synchronization source) - the source of a stream of RTP packets, identified by a 32-bit numeric SSRC identifier carried in the RTP header, independent of the network address. All packets from one synchronization source share the same timing and sequence number space, so a receiver groups packets by synchronization source for playback. Examples of synchronization sources: the sender of a stream of packets derived from a signal source such as a microphone or a video camera, or an RTP mixer. A synchronization source may change its data format, for example the audio coding, over time. The SSRC identifier is a randomly chosen value intended to be globally unique within a particular RTP session. A teleconference participant need not use the same SSRC identifier for all RTP sessions in a multimedia session; the binding of SSRC identifiers is provided through RTCP. If a participant generates multiple streams in one RTP session, for example from multiple video cameras, each stream must be identified by a separate SSRC
TCP (Transmission Control Protocol) is a transport layer protocol used in conjunction with IP
UDP (User Datagram Protocol) - a connectionless transport layer protocol. UDP only allows a packet to be sent to one or more stations on the network. Verifying and ensuring the integrity of data transmission (guaranteed delivery) is left to a higher layer
ADPCM - Adaptive Differential Pulse Code Modulation
jitter - phase or frequency deviations of a signal; in IP telephony, variation in the delay of datagrams in the network
ZPD - data link (the second layer of the Open Systems Interconnection Reference Model)
IVS - information and computer networks
mixer - an intermediate system that receives RTP packets from one or more sources, possibly changes the data format, combines the packets into a new RTP packet and then forwards it. Since multiple signal sources are generally not synchronized, the mixer adjusts the timing of the component streams and generates its own timing for the combined stream. Thus, all data packets generated by the mixer are identified as having the mixer as their synchronization source.
monitor - an application that receives the RTCP packets sent by participants in an RTP session, in particular reception reports, and estimates the current quality of service for distribution monitoring, fault diagnosis and long-term statistics. The monitor function is usually built into the applications participating in the session, but it can also be a separate application that does not otherwise participate and neither sends nor receives RTP data packets. Such applications are called third-party monitors.
ITU-T - International Telecommunication Union Telecommunication Standardization Sector
end system - an application that generates the content sent in RTP packets and/or consumes the content of received RTP packets. An end system can act as one or more (but usually only one) synchronization sources in each RTP session
RTCP packet - a control packet consisting of a fixed header part, similar to that of RTP data packets, followed by structural elements that vary depending on the RTCP packet type. Typically, multiple RTCP packets are sent together as a compound RTCP packet in a single packet of the underlying protocol; this is made possible by the length field in the fixed header of each RTCP packet
RTP packet - a protocol data unit consisting of the fixed RTP header, a possibly empty list of contributing sources, extensions, and the traffic data. Usually one lower-layer protocol packet contains one RTP packet, but it can contain several
port - an abstraction used by transport layer protocols to distinguish between multiple destinations within a single host computer. A port is identified by its number. Thus, the port number identifies the specific application for which the data being sent is intended. This number, together with an indication of which transport protocol (for example, TCP or UDP) is in use, is carried among other control information in datagrams sent over the Internet. The transport selectors (TSEL) used by the OSI transport layer are equivalent to ports
profile - a set of RTP and RTCP parameters for a class of applications that determines how they operate. The profile defines the use of the marker bit and traffic type fields in the RTP data packet header, the traffic types, RTP data packet header additions, the first 16 bits of the RTP data packet header extension, RTCP packet types, the RTCP reporting interval, SR/RR packet extensions, the use of SDES packets, the services and algorithms for communication security, and the specifics of using the lower-layer protocol
RTP session - the association among a set of participants communicating via RTP. For each participant, the session is defined by a particular pair of destination transport addresses (one network address plus a port pair for RTP and RTCP). The destination transport address pair can be common to all participants (as in the case of IPM) or different for each (an individual network address and a common port pair, as in unicast communication). In a multimedia session, each type of traffic is carried in a separate RTP session with its own RTCP packets. Multicast RTP sessions are distinguished by different port pair numbers and/or different multicast addresses
non-RTP means - protocols and mechanisms that may be needed in addition to RTP to provide a usable service. In particular, for multimedia conferences, a conference control application can distribute multicast addresses and encryption keys, negotiate the encryption algorithm to be used, and define dynamic mappings between RTP traffic type values and the traffic formats they represent (for formats that have no predefined traffic type value). For simple applications, e-mail or a conference database can also be used
translator - an intermediate system that forwards RTP packets without changing their synchronization source identifier. Examples of translators: devices that convert encodings without mixing, replicators from multicast to unicast, and application-level filters in firewalls
transport address - A combination of network address and port number that identifies the endpoint of the transport layer, such as an IP address and UDP port number. Packets are forwarded from the source transport address to the destination transport address
RTP traffic - multimedia data transmitted in an RTP packet, such as audio samples or compressed video data
PSTN - public switched telephone network
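The NTP timestamp formats described in the glossary above (the full 64-bit form and the compact middle 32 bits) can be sketched as follows. The function names are illustrative; the constant 2208988800 is the number of seconds between January 1, 1900 and the Unix epoch of January 1, 1970, and the conversion from a Unix time is an assumption added for the example.

```python
NTP_UNIX_OFFSET = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01


def to_ntp64(unix_seconds: float) -> int:
    """Full NTP timestamp: 32-bit integer seconds, 32-bit binary fraction."""
    ntp = unix_seconds + NTP_UNIX_OFFSET
    integer = int(ntp)
    fraction = int((ntp - integer) * 2**32)
    return ((integer & 0xFFFFFFFF) << 32) | fraction


def compact32(ntp64: int) -> int:
    """Middle 32 bits: low 16 bits of the seconds, high 16 bits of the fraction."""
    return (ntp64 >> 16) & 0xFFFFFFFF


stamp = to_ntp64(0.5)  # half a second after the Unix epoch
print(hex(stamp), hex(compact32(stamp)))
```

The compact form wraps roughly every 18 hours (2^16 seconds), which is enough for measuring short intervals such as round-trip delays.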

The most pressing problem is increasingly becoming the lack of address space, which requires a change in the format of the address.

Another problem is the lack of scalability of routing, the foundation of IP networks. The rapid growth of the network causes congestion on routers, which already today have to maintain routing tables with tens or hundreds of thousands of entries, as well as solve the problem of packet fragmentation. It is possible to make the operation of routers easier, in particular, by upgrading the IP protocol.

Along with the introduction of new functions directly into the IP protocol, it is advisable to ensure its closer interaction with new protocols by introducing new fields into the packet header.

As a result, it was decided to upgrade the IP protocol, pursuing the following main goals:

creation of a new extended addressing scheme;
improving network scalability by reducing the functions of backbone routers;
ensuring data protection.

Expanding the address space. IPv6 addresses the potential shortage of addresses by expanding the address length to 128 bits. However, such a significant increase in address length was made largely not to relieve the address shortage but to improve the efficiency of networks based on this protocol. The main goal was to change the structure of the addressing system and to extend its functionality.

Instead of the existing two levels of the address hierarchy (network number and node number), IPv6 proposes to use four levels, which implies three-level network identification and one level for node identification.

Now the address is written in hexadecimal form, with every four digits separated from each other by a colon, for example:

FEDC:0A96:0:0:0:0:7733:567A.

For networks that support both versions of the IPv4 and IPv6 protocol, it is possible to use traditional decimal notation for the lower 4 bytes, and hexadecimal for the upper ones:

0:0:0:0:0:FFFF:194.135.75.104.
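Both notations can be checked with Python's standard `ipaddress` module. For the mixed form, the sketch assumes the usual IPv4-mapped layout of five zero groups followed by FFFF and the dotted IPv4 address.

```python
import ipaddress

# Pure hexadecimal notation: eight 16-bit groups separated by colons.
full = ipaddress.IPv6Address("FEDC:0A96:0:0:0:0:7733:567A")
print(full.exploded)    # every group written with four digits
print(full.compressed)  # the run of zero groups shortened to "::"

# Mixed notation: hexadecimal upper part, dotted decimal lower 4 bytes.
mixed = ipaddress.IPv6Address("0:0:0:0:0:FFFF:194.135.75.104")
print(mixed.ipv4_mapped)  # the embedded IPv4 address
```

The module normalizes the text form, so differently written but equal addresses compare as equal.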

Within the IPv6 addressing system, there is also a dedicated address space for local use, that is, for networks that are not part of the Internet. There are two types of local addresses: for "flat" networks without subnetting (Link-Local) and for networks divided into subnets (Site-Local), which differ in their prefix value.

Changing the format of packet headers. This is achieved through a new scheme of "nested headers", which divides the header into a main header, containing the necessary minimum of information, and additional headers, which may be absent. This approach opens up rich possibilities for extending the protocol by defining new optional headers, making the protocol open.

The basic 40-byte IPv6 datagram header has the following format (Figure 2.4).

The Traffic Class field is equivalent in purpose to the IPv4 Type of Service field, and the Hop Limit field to the IPv4 Time to Live field.

The Flow Label field makes it possible to single out individual data flows and process them specially without analyzing the contents of the packets. This is very important for reducing the load on routers.

The Next Header field is analogous to the IPv4 Protocol field and identifies the type of header that follows the main header. Each subsequent additional header also contains a Next Header field.
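As an illustration of the 40-byte main header, the sketch below unpacks its fields with the standard `struct` module. It assumes the usual layout: a 4-bit version, 8-bit Traffic Class and 20-bit Flow Label in the first 32 bits, followed by a 16-bit payload length, the Next Header and Hop Limit octets, and two 128-bit addresses. The class and function names are hypothetical.

```python
import ipaddress
import struct
from typing import NamedTuple

MAIN_HEADER = struct.Struct("!IHBB16s16s")  # 4 + 2 + 1 + 1 + 16 + 16 = 40 bytes


class IPv6Header(NamedTuple):
    version: int
    traffic_class: int
    flow_label: int
    payload_length: int
    next_header: int
    hop_limit: int
    source: ipaddress.IPv6Address
    destination: ipaddress.IPv6Address


def parse_ipv6_header(data: bytes) -> IPv6Header:
    """Parse the fixed IPv6 main header from the start of a datagram."""
    if len(data) < MAIN_HEADER.size:
        raise ValueError("need at least 40 bytes")
    word, length, next_header, hop_limit, src, dst = MAIN_HEADER.unpack_from(data)
    return IPv6Header(
        version=word >> 28,
        traffic_class=(word >> 20) & 0xFF,
        flow_label=word & 0xFFFFF,
        payload_length=length,
        next_header=next_header,
        hop_limit=hop_limit,
        source=ipaddress.IPv6Address(src),
        destination=ipaddress.IPv6Address(dst),
    )


# Build a sample header: version 6, Next Header 17 (UDP), Hop Limit 64.
sample = MAIN_HEADER.pack(
    (6 << 28) | (0x2E << 20) | 0x12345, 20, 17, 64,
    ipaddress.IPv6Address("::1").packed, ipaddress.IPv6Address("::2").packed,
)
print(parse_ipv6_header(sample))
```

A router that only needs the Flow Label or Hop Limit can read them at fixed offsets, which is part of what keeps main header processing cheap.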

The Transmission Control Protocol (TCP) was developed to support interactive communication between computers. TCP ensures reliable, accurate data exchange between processes on computers that are part of a common network.

Unfortunately, TCP is poorly suited to carrying multimedia information. The main reason is its delivery control: acknowledgment and retransmission take too long for delay-sensitive information. In addition, TCP provides mechanisms to control the transmission rate in order to avoid network congestion. Audio and video data, however, require strictly defined bit rates that cannot be changed arbitrarily.

On the one hand, TCP interacts with the application protocol of the user application, and on the other, with the protocol that provides "low-level" packet routing and addressing functions, which are usually performed by IP.

The logical structure of network software that implements the protocols of the TCP / IP family in each node of the Internet is shown in Fig. 2.5.

The rectangles represent the modules that process the data, and the lines connecting the rectangles represent the data transfer paths. The horizontal line at the bottom of the figure denotes an Ethernet network, which is used as an example of a physical medium.

Fig. 2.5.

To establish a connection between two processes on different computers in a network, you need to know not only the Internet addresses of the computers but also the numbers of the TCP ports (sockets) that the processes use on those computers. Any TCP connection on the Internet is uniquely identified by two IP addresses and two TCP port numbers.

TCP can handle damaged, lost, duplicated, or out-of-order packets. This is achieved through a mechanism for assigning a sequence number to each transmitted packet and a mechanism for checking the receipt of packets.

When TCP transmits a segment of data, a copy of that data is placed in a retransmission queue and a timer is started to wait for an acknowledgment.

The User Datagram Protocol (UDP) is used to exchange datagrams between processes of computers located in a unified system of computer networks.

UDP is based on IP and provides application processes with transport services that are not much different from those of IP. UDP protocol provides non-guaranteed data delivery, that is, it does not require confirmation of its receipt; in addition, this protocol does not require the establishment of a connection between the source and the receiver of information, that is, between UDP modules.

Real-time transport protocol RTP provides real-time end-to-end transmission of multimedia data such as interactive audio and video. This protocol implements traffic type recognition, packet sequence numbering, time stamping and transmission control.

The action of the RTP protocol is reduced to assigning timestamps to each outgoing packet. On the receiving side, the timestamps of the packets indicate in what sequence and with what delays they should be played back. RTP and RTCP support allows the receiving node to arrange received packets in the correct order, reduce the effect of packet delay jitter on the signal quality, and restore synchronization between audio and video so that incoming information can be correctly listened to and viewed by users.

Note that RTP itself has no mechanism to guarantee timely delivery or quality of service; it relies on lower-layer services for this. It does not prevent out-of-order delivery, and it does not assume that the underlying network is reliable or delivers packets in sequence. The sequence numbers included in RTP allow the receiver to reconstruct the sender's packet sequence.

RTP supports both bidirectional communication and data transfer to a group of destinations if the multicast is supported by the underlying network. RTP is designed to provide the information required by individual applications, and in most cases is integrated into the operation of the application.

Although RTP is considered a transport layer protocol, it usually runs on top of another transport layer protocol, UDP (User Datagram Protocol). Both protocols contribute to the functionality of the transport layer. It should be noted that RTP and RTCP are independent of the underlying transport and network layers, so RTP / RTCP can be used with other suitable transport protocols.

RTP / RTCP protocol data units are called packets. Packets generated in accordance with RTP and used to transmit multimedia data are called data packets, and packets generated in accordance with RTCP and used to carry the service information required for reliable operation of a teleconference are called control packets. An RTP packet includes a fixed header, an optional variable-length header extension, and a data field. An RTCP packet begins with a fixed portion (similar to the fixed portion of RTP data packets) followed by variable-length building blocks.

To keep RTP flexible enough to serve various applications, some of its parameters are deliberately left undefined, and the protocol provides the concept of a profile instead. A profile is a set of RTP and RTCP parameters for a specific class of applications, which determines how the protocols operate for that class. The profile defines the use of individual packet header fields, traffic types, header additions and header extensions, packet types, security services and algorithms, specifics of using the lower layer protocol, and so on. Each application usually works with only one profile, and the profile is selected by choosing the appropriate application. The profile type is not indicated explicitly by a port number, protocol identifier, or similar field.

Thus, a complete application-specific RTP specification should include additional documents that include a profile description as well as a traffic format description that defines how traffic of a particular type, such as audio or video, will be handled in RTP.

Multicast audio conferencing requires a multicast group address and two ports. One port is used for the exchange of audio data, and the other for RTCP control packets. The multicast address and port information is distributed to the prospective participants of the teleconference. If privacy is required, data and control packets can be encrypted, in which case an encryption key must also be generated and distributed.

The audio conferencing application used by each conference participant sends audio data in small chunks, for example 20 ms long. Each chunk of audio data is preceded by an RTP header; the RTP header and data are in turn encapsulated in a UDP packet. The RTP header indicates what type of audio coding (for example, PCM, ADPCM, or LPC) was used to form the data in the packet. This makes it possible to change the type of coding during the conference, for example when a new participant joins over a low-bandwidth link or when the network is congested.

On the Internet, as in other packet-switched data networks, packets are sometimes lost, reordered, and delayed by varying amounts of time. To counteract this, the RTP header contains a timestamp and a sequence number that allow receivers to reconstruct the sender's timing so that, for example, chunks of the audio signal are played out continuously every 20 ms. This timing reconstruction is performed separately and independently for each source of RTP packets in the teleconference. The sequence number can also be used by the receiver to estimate the number of lost packets.
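The use of sequence numbers for reordering and loss estimation can be sketched as follows. This is an illustrative assumption about how a receiver might do it, not a mandated algorithm. The helper names are hypothetical, and the code assumes that all packets in the batch lie within half of the 16-bit sequence space of the first one.

```python
def seq_delta(a: int, b: int) -> int:
    """Signed distance from b to a in 16-bit sequence space (handles wraparound)."""
    return ((a - b + 0x8000) & 0xFFFF) - 0x8000

def reorder(packets):
    """Sort (seq, payload) pairs from one source into sending order."""
    base = packets[0][0]
    return sorted(packets, key=lambda p: seq_delta(p[0], base))

def count_lost(packets) -> int:
    """Estimate losses as packets expected minus distinct packets received."""
    ordered = reorder(packets)
    expected = seq_delta(ordered[-1][0], ordered[0][0]) + 1
    return expected - len({seq for seq, _ in ordered})

# Packets arrive out of order across the 65535 -> 0 wraparound; seq 1 never arrives.
received = [(65534, b"a"), (0, b"c"), (65535, b"b"), (2, b"e")]
ordered_seqs = [seq for seq, _ in reorder(received)]
lost = count_lost(received)
print(ordered_seqs, lost)
```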

Since participants can join and leave a teleconference while it is in progress, it is useful to know who is participating at any moment and how well the participants are receiving the audio data. For this purpose, each instance of the audio application periodically sends a reception report, together with its user name, on the control port (RTCP port) to the applications of all other participants. The reception report indicates how well the current speaker is being received and can be used to control adaptive encoders. In addition to the user name, other identifying information may be included, subject to control bandwidth limits. When leaving the conference, a site sends an RTCP BYE packet.

If a teleconference uses both audio and video, they are transmitted separately. To carry each type of traffic independently of the other, the protocol specification introduces the concept of an RTP session. A session is identified by a specific pair of destination transport addresses (one network address plus a pair of ports for RTP and RTCP). Packets for each type of traffic are transmitted using two different pairs of UDP ports and / or multicast addresses. There is no direct RTP-level coupling between the audio and video sessions, except that a user participating in both sessions must use the same canonical name in the RTCP packets of both sessions so that the sessions can be associated.

One reason for this separation is that some conference participants should be allowed to receive only one type of traffic if they wish. Despite the separation, synchronous reproduction of source media (audio and video) can be achieved using timing information that is carried in RTCP packets for both sessions.

Not all sites are always able to receive multimedia data in the same format. Consider a case where participants at one location are connected over a low-speed link to the majority of conference participants, who have broadband access to the network. Rather than forcing everyone to use a lower-bandwidth, lower-quality audio coding, an RTP-level relay called a mixer can be placed near the low-bandwidth area. The mixer resynchronizes incoming audio packets to restore the original 20 ms intervals, mixes the reconstructed audio streams into a single stream, encodes the audio for the narrow bandwidth, and transmits the packet stream over the low-speed link. These packets can be addressed to a single recipient or to a group of recipients with different addresses. So that receiving endpoints can correctly indicate the source of what they hear, the RTP header includes a means for mixers to identify the sources that contributed to a mixed packet.

Some audio conference participants may be connected by broadband links but may not be reachable through IP multicast (IPM). For example, they might be behind an application-level firewall that does not let any IP packets through. Such cases call not for a mixer but for another kind of RTP-level relay, called a translator. Two translators are used. The one installed outside the firewall forwards all multicast packets it receives over a secure connection to the other translator behind the firewall. The translator behind the firewall sends them out again as multicast packets to a multicast group restricted to the site's internal network.

Mixers and translators can be designed for a number of purposes. One example is a video mixer that scales the images of individual people arriving in separate video streams and composites them into a single video stream, simulating a group scene.

All fields of RTP / RTCP packets are transmitted over the network as bytes (octets), most significant byte first. All header fields are aligned to their natural length. Octets designated as padding have a value of zero.

The RTCP control protocol (Real-Time Control Protocol) is based on the periodic transmission of control packets to all participants in a session, using the same distribution mechanism as RTP. The lower layer protocol must provide multiplexing of data and control packets, for example by using different UDP port numbers. RTCP has four main functions.

The primary function is to provide feedback on the quality of data distribution. This is an integral part of RTP's role as a transport protocol and is related to the flow control and congestion control functions of other transport protocols. Feedback may be directly useful for controlling adaptive coding, but experiments with IP multicasting have shown that feedback from receivers is also important for diagnosing faults in distribution. Sending reception reports to all participants allows anyone who observes a problem to judge whether it is local or global. With a distribution mechanism such as IP multicast, entities such as network service providers can also receive the feedback and act as third-party monitors in diagnosing network problems. This feedback function is performed by RTCP sender and receiver reports.
RTCP carries a persistent transport-level identifier for an RTP data source, called the "canonical name" (CNAME). Because the SSRC identifier can change if a conflict is detected or a program is restarted, receivers need the CNAME to keep track of each participant. Receivers also need the CNAME to associate multiple data streams from a given participant across a set of related RTP sessions, for example when synchronizing audio and video.
The first two functions require all participants to send RTCP packets, so the sending rate must be controlled for RTP to scale to a large number of participants. Because each participant sends its control packets to all the others, each can independently estimate the total number of participants.
A fourth, optional function is to convey minimal session control information (for example, participant identification) for display in the user interface. This is most likely to be useful in "loosely controlled" sessions where participants join and leave without membership control or parameter negotiation.

Functions one through three are mandatory when RTP is used with IP multicast and are recommended in all other cases. RTP application developers are encouraged to avoid mechanisms that work only in two-party unicast mode and do not scale to larger numbers of users.

RTP allows an application to scale automatically across session sizes from a few participants to several thousand. For example, in audio conferencing, data traffic is essentially self-limiting because only one or two people talk at a time, and with multicast distribution the data rate on any link remains relatively constant regardless of the number of participants. Control traffic, however, is not self-limiting. If each participant sent reception reports at a constant rate, control traffic would grow linearly with the number of participants. A mechanism must therefore be provided to reduce the transmission frequency of control packets as the session grows.

For each session, it is assumed that data traffic meets an aggregated limit, called the bandwidth of the session, that is shared by all participants. This bandwidth can be reserved and limited by the network. The session bandwidth is independent of the media encoding type, but the selection of the encoding type may be limited by the bandwidth of the communication session. The session bandwidth parameter is expected to be provided by the session management application when it invokes the media application, but the media applications may also set a default based on single sender data bandwidth for the encoding type selected for the session.

Bandwidth calculations for control and data traffic include the underlying transport and network layer protocols (such as UDP and IP). Data link layer headers are not taken into account in the calculations, since a packet may be encapsulated with different link-layer headers as it travels.

Control traffic should be limited to a small and known fraction of the session bandwidth. It must be small so that the main function of the transport protocol, data transmission, is not impaired. It must be known so that control traffic can be included in the bandwidth specification given to a resource reservation protocol, and so that each participant can independently calculate its share. It is suggested that the fraction of the session bandwidth allocated to RTCP be set to 5%. All session participants must use the same RTCP bandwidth value so that they all compute the same control packet interval. These constants must therefore be fixed for each profile.

The algorithm for calculating the interval between compound RTCP packets, which divides the bandwidth allocated for control traffic among the participants, has the following main characteristics:

Senders are collectively allocated at least 1/4 of the control traffic bandwidth, so that in sessions with a large number of receivers but a small number of senders, newly joining participants receive the CNAME of the sending sites within a short period of time.
The calculated interval between RTCP packets is required to be at least 5 seconds, in order to avoid bursts of RTCP packets exceeding the allowed bandwidth when the number of participants is small and the traffic is not smoothed according to the law of large numbers.
The interval between RTCP packets is varied randomly between 0.5 and 1.5 times the calculated interval to avoid unintended synchronization of all participants. The first RTCP packet sent after joining a session is also delayed by a random amount (up to half the minimum RTCP interval) in case the application is started at multiple sites at the same time, for example in response to a session announcement.
To adapt automatically to changes in the amount of control information carried, a dynamic estimate of the average compound RTCP packet size is calculated from all packets received and sent.
This algorithm can be used for sessions in which all participants are allowed to send. In this case, the session bandwidth parameter is the product of an individual sender's bandwidth and the number of participants, and the RTCP bandwidth is 5% of that value.

RTP packets carry no length field, so RTP relies on the underlying protocol to indicate the packet length. The maximum length of RTP packets is limited only by the underlying protocols. Multiple RTP packets can be carried in a single lower layer protocol data unit, such as a UDP packet. This reduces header overhead and can simplify synchronization between different streams.
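The interval rules listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the full algorithm of the specification. The function name, its parameters, and the injectable rng argument are hypothetical, and the special delay for the very first packet is omitted.

```python
import random

RTCP_FRACTION = 0.05     # share of session bandwidth given to RTCP
SENDER_FRACTION = 0.25   # share of RTCP bandwidth reserved for senders
MIN_INTERVAL = 5.0       # minimum calculated interval, seconds

def rtcp_interval(members, senders, session_bw_bps, avg_rtcp_size_bytes,
                  we_sent, rng=random.random):
    """Return the randomized delay in seconds before the next compound RTCP packet."""
    rtcp_bw = session_bw_bps * RTCP_FRACTION / 8.0   # bytes per second
    n = members
    # With few senders, they share 1/4 of the RTCP bandwidth and receivers the rest.
    if 0 < senders <= members * SENDER_FRACTION:
        if we_sent:
            rtcp_bw *= SENDER_FRACTION
            n = senders
        else:
            rtcp_bw *= 1.0 - SENDER_FRACTION
            n = members - senders
    calculated = max(avg_rtcp_size_bytes * n / rtcp_bw, MIN_INTERVAL)
    # Spread the result over [0.5, 1.5) of the calculated interval.
    return calculated * (0.5 + rng())

# 64 kbit/s session, 100-byte average compound packet, fixed rng for repeatability.
small = rtcp_interval(4, 1, 64000, 100, we_sent=False, rng=lambda: 0.5)
large = rtcp_interval(1000, 10, 64000, 100, we_sent=False, rng=lambda: 0.5)
sender = rtcp_interval(1000, 10, 64000, 100, we_sent=True, rng=lambda: 0.5)
print(small, large, sender)
```

With four participants the 5-second floor applies. With a thousand participants a receiver reports only every few minutes, while the ten senders keep a much shorter interval thanks to their reserved share.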

The rapid growth of the Internet places new demands on the speed and volume of data transfer. Increasing network capacity alone is not enough to satisfy all of these demands; smart and efficient methods of traffic and line congestion management are required.

In real-time applications, the sender generates a stream of data at a constant rate, and the receiver (or receivers) must provide this data to the application at the same rate. Such applications include, for example, audio and video conferencing, live video, remote diagnostics in medicine, computer telephony, distributed interactive simulation, games, real-time monitoring, etc.

The most widely used transport protocol is TCP. While TCP can support a wide variety of distributed applications, it is not suitable for real-time applications.

The real-time transport protocol RTP (Real-Time Transport Protocol) is intended to solve this problem. It supports the delivery of data to one or more destinations in a form that allows the data to be played back in real time.

RTP itself provides no mechanisms for guaranteed packet delivery, transmission integrity, or connection reliability. All these functions are left to the underlying transport protocol. RTP runs on top of UDP and can support real-time data transfer between multiple participants in an RTP session.

Note

For each RTP participant, a session is determined by a pair of packet destination transport addresses (one network address - IP and a pair of ports: RTP and RTCP).

RTP packets contain the following fields: a sender identifier indicating which participant generated the data, a timestamp recording when the packet was generated so that the receiving side can play the data at the correct intervals, information about the order of transmission, and information about the nature of the packet content, for example the type of video encoding (MPEG, Indeo, etc.). This information makes it possible to estimate the initial delay and the size of the transmission buffer.

Note

In a typical real-time environment, the sender generates packets at a constant rate. They are sent at regular intervals, traverse the network, and are received by the recipient, which plays back the data in real time as it arrives. However, because network transit delay varies, packets may arrive at irregular intervals. To compensate for this effect, incoming packets are buffered, held for a while, and then delivered at a constant rate to the software that generates the output. Therefore, for a real-time protocol to function, each packet must contain a timestamp so that the receiver can reproduce the incoming data at the same rate as the sender.
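The buffering idea in this note can be illustrated with a small playout schedule calculation. This is a sketch under assumptions: the function name is hypothetical, the buffer delay is fixed rather than adaptive, and the first packet in the list is taken as the timing reference.

```python
def playout_times(packets, clock_rate_hz, buffer_delay_s):
    """Map (arrival_time_s, rtp_timestamp) pairs to scheduled playout times.

    The first packet is held for buffer_delay_s; every later packet is played
    at the same offset from it that the sender's timestamps indicate.
    """
    first_arrival, first_ts = packets[0]
    schedule = []
    for _arrival, ts in packets:
        # Timestamp difference modulo 2**32, converted to seconds.
        media_offset = ((ts - first_ts) & 0xFFFFFFFF) / clock_rate_hz
        schedule.append(first_arrival + buffer_delay_s + media_offset)
    return schedule

# 8 kHz audio in 20 ms chunks (160 timestamp units each), arriving with jitter.
arrivals = [(0.000, 0), (0.035, 160), (0.041, 320), (0.125, 480)]
schedule = playout_times(arrivals, clock_rate_hz=8000, buffer_delay_s=0.060)
late = [arrival > play for (arrival, _), play in zip(arrivals, schedule)]
print(schedule, late)
```

The output is evenly spaced at 20 ms even though the arrivals were not. The last packet arrives after its playout time, so a real receiver would have to conceal it or use a larger buffer.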

Because RTP defines the format of the transmitted payload, it is closely tied to synchronization, for which an RTP relay mechanism, the mixer, is partly responsible. The mixer receives streams of RTP packets from one or more sources, combines them, and sends a new stream of RTP packets to one or more recipients. The mixer can simply combine the data or also change its format, for example when combining multiple audio sources. Suppose a new system wants to join the session but its link to the network does not have enough capacity for all the RTP streams. The mixer then receives all the streams, combines them into one, and forwards the combined stream to the new member of the session. When it receives multiple audio streams, the mixer simply adds the PCM values. The RTP header generated by the mixer includes the identifiers of the senders whose data is present in the packet.
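Adding PCM values can be sketched as below. The clipping to the 16-bit range and the function name mix_frames are assumptions added for illustration; the text only says that the values are added. The SSRC values are arbitrary example numbers.

```python
INT16_MIN, INT16_MAX = -32768, 32767
MAX_CSRC = 15   # the CSRC list in the RTP header holds at most 15 entries

def mix_frames(frames_by_ssrc):
    """Mix time-aligned 16-bit PCM frames keyed by source SSRC.

    Returns the mixed samples and the list of contributing source identifiers.
    """
    frames = list(frames_by_ssrc.values())
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        # Clip instead of letting the sum overflow the 16-bit sample range.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    csrc_list = list(frames_by_ssrc.keys())[:MAX_CSRC]
    return mixed, csrc_list

frames = {
    0x11111111: [1000, -2000, 30000],
    0x22222222: [500, -500, 10000],
}
mixed, csrc_list = mix_frames(frames)
print(mixed, [hex(s) for s in csrc_list])
```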

A simpler device, a translator, creates one outgoing RTP packet for each incoming RTP packet. This mechanism can change the format of the data in the packet or use a different set of low-level protocols to transfer data from one domain to another. For example, a potential recipient may not be able to process high-speed video used by other participants in the session. The translator converts the video to a lower quality format that requires a lower bit rate.

RTP is used only to transmit user data, usually by multicast, to all participants in the session. RTP works together with RTCP (Real-time Transport Control Protocol), whose main task is to provide control over the RTP transmission. RTCP uses the same underlying transport protocol as RTP (usually UDP), but a different port number.

RTCP serves several functions:

Quality of service monitoring and congestion feedback. Since RTCP packets are multicast, all participants in the session can assess how well the other participants are sending and receiving. Sender reports allow recipients to gauge the data rate and transmission quality. Receiver reports contain information about the problems receivers encounter, including packet loss and excessive jitter. Feedback from recipients is also important for diagnosing distribution errors. By analyzing the reports of all participants in the session, a network administrator can determine whether a given problem concerns one participant or is of a general nature. If the sending application concludes that the problem affects the system as a whole, for example because one of the communication links has failed, it can increase the compression ratio at the cost of quality, or even stop transmitting video, so that the data fits a low-capacity connection.
Sender identification. RTCP packets contain a standard textual description of the sender. They provide more information about the originator of data packets than a randomly chosen synchronization source identifier. In addition, they help the user identify streams from different sessions.
Session size estimation and scaling. To provide quality of service feedback, congestion feedback, and sender identification, all participants periodically send RTCP packets. The transmission frequency of these packets decreases as the number of participants increases. With a small number of participants, at most one RTCP packet is sent every 5 seconds. RFC-1889 describes an algorithm by which participants limit the frequency of RTCP packets based on the total number of participants. The goal is for RTCP traffic not to exceed 5% of the total session traffic.
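Receiver reports include a jitter figure. As an illustration, the running estimator defined in the RTP specification (each new sample is weighted 1/16) can be sketched as follows. The function name and the example values are assumptions, and arrival times are expressed in the same units as the RTP timestamp.

```python
def update_jitter(jitter, prev_transit, arrival_ts, rtp_ts):
    """One step of the interarrival jitter estimate J += (|D| - J) / 16.

    Returns the new jitter and the transit value to pass to the next call.
    """
    transit = arrival_ts - rtp_ts
    if prev_transit is None:
        # First packet: nothing to compare against yet.
        return jitter, transit
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0, transit

# (arrival time, RTP timestamp), both in 8 kHz clock units; third packet is 80 units late.
samples = [(1000, 0), (1160, 160), (1400, 320)]
jitter, transit = 0.0, None
for arrival_ts, rtp_ts in samples:
    jitter, transit = update_jitter(jitter, transit, arrival_ts, rtp_ts)
print(jitter)
```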

RTP is a stream-oriented protocol. The RTP packet header was created with the needs of real-time transmission in mind. It contains information about the order of the packets so that the data stream is correctly assembled at the receiving end, and a timestamp for correct frame interleaving during playback and for synchronizing multiple data streams, such as video and audio.

Each RTP packet has a main header as well as possibly additional application-specific fields.

Using TCP as the transport protocol for these applications is impractical for several reasons:

TCP only supports a connection between two endpoints and is therefore not suitable for multicast transmission.
TCP retransmits lost segments, which then arrive when the real-time application is no longer waiting for them.
TCP has no convenient mechanism for attaching timing information to segments, which is an additional requirement of real-time applications.

Another widely used transport layer protocol, UDP, does not have some of the limitations of TCP, but it also does not provide the critical timing information.

While each real-time application may have its own mechanisms to support real-time transmission, they have many similarities, which makes it highly desirable to define a single protocol.

The real-time transport protocol RTP (Real-time Transport Protocol) is intended to solve this problem. It supports the delivery of data to one or more recipients in a form that allows the data to be played back in real time.

Fig. 1 shows the fixed RTP header, which contains a number of fields identifying elements such as the packet format, sequence number, sources, boundaries, and payload type. The fixed header may be followed by other fields containing additional information about the data.

Fig. 1. Fixed RTP header.

V (2 bits). Version field. The current version is 2.
P (1 bit). Padding field. This field signals the presence of padding octets at the end of the payload. Padding is used when the application requires the payload size to be a multiple of, for example, 32 bits. In this case, the last octet gives the number of padding octets.
X (1 bit). Header extension field. When this bit is set, the main header is followed by one additional header, used in experimental RTP extensions.
CC (4 bits). Sender count field. This field contains the number of identifiers of the senders whose data is in the packet; the identifiers themselves follow the main header.
M (1 bit). Marker field. The meaning of the marker bit depends on the payload type. The marker bit is usually used to indicate boundaries in the data stream. For video, it marks the end of a frame. For voice, it marks the start of speech after a period of silence.
PT (7 bits). Payload type field. This field identifies the payload type and data format, including compression and encryption. In a steady state, the sender uses only one payload type per session, but it can change it in response to changing conditions if signaled by the Real-Time Transport Control Protocol.
Sequence Number (16 bits). Sequence number field. Each source starts numbering packets from a random number, which is then incremented by one for each RTP data packet sent. This makes it possible to detect packet loss and to determine the order of packets with the same timestamp. Several consecutive packets can have the same timestamp if they are logically generated at the same moment, such as packets belonging to the same video frame.
Timestamp (32 bits). Timestamp field. This field contains the point in time at which the first octet of payload data was generated. The units in which the time is expressed depend on the payload type. The value is determined by the local clock of the sender.
Synchronization Source (SSRC) Identifier (32 bits). Synchronization source identifier field: a randomly generated number that uniquely identifies the source during the session and is independent of the network address. This number plays an important role in processing the incoming data from a single source.
Contributing Source (CSRC) Identifiers (32 bits each). A list of identifiers of the sources "mixed" into the main stream, for example by a mixer. The mixer inserts the list of SSRC identifiers of the sources that took part in constructing this RTP packet; this list is called the CSRC list. It holds from 0 to 15 items; if there are more than 15 contributing sources, only the first 15 are listed. An example is an audio conference in which each RTP packet combines the speech of several participants, each with its own SSRC; these form the CSRC list, while the mixed stream as a whole carries a single SSRC.
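The field layout listed above maps directly onto a 12-byte structure followed by the CSRC list. The following is a minimal parsing sketch; the function name and the sample packet values are assumptions for illustration, and header extensions and padding are reported but not processed.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed RTP header and the CSRC list (hypothetical helper)."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 bytes")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("truncated CSRC list")
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": cc,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc": list(struct.unpack("!%dI" % cc, packet[12:end])),
        "payload": packet[end:],
    }

# Sample packet: V=2, P=0, X=0, CC=1, M=1, PT=0, one CSRC, 4 payload bytes.
sample = struct.pack("!BBHII", 0x81, 0x80, 4660, 160, 0xDEADBEEF)
sample += struct.pack("!I", 0x01020304) + b"\x00" * 4
fields = parse_rtp_header(sample)
print(fields)
```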

The RTCP protocol, like any control protocol, is much more complex both in structure and in the functions it performs (compare, for example, IP and TCP). Although RTCP is based on RTP, it contains many additional fields that it uses to implement its functions.

Resource Reservation Protocol (RSVP), which is currently under consideration by the Internet Engineering Task Force (IETF), addresses the priority issue for latency-sensitive data, as opposed to traditional data where latency is less critical. RSVP allows end systems to reserve network resources to obtain the required quality of service, especially resources for real-time traffic via RTP. RSVP is primarily concerned with routers, although applications at the endpoints also need to know how to use RSVP to reserve the necessary bandwidth for a given class of service or priority level.

RTP, together with other described standards, allows you to successfully transfer video and audio over conventional IP networks. RTP / RTCP / RSVP is a standardized solution for real-time data networks. Its only drawback is that it is only intended for IP networks. However, this limitation is temporary: networks will somehow develop in this direction. This solution promises to solve the problem of transmitting delay-sensitive data over the Internet.

A description of the RTP protocol can be found in RFC-1889.

RTP and RTCP in VoIP

RTP is the main transport protocol in IP telephony networks. RTP (Real-time Transport Protocol) was created to carry multimedia information (audio, video), encoded and packed into packets, over IP networks within strict time limits. RTP segments are carried over UDP and IP, which sit at different layers of the OSI model. UDP, which does not guarantee delivery, is used because of the strict timing requirements of real-time media and because TCP cannot operate in real time. In this case, timely delivery matters more than the loss of part of the data.
In general, the protocols are distributed over the layers of the OSI model as follows:
Transport: RTP over UDP
Network: IP
Data link: Ethernet
Physical: Ethernet
When transmitting multimedia information using the RTP protocol, the following encapsulation is used:

The minimum RTP header size is 12 bytes. The first two bits define the protocol version (V); version 2 is in use today. The next field, P, is 1 bit long and indicates that padding bytes are present at the end of the payload, for example when packets of equal length are required. The 1-bit X field indicates whether a header extension is used. The 4-bit CC field gives the number of CSRC identifiers that follow the fixed header, i.e. the number of sources that contribute to the stream. Then comes the M field, a 1-bit marker used to flag significant data. The next field, PT, is 7 bits long and identifies the payload type. From this code the application determines the type of multimedia information and the decoding method.
The rest of the header consists of the Sequence Number field (16 bits), which is used to track the order of packets and detect their loss; the Timestamp field (32 bits), which gives the sampling time of the first coded sample in the payload and is used by playout buffers to remove the quality loss caused by delay variation; and the SSRC field (32 bits), a randomly chosen number that identifies the synchronization source and distinguishes one stream from another within an RTP session. After the fixed part of the RTP header, up to 15 32-bit CSRC fields may follow, identifying the contributing sources.
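The header layout described above can be decoded with a few bit operations. The following is a minimal sketch rather than a complete RTP implementation: the names `RtpHeader` and `parse_rtp_header` and the sample packet are made up for illustration, and header extensions and padding are only reported, not processed.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrc: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header and the CSRC list that may follow it."""
    if len(packet) < 12:
        raise ValueError("an RTP packet is at least 12 bytes long")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F                                  # 4-bit CSRC count
    if len(packet) < 12 + 4 * cc:
        raise ValueError("packet is too short for the announced CSRC list")
    csrc = struct.unpack("!%dI" % cc, packet[12:12 + 4 * cc])
    return RtpHeader(
        version=b0 >> 6,                            # 2 bits
        padding=bool(b0 & 0x20),                    # 1 bit
        extension=bool(b0 & 0x10),                  # 1 bit
        csrc_count=cc,
        marker=bool(b1 & 0x80),                     # 1 bit
        payload_type=b1 & 0x7F,                     # 7 bits
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrc=csrc,
    )


# Hypothetical packet: version 2, no padding/extension/CSRC, payload type 0
# (PCMU), followed by 160 bytes of dummy payload.
sample = struct.pack("!BBHII", 0x80, 0x00, 1, 160, 0x12345678) + b"\x00" * 160
header = parse_rtp_header(sample)
print(header.version, header.payload_type, header.sequence_number)
```

The payload starts at offset `12 + 4 * header.csrc_count`, provided the X bit is not set.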
Let's describe the procedure for establishing an RTP session. The protocol specifies that traffic of different types is carried in separate sessions. To establish a session, a pair of destination transport addresses must be defined, i.e. one network address and two ports, for RTP and RTCP. So for a video conference, audio and video are transmitted in different sessions with correspondingly different destination ports. Interleaving different types of traffic in the same session could cause the following problems:
- if one of the traffic types changes, it is impossible to determine which parameter of the session should be replaced with the new value;
- a source in a session has a single timing and sequence number space, whereas heterogeneous traffic needs a separate timing space for each type, because their clock rates differ;
- an RTP mixer cannot combine interleaved streams of different traffic types into one stream;
- carrying several types of traffic in one RTP session also rules out: the use of different network paths or network resource allocations; reception of only a subset of the media when desired, for example only audio if video would exceed the available bandwidth; and receiver implementations that use separate processes for different types of traffic, whereas separate RTP sessions allow both single-process and multiple-process implementations.

However, RTP (Real-time Transport Protocol) and UDP (User Datagram Protocol) do not guarantee quality, i.e. they provide no QoS (Quality of Service) mechanisms of their own. Therefore, RTP is supported by the Real-Time Transport Control Protocol (RTCP), which provides additional information about the state of an RTP session.
RTCP has four main functions:
I. The primary purpose of RTCP is to provide feedback on the quality of data delivery. This feedback can be used directly by applications with adaptive encoding. With IP multicast it is also essential for receivers to be able to diagnose faults in packet delivery. Reception reports allow the sender to determine the cause of delivery problems, if there are any.
II. RTCP carries a persistent transport-level identifier for an RTP source, called the canonical name (CNAME). Since the SSRC identifier may change if a collision is detected, receivers need the CNAME to keep track of each participant. Receivers also use the CNAME to associate several data streams from one participant across simultaneous sessions, for example to synchronize the audio and video channels when video is transmitted with sound.
III. The first two functions require all session participants to send RTCP packets, so the sending rate must be controlled for RTP to scale to a large number of participants. Because each participant sends its control packets to all the others, each one can independently observe the total number of participants in the session. This number is used to calculate the rate at which RTCP packets are sent.
IV. An optional fourth function is to convey minimal session control information, such as a participant identifier to be shown in the graphical user interface. This is most useful in loosely controlled sessions, where participants enter and leave without membership control or parameter negotiation. RTCP serves as a convenient channel for reaching all participants, but it does not necessarily support all of an application's control communication requirements.
In IP multicast networks, functions one, two and three are mandatory for RTP sessions. They are also recommended in other networks and environments. Developers of RTP applications are advised to use mechanisms that work in multicast mode and not only in unicast.
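The way the participant count feeds into the RTCP sending rate (function III) can be illustrated with a simplified calculation. This sketch assumes the values recommended in the RTP specification, namely that control traffic takes about 5% of the session bandwidth and that the interval is not shorter than 5 seconds. It deliberately omits the randomization and the separate sender / receiver shares that the full algorithm uses, and the function name and parameters are illustrative only.

```python
def rtcp_interval(members: int,
                  avg_rtcp_packet_bytes: float,
                  session_bandwidth_bps: float,
                  rtcp_fraction: float = 0.05,     # assumed share for RTCP
                  min_interval_s: float = 5.0) -> float:
    """Deterministic part of the RTCP reporting interval, in seconds."""
    rtcp_bandwidth_bps = session_bandwidth_bps * rtcp_fraction
    # All members share the RTCP bandwidth, so the interval grows with the group.
    interval = members * avg_rtcp_packet_bytes * 8 / rtcp_bandwidth_bps
    return max(interval, min_interval_s)


# Hypothetical 64 kbit/s audio session with 100-byte RTCP packets.
for n in (2, 100, 1000):
    print(n, rtcp_interval(n, 100, 64_000))
```

With these assumed numbers, two participants report at the 5-second minimum, while a thousand participants each report only about once every 250 seconds, so the total control traffic stays roughly constant.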
Let's take a look at the RTCP packet format.
The protocol standard defines several types of RTCP packets for carrying control information:
SR: Sender report. Carries transmission and reception statistics from session participants that are active senders;
RR: Receiver report. Carries reception statistics from participants that are not active senders;
SDES: Source description items, including CNAME;
BYE: Indicates that a participant is leaving the session;
APP: Application-specific functions.
Each RTCP packet begins with a fixed part, similar to the one used by RTP data packets, followed by elements whose length can vary with the packet type but is always a multiple of 32 bits. The alignment requirement and the length field in the fixed part of the header make RTCP packets stackable. Multiple RTCP packets can be concatenated without any separators to form a compound RTCP packet, which is sent in a single packet of a lower-layer transport protocol such as UDP. There is no explicit count of the individual RTCP packets, since the lower-layer protocol provides the overall length and thereby determines the end of the compound packet.

The format of the RTCP sender report (SR) packet is shown in the figure.

RTCP packets are subject to the following validity checks.
- The RTP version field must be equal to 2.
- The payload type field of the first RTCP packet in a compound packet must be SR or RR.
- The padding bit (P) must be zero for the first packet of a compound RTCP packet, since padding may only be present in the last one.
- The length fields of the individual RTCP packets must add up to the total length of the compound packet.
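The four checks above map almost directly onto code. The sketch below assumes the standard RTCP common header (one byte with version, padding bit and a count, one byte of packet type, and a 16-bit length expressed in 32-bit words minus one) and the packet type numbers 200 for SR and 201 for RR. The function name and the sample packets are made up for illustration.

```python
import struct

RTCP_SR, RTCP_RR = 200, 201


def validate_compound_rtcp(datagram: bytes) -> bool:
    """Apply the validity checks listed above to a compound RTCP packet."""
    offset = 0
    first = True
    while offset < len(datagram):
        if len(datagram) - offset < 4:
            return False                      # truncated common header
        b0, packet_type, length_words = struct.unpack_from("!BBH", datagram, offset)
        if b0 >> 6 != 2:
            return False                      # version must be 2
        if first and packet_type not in (RTCP_SR, RTCP_RR):
            return False                      # first packet must be SR or RR
        if first and b0 & 0x20:
            return False                      # no padding on the first packet
        offset += (length_words + 1) * 4      # length is in 32-bit words minus one
        first = False
    # The individual lengths must add up exactly to the datagram length.
    return not first and offset == len(datagram)


# Hypothetical packets: an empty receiver report followed by a BYE.
rr = struct.pack("!BBHI", 0x80, 201, 1, 0xDEADBEEF)
bye = struct.pack("!BBHI", 0x81, 203, 1, 0xDEADBEEF)
print(validate_compound_rtcp(rr + bye))       # report first
print(validate_compound_rtcp(bye + rr))       # BYE first
```

The first call passes the checks; the second fails because the compound packet does not start with SR or RR.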

If one day you have to quickly figure out what VoIP (voice over IP) is and what all these wild abbreviations mean, I hope this tutorial will help. I note right away that the issues of configuring additional types of telephony services (such as call transfer, voice mail, conference calls, etc.) are not considered here.

So, here is what this article covers:

Basic telephony concepts: types of devices, connection diagrams
SIP / SDP / RTP protocol bundle: how it works
How information about pressed buttons is transmitted
How voice and faxes are transmitted
Digital signal processing and sound quality assurance in IP telephony

In general, the scheme for connecting a local subscriber to a telephone provider via a regular telephone line is as follows:

A telephone module with an FXS (Foreign eXchange Subscriber) port is installed on the provider's side (PBX). At home or in the office, a telephone or fax with an FXO (Foreign eXchange Office) port and a dialer module are installed.

In appearance the FXS and FXO ports do not differ in any way: they are ordinary 6-pin RJ11 connectors. But they are very easy to tell apart with a voltmeter: there is always some voltage on the FXS port, 48/60 V when on-hook or 6-15 V during a call. On an FXO port that is not connected to a line, the voltage is always 0.

To transfer data over a telephone line on the provider side, additional logic is needed, which can be implemented on the SLIC (subscriber line interface circuit) module, and on the subscriber side - using the DAA (Direct Access Arrangement) module.

Nowadays, cordless DECT phones (Digital Enhanced Cordless Telecommunications) are quite popular. By design they are similar to ordinary telephones: they also have an FXO port and a dialer module, plus a radio module that links the base station and the handsets at 1.9 GHz.

Subscribers are connected to the PSTN (Public Switched Telephone Network), the public telephone network. The PSTN can be built with different technologies: ISDN, optical fiber, POTS, Ethernet. POTS (Plain Old Telephone Service) is the special case of the PSTN that uses a regular analog copper line.

With the development of the Internet, telephone communication has moved to a new level. Fixed telephones are used less and less, mainly for business needs. DECT phones are a little more convenient, but are limited to the perimeter of the house. GSM phones are even more convenient, but they are limited to one country (roaming is expensive). IP phones, also known as softphones (SoftPhone), have no restrictions other than the need for Internet access.

Skype is the most famous example of a softphone. It can do a lot of things, but it has two important drawbacks: a closed architecture and wiretapping by who knows which authorities. Because of the first, there is no way to create your own small telephone network. Because of the second, it is not very pleasant to be spied on, especially during personal and commercial conversations.

Fortunately, there are open protocols for building your own communication networks with extra features: SIP and H.323. There are somewhat more softphones based on SIP than on H.323, which can be explained by its relative simplicity and flexibility. But sometimes this flexibility can put a spoke in the wheel. Both SIP and H.323 use the RTP protocol to transfer media.

Let's consider the basic principles of the SIP protocol in order to understand how two subscribers are connected.

SIP (Session Initiation Protocol) is a protocol for establishing a session (not only a telephone one). It is a text-based protocol over UDP. It is also possible to use SIP over TCP, but these are rare cases.

SDP (Session Description Protocol) is a protocol for negotiating the type of transmitted data (for audio and video, these are codecs and their formats, for faxes - transmission speed and error correction) and their destination addresses (IP and port). It is also a text-based protocol. SDP parameters are transmitted in the body of SIP packets.

RTP (Real-time Transport Protocol) is an audio / video data transfer protocol. It is a binary protocol over UDP.

General structure of SIP packets:

Start-Line: a field indicating the SIP method (command) when requested or the result of executing a SIP method when responding.
Headers: additional information to the Start-Line, formatted as lines containing ATTRIBUTE: VALUE pairs.
Body: binary or text data. Typically used to transfer SDP parameters or messages.
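The three-part structure above (Start-Line, Headers, Body) can be shown with a small parser. This is only a sketch: the function name is made up, the message is an abbreviated INVITE with hypothetical addresses (a real one carries more mandatory headers), and folded or compact header forms are not handled.

```python
def parse_sip_message(raw: str):
    """Split a SIP message into (start_line, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")     # an empty line ends the headers
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = []                                  # a list, since headers may repeat
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body


# Abbreviated, hypothetical INVITE used only to exercise the parser.
raw = (
    "INVITE sip:555@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060\r\n"
    "From: <sip:777@example.com>\r\n"
    "To: <sip:555@example.com>\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Type: application/sdp\r\n"
    "\r\n"
    "v=0\r\n"
)
start_line, headers, body = parse_sip_message(raw)
print(start_line)
print(dict(headers)["CSeq"])
```

Headers are kept as a list of pairs rather than a dictionary because some of them, such as Via, can legitimately appear more than once.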

Here is an example of two SIP packets for one common procedure - call setup:

On the left is the contents of the SIP INVITE packet, on the right - the response to it - SIP 200 OK.

The main fields are highlighted with frames:

Method / Request-URI contains SIP method and URI. In the example, a session is established - the INVITE method, a call to the subscriber [email protected]
Status-Code - response code to the previous SIP command. In this example, the command completed successfully - code 200, i.e. subscriber 555 picked up the phone.
Via - the address at which the subscriber 777 is waiting for an answer. For a 200 OK message, this field is copied from the INVITE message.
From / To - The display name and address of the sender and recipient of the message. For a 200 OK message, this field is copied from the INVITE message.
CSeq contains the sequence number of the command and the name of the method this message belongs to. For a 200 OK message, this field is copied from the INVITE message.
Content-Type is the type of data that is transmitted in the Body block, in this case, SDP data.
Connection Information - IP address to which the second subscriber needs to send RTP packets (or UDPTL packets in case of fax transmission over T.38).
Media Description - the port to which the second subscriber should transmit the specified data. In this case, this is sound (audio RTP / AVP) and a list of supported data types - PCMU, PCMA, GSM codecs and DTMF signals.

An SDP message consists of lines containing FIELD=VALUE pairs. The main fields are:

o - Origin, the name of the session organizer and the session ID.
c - Connection Information, the field described earlier.
m - Media Description, the field described earlier.
a - media attributes, which specify the format of the transmitted data. For example, they indicate the direction of the sound, reception or transmission (sendrecv), and for codecs the sampling rate and the payload type number (rtpmap).
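Because SDP is just a sequence of FIELD=VALUE lines, extracting the RTP destination takes very little code. The sketch below uses a made-up SDP body with a documentation-range IP address and an arbitrary port; `parse_sdp` and `audio_target` are illustrative names, and only a session-level c= line and a single audio m= line are considered.

```python
def parse_sdp(text: str):
    """Return the SDP lines as (field, value) pairs in their original order."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        field, _, value = line.partition("=")
        pairs.append((field, value))
    return pairs


def audio_target(pairs):
    """Pick the IP address (c=) and port (m=audio) that RTP should be sent to."""
    ip = port = None
    for field, value in pairs:
        if field == "c":
            ip = value.split()[-1]            # "IN IP4 192.0.2.10" -> address
        elif field == "m" and value.startswith("audio"):
            port = int(value.split()[1])      # "audio 49170 RTP/AVP 0 8 3 98"
    return ip, port


# Hypothetical SDP body offering PCMU, PCMA, GSM and DTMF events.
sdp = """v=0
o=777 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0 8 3 98
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:3 GSM/8000
a=rtpmap:98 telephone-event/8000
a=sendrecv
"""
print(audio_target(parse_sdp(sdp)))
```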

RTP packets contain audio / video data encoded in a specific format. This format is specified in the PT (payload type) field. For a table mapping the values of this field to specific formats, see the Wikipedia article "RTP audio video profile".

In addition, RTP packets carry a unique SSRC identifier (it identifies the source of the RTP stream) and a timestamp (used for evenly paced playback of sound or video).

An example of interaction between two SIP subscribers via a SIP server (Asterisk):

As soon as the SIP phone starts up, the first thing it does is register on the remote server (SIP Registrar) by sending it a SIP REGISTER message.

When a subscriber is called, a SIP INVITE message is sent, the body of which contains an SDP message, which specifies the audio / video transmission parameters (which codecs are supported, to which IP and port to send sound, etc.).

When the remote subscriber picks up the phone, we receive a SIP 200 OK message, also with SDP parameters, this time those of the remote subscriber. Using the sent and received SDP parameters, an RTP audio / video session or a T.38 fax session can be established.

If the received SDP parameters do not suit us, or the intermediate SIP server decides not to pass RTP traffic through itself, then the SDP renegotiation procedure, the so-called REINVITE, is performed. Incidentally, it is precisely because of this procedure that free SIP proxy servers have one drawback: if both subscribers are in the same local network and the proxy server is behind NAT, then after the RTP traffic is redirected neither subscriber will hear the other.

After the end of the conversation, the subscriber who hangs up the phone sends a SIP BYE message.

Sometimes, after a session has been established, access is needed during the conversation to value-added services (VAS) such as call hold, transfer or voice mail, which react to certain combinations of pressed buttons.

So, in a regular telephone line, there are two ways to dial a number:

Pulse - historically the first, was used mainly in telephones with a rotary dialer. Dialing occurs due to sequential closing and opening of the telephone line according to the dialed digit.
Tone - dialing a number with DTMF codes (Dual-Tone Multi-Frequency) - each phone button has its own combination of two sinusoidal signals (tones). By executing Goertzel's algorithm, it is quite easy to determine the pressed button.
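To show how Goertzel's algorithm identifies a pressed button, here is a minimal sketch. It relies on the standard DTMF frequency grid (rows 697, 770, 852, 941 Hz and columns 1209, 1336, 1477, 1633 Hz) and simply picks the strongest row and column frequency. The function names are made up, the test tone is synthetic, and a real detector would also check signal level, twist and duration, which is omitted here.

```python
import math

ROWS = (697, 770, 852, 941)          # low-group frequencies, Hz
COLS = (1209, 1336, 1477, 1633)      # high-group frequencies, Hz
KEYS = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq_hz, sample_rate):
    """Signal power at one frequency, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_dtmf(samples, sample_rate=8000):
    """Return the key whose row and column tones are the strongest."""
    row = max(ROWS, key=lambda f: goertzel_power(samples, f, sample_rate))
    col = max(COLS, key=lambda f: goertzel_power(samples, f, sample_rate))
    return KEYS[ROWS.index(row)][COLS.index(col)]


def dtmf_tone(key, sample_rate=8000, duration_s=0.05):
    """Synthesize a test tone for one key (only used to exercise the detector)."""
    row_index = next(i for i, keys in enumerate(KEYS) if key in keys)
    f_row, f_col = ROWS[row_index], COLS[KEYS[row_index].index(key)]
    count = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * f_row * n / sample_rate)
            + math.sin(2 * math.pi * f_col * n / sample_rate)
            for n in range(count)]


print(detect_dtmf(dtmf_tone("5")))
```

Only eight frequencies have to be evaluated per block of samples, which is why Goertzel's algorithm is preferred here over a full FFT.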

During a call, the pulse method is inconvenient for transmitting the pressed button. So, it takes about 1 second to transmit "0" (10 pulses of 100 ms: 60 ms - line break, 40 ms - line closure) plus 200 ms for a pause between digits. In addition, characteristic clicks will often be heard during pulse dialing. Therefore, in ordinary telephony, only the tone mode of access to the VAS is used.

In VoIP telephony, information about pressed buttons can be transmitted in three ways:

DTMF Inband - generation of an audio tone and its transmission within audio data (current RTP channel) - this is a normal tone dialing.
RFC2833 - a special RTP packet, telephone-event, is generated, which contains information about the pressed key, volume and duration. The number of the RTP format in which DTMF RFC2833 packets will be transmitted is specified in the body of the SDP message. For example: a=rtpmap:98 telephone-event/8000.
SIP INFO - a SIP INFO packet is formed with information about the pressed key, volume and duration.
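For the RFC2833 method, the key, volume and duration are packed into a 4-byte payload: an 8-bit event code, an end-of-event bit, a reserved bit, a 6-bit volume and a 16-bit duration in timestamp units. The sketch below covers only the sixteen DTMF events (codes 0-15); the function names and the sample values are illustrative.

```python
import struct

# Event codes 0-9 are the digits, 10 is "*", 11 is "#", 12-15 are A-D.
EVENT_CODES = "0123456789*#ABCD"


def pack_telephone_event(key, end, volume, duration):
    """Build the 4-byte telephone-event payload.

    volume is the tone level in -dBm0 (0..63), duration is in timestamp units.
    """
    flags = (0x80 if end else 0) | (volume & 0x3F)   # E bit, R bit = 0, volume
    return struct.pack("!BBH", EVENT_CODES.index(key), flags, duration)


def unpack_telephone_event(payload):
    """Return (key, end, volume, duration) from a telephone-event payload."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return EVENT_CODES[event], bool(flags & 0x80), flags & 0x3F, duration


# Hypothetical final packet for key "5": 100 ms at an 8000 Hz clock, -10 dBm0.
payload = pack_telephone_event("5", True, 10, 800)
print(payload.hex())
print(unpack_telephone_event(payload))
```

This payload travels in an ordinary RTP packet whose PT field holds the dynamic number negotiated in SDP (98 in the rtpmap example above).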

DTMF transmission within audio data (Inband) has several drawbacks: the overhead of generating / embedding tones and of detecting them, the limitations of some codecs, which can distort DTMF tones, and poor reliability in transmission (if some of the packets are lost, a single key press may be detected as a double press).

The main difference between DTMF RFC2833 and SIP INFO: if the SIP proxy server allows RTP to flow directly between subscribers, bypassing the server itself (for example, canreinvite=yes in Asterisk), then the server will not see the RFC2833 packets, and as a result VAS services become unavailable. SIP packets are always transmitted through SIP proxy servers, so VAS will always work.

As already mentioned, the RTP protocol is used to transfer media data. RTP packets always indicate the format of the transmitted data (codec).

There are many different codecs for voice transmission, with different bitrate / quality / complexity trade-offs, both open and proprietary. Every softphone is certain to support the G.711 alaw / ulaw codecs: they are very simple to implement and the sound quality is decent, but they require a bandwidth of 64 kbps. The G.729 codec, for example, requires only 8 kbps, but it is very processor intensive and, besides, it is not free.
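The codec bitrates quoted above cover only the encoded audio. On the wire, every packet also carries RTP, UDP and IP headers, which matters most for low-bitrate codecs. The following back-of-the-envelope calculation assumes 20 ms of audio per packet and IPv4 without options (12 + 8 + 20 bytes of headers); both are assumptions rather than something stated in the article, and link-layer overhead is ignored.

```python
def ip_bitrate_kbps(codec_kbps, packet_ms=20, overhead_bytes=12 + 8 + 20):
    """One-way bit rate at the IP layer, in kbit/s.

    overhead_bytes is RTP (12) + UDP (8) + IPv4 (20) headers per packet.
    """
    payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + overhead_bytes) * 8 * packets_per_second / 1000


print("G.711:", ip_bitrate_kbps(64))   # 160-byte payload per packet
print("G.729:", ip_bitrate_kbps(8))    # 20-byte payload per packet
```

Under these assumptions G.711 needs about 80 kbps and G.729 about 24 kbps in each direction, so the headers triple the bandwidth of the low-bitrate codec.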

For fax transmission, either the G.711 codec or the T.38 protocol is usually used. Fax transmission over G.711 is ordinary T.30 fax transmission, as if the fax were sent over a regular telephone line, except that the analog signal from the line is digitized using alaw / ulaw companding. This is also called Inband T.30 fax transmission.

T.30 faxes negotiate their parameters: transmission speed, datagram size, error correction type. The T.38 protocol is based on the T.30 protocol, but unlike Inband transmission, the generated and received T.30 commands are analyzed. In this way, not raw data is transmitted, but recognized fax control commands.

T.38 commands are carried over UDPTL, a UDP-based protocol that is used only for T.38. TCP and RTP can also carry T.38 commands, but they are used much less often.

The main advantages of T.38 are reduced network load and greater reliability compared to Inband fax transmission.

The procedure for sending a fax in T.38 mode is as follows:

A regular voice connection is established using any codec.
When paper is loaded in the sending fax, it periodically sends a T.30 CNG (Calling Tone) signal to indicate that it is ready to send the fax.
On the receiving side, a T.30 CED (Called Terminal Identification) signal is generated - this is the readiness to receive a fax. This signal is sent either after pressing the "Receive fax" button, or the fax does it automatically.
On the sending side, the CED signal is detected and the SIP REINVITE procedure is performed, with the T.38 type indicated in the SDP message: m=image 39164 udptl t38.

It is desirable to transmit faxes over the Internet in T.38. If the fax needs to be transmitted within the office or between objects with a stable connection, then Inband T.30 fax transmission can be used. In this case, before sending a fax, the echo cancellation procedure must be disabled so as not to introduce additional distortions.

The book Fax, Modem, and Text for IP Telephony, by David Hanes and Gonzalo Salgueiro, covers fax transmission in great detail.

We have covered the protocols for establishing a conversation session (SIP / SDP) and the method of transmitting audio over the RTP channel. One important issue remains: sound quality. On the one hand, sound quality is determined by the selected codec. On the other hand, additional DSP (digital signal processing) procedures are still needed. These procedures take into account the peculiarities of VoIP telephony: a high-quality headset is not always used, packets get dropped on the Internet, packets sometimes arrive unevenly, and network bandwidth is not unlimited either.

Basic procedures to improve sound quality:

VAD (voice activity detector) - a procedure for detecting whether a frame contains voice (active voice frame) or silence (inactive voice frame). This separation can significantly reduce the load on the network, since conveying silence requires much less data (it is enough to transmit the noise level, or nothing at all).

Some codecs already include VAD (GSM, G.729); for others (G.711, G.722, G.726) it has to be implemented separately.

If VAD is configured to transmit noise level information, then special SID packets (Silence Insertion Descriptor) are transmitted with RTP payload type 13, CN (Comfort Noise).

It is worth noting that SID packets can be dropped by SIP proxy servers, therefore, for verification, it is advisable to configure the transmission of RTP traffic past SIP servers.
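The simplest form of VAD compares the energy of each frame with a threshold. The sketch below illustrates only that idea: the -40 dBFS threshold, the 20 ms frames of 16-bit samples and the function names are all assumptions, and the detectors built into codecs such as G.729 are considerably more elaborate.

```python
import math


def frame_level_db(frame):
    """RMS level of a frame of 16-bit PCM samples, in dB relative to full scale."""
    mean_square = sum(x * x for x in frame) / len(frame)
    if mean_square == 0:
        return float("-inf")                  # digital silence
    return 10 * math.log10(mean_square / 32768.0 ** 2)


def is_voice(frame, threshold_db=-40.0):
    """Classify a frame as active (True) or inactive (False) by its energy."""
    return frame_level_db(frame) > threshold_db


# Synthetic 20 ms frames at 8 kHz: a loud tone and a very quiet one.
loud = [int(8000 * math.sin(2 * math.pi * 400 * n / 8000)) for n in range(160)]
quiet = [int(30 * math.sin(2 * math.pi * 400 * n / 8000)) for n in range(160)]
print(is_voice(loud), is_voice(quiet))
```

For frames classified as inactive, a sender could skip transmission or send a SID packet with the measured noise level instead of the encoded audio.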

CNG (comfort noise generation) - a procedure for generating comfort noise based on information from SID packets. Thus VAD and CNG work together, but CNG is in much less demand, since its operation is not always noticeable, especially at low volume.

PLC (packet loss concealment) - the procedure for recovering an audio stream in case of packet loss. Even with 50% packet loss, a good PLC algorithm achieves acceptable speech quality. There will be distortions, of course, but the words can be made out.

The easiest way to emulate packet loss (on Linux) is to use the tc utility from the iproute2 package with the netem module. It shapes outgoing traffic only.

An example of starting network emulation with 50% packet loss:

tc qdisc add dev eth1 root netem loss 50%

Disable emulation:

tc qdisc del dev eth1 root

Jitter buffer - a procedure for removing the effect of jitter, where the interval between received packets varies widely and, in the worst case, packets arrive out of order. Jitter also causes interruptions in speech. To eliminate it, the receiving side must implement a packet buffer large enough to restore the original order of the packets and play them out at the given interval.

You can also emulate the jitter effect using the tc utility (the interval between the expected moment of packet arrival and the actual moment can be up to 500 ms):

tc qdisc add dev eth1 root netem delay 500ms reorder 99%
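The reordering part of a jitter buffer can be sketched with a priority queue keyed by the RTP sequence number. This is a deliberately simplified model: it uses a fixed depth of three packets, ignores playout timing and sequence number wrap-around at 65535, and the class and function names are made up for illustration.

```python
import heapq


class JitterBuffer:
    """Hold packets until `depth` of them are queued, then release them in order."""

    def __init__(self, depth=3):
        self.depth = depth
        self._heap = []

    def push(self, seq, payload):
        heapq.heappush(self._heap, (seq, payload))

    def pop(self):
        """Return the oldest queued packet, or None while the buffer is filling."""
        if len(self._heap) < self.depth:
            return None
        return heapq.heappop(self._heap)

    def drain(self):
        """Release everything that is left, still in sequence order."""
        while self._heap:
            yield heapq.heappop(self._heap)


def reorder(arrivals, depth=3):
    """Feed sequence numbers in arrival order and return the playout order."""
    buf = JitterBuffer(depth)
    played = []
    for seq in arrivals:
        buf.push(seq, b"")
        packet = buf.pop()
        if packet is not None:
            played.append(packet[0])
    played.extend(seq for seq, _ in buf.drain())
    return played


print(reorder([1, 3, 2, 5, 4, 6]))
```

A deeper buffer tolerates more reordering and delay variation but adds latency, since each packet waits until `depth` packets are queued.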

LEC (Line Echo Canceller) - a procedure for cancelling local echo, which occurs when the remote subscriber starts to hear their own voice. Its essence is to subtract the received signal, scaled by a certain coefficient, from the transmitted signal.

Echoes can occur for several reasons:

acoustic echo due to a poor-quality audio path (sound from the speaker gets into the microphone);
electrical echo due to an impedance mismatch between the telephone set and the SLIC module. Most often it occurs where a 4-wire telephone line is converted to a 2-wire one.

It is not difficult to find out the reason (acoustic or electrical echo): the subscriber on whose side the echo is generated must turn off the microphone. If the echo appears anyway, then it is electrical.

For more information on VoIP and DSP procedures, see the book VoIP Voice and Fax Signal Processing. A preview is available on Google Books.

This concludes our brief theoretical overview of VoIP. If you are interested, an example of a practical implementation of a mini-PBX on a real hardware platform can be covered in the next article.

Questions and comments are welcome. They will be answered by the author of the article, Dmitry Valento, software engineer at the Promwad electronics design center.
