paulgorman.org/technical

Linux Traffic Control

(January 2019)

Traffic control determines which network packets to accept and at what rates to send them. By default, traffic flows into a single queue and leaves that queue on a first-in, first-out basis. More complex schemes are possible, addressing problems like fair sharing of limited bandwidth or prioritizing latency-sensitive traffic (e.g., VoIP).

The term “QoS” (quality of service) is a synonym of “traffic control”.

tc is the user-space utility to control the Linux kernel packet scheduler. Most Linux distros include tc with their iproute2 package. See tc(8).

A commented example of adding a qdisc using tc:

$  tc qdisc add     $(# Add a queuing discipline )             \
      dev vti0      $(# Attach qdisc to this device )          \
      root          $(# Apply to egress )                      \
      handle 1:0    $(# Name qdisc with major:minor numbers )  \
      htb           $(# Apply HTB queuing discipline )

Note that (as of 2019) tc-tbf(8) still discusses burst in terms of “timer tick” and HZ. However, modern Linux is tickless, with high-resolution timers, so the optimum burst setting depends mostly on the application. Burst is the size in bytes of the TBF bucket — i.e., the number of bytes that may pass through the queue at once, at full device speed, before the rate limit kicks in.
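
For example, a minimal TBF sketch (the device name and numbers are illustrative): shape egress on eth0 to 1 Mbit/s, let bursts of up to 32 kB through at full device speed, and queue at most 50 ms of backlog before dropping:

$ tc qdisc add dev eth0 root tbf rate 1mbit burst 32kb latency 50ms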

Concepts

Queues

A queue is a buffer that holds a finite number of packets waiting to be sent. A queue only becomes interesting for traffic control when we can delay, rearrange, drop, and prioritize queued packets, or even shift packets to other queues with different properties.

Flows

A flow is a distinct stream of packets sent from one particular source IP address to a particular destination IP address. Traffic control mechanisms often classify packets into flows, which can be managed in aggregate.

Tokens and buckets

Controlling the number of packets or bytes dequeued by setting timers and counting is expensive and complex. A cheaper, simpler method generates tokens at a fixed rate, and dequeues packets only when a token is available.

Imagine people waiting in line for an amusement park ride. The ride runs on rails, and a car arrives every minute to pick up another rider. The car represents a token. The rider represents a packet.

A bucket is like a small, connected train of cars that can pick up several riders every minute. If not enough people are waiting to fill the train, it departs on schedule anyhow. A bucket holds multiple tokens.

The basic shaping qdisc in Linux traffic control is the “token bucket filter” (TBF). It transmits packets to match the available tokens, but defers any packets that exceed the number of tokens.

Packets and frames

Chunks of data in layer 2 are “frames”. Ethernet, for example, sends frames.

Chunks of data in layer 3 are “packets”. IP, for example, sends packets.

Discussions of traffic control generally call all chunks of data “packets”, despite “frame” sometimes being the more correct term.

Elements Common to Traffic Control Systems

Shaping

Shapers delay output of packets according to a set rate.

Scheduling

Schedulers arrange/rearrange packets for dequeuing.

FIFO is the simplest scheduler. A “fair queuing” scheduler (e.g., SFQ) tries to keep any one client/flow from dominating. A round-robin scheduler (e.g., WRR) gives each flow a turn to dequeue packets.
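
For example, one way to get fair queuing on an interface (the device name is illustrative) is to replace its default egress qdisc with SFQ, re-keying the flow hash every 10 seconds:

$ tc qdisc add dev eth0 root sfq perturb 10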

Classifying

Classifiers sort packets into different queues, and optionally mark them. Classifiers may work together with policers.

Linux traffic control can cascade a packet through a series of classifiers.

Policing

Policers limit traffic in a particular queue, usually to restrain a peer to a certain allocated bandwidth. Excess traffic might be dropped or (better) reclassified.

Unlike shaping, which can delay a packet, policing is a binary decision — either enqueue the packet or take some other action, such as dropping or reclassifying it.
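
For example, a sketch of the classic ingress-policing recipe (the device and rate are illustrative): match all IPv4 traffic and drop whatever arrives faster than 1 Mbit/s:

$ tc qdisc add dev eth0 handle ffff: ingress
$ tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
      police rate 1mbit burst 10k drop flowid :1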

Dropping

Dropping discards a packet.

Marking

Marking alters a packet by setting a DSCP (Differentiated Services Code Point) in its header, which other routers in the domain can read and act on.

This is not the same thing as iptables/Netfilter marks, which only affect packet metadata.
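
One common way to set a DSCP is with iptables rather than tc (the port and class here are illustrative). Note the contrast with the MARK target, which only tags kernel metadata:

$ iptables -t mangle -A POSTROUTING -p udp --dport 5060 -j DSCP --set-dscp-class EF
$ iptables -t mangle -A POSTROUTING -p udp --dport 5060 -j MARK --set-mark 0x1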

Components of the Linux Traffic Control Implementation

qdisc

A qdisc is a scheduler. “qdisc” means “queuing discipline”.

These can be either classful qdiscs or classless qdiscs. Classful qdiscs contain classes and provide handles to attach filters. Classless qdiscs contain no classes or filters.

Confusingly, the “root qdisc” and “ingress qdisc” are not qdiscs in the sense we mean here — they’re just a pair of hooks that come with each interface. We can attach qdiscs to these hooks — most commonly to the “root qdisc”, which corresponds to egress traffic.

We can attach qdiscs with classes or filters to the root qdisc. The ingress qdisc only accepts qdiscs with filters, not classes. So, ingress is more limited than root.
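
For example (eth0 is illustrative), attach a classful qdisc to the root hook and the special ingress qdisc to the ingress hook, then list both:

$ tc qdisc add dev eth0 root handle 1: htb
$ tc qdisc add dev eth0 ingress
$ tc qdisc show dev eth0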

class

A class contains either several child classes or one child qdisc.

We can also attach any number of filters to a class, which send traffic into a child class or reclassify/drop it.
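
As a rough sketch (device, rates, and class numbers are illustrative): an HTB root qdisc with one parent class, two child classes splitting the bandwidth, and an SFQ qdisc as the leaf of one child:

$ tc qdisc add dev eth0 root handle 1: htb default 20
$ tc class add dev eth0 parent 1: classid 1:1 htb rate 10mbit
$ tc class add dev eth0 parent 1:1 classid 1:10 htb rate 8mbit ceil 10mbit
$ tc class add dev eth0 parent 1:1 classid 1:20 htb rate 2mbit ceil 10mbit
$ tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10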

filter

A filter combines a classifier and, optionally, a policer.

A filter can be attached to a qdisc or a class.

A packet always gets screened through the filter attached to the root qdisc first, before it can be directed to any subclasses.
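
Continuing the HTB sketch above (the port and class numbers are illustrative), a u32 filter attached to the root qdisc steers SSH traffic into class 1:10:

$ tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
      match ip dport 22 0xffff flowid 1:10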

classifier

A classifier is one component of a filter. A classifier identifies a packet based on the packet’s characteristics or metadata. We manipulate a classifier with tc.

policer

A policer is used as part of a filter. A policer sets a threshold, and takes one action for traffic rates above that threshold, and another action for traffic rates below that threshold.

A policer never delays traffic.

drop

Dropping a packet. A deliberate drop only happens as a decision by a policer attached to a filter.

However, a drop might also happen as a side-effect. A shaper or scheduler might cause a traffic drop if a buffer fills during especially bursty traffic.

handle

A handle uniquely identifies a class or classful qdisc in the traffic control structure. The handle has two parts: a major number and a minor number. The user may assign the major and minor numbers arbitrarily, with a couple of conventions: a qdisc’s minor number is always zero (so a qdisc handle like “1:0” is usually written just “1:”), and the classes attached to a qdisc share the qdisc’s major number, each taking its own nonzero minor number.

Making Lossy/Jittery Interfaces for Fun and Testing

Make a thing we can ping that always has packet loss! Note that netem alterations only work on output, which is why we can’t just qdisc some dummy interfaces in the namespace. See tc-netem(8).

#!/bin/sh
set -euf

# Create troubled interfaces to test network monitoring tools.
# Run with `sudo`.

ip netns add ns-trouble

ip link add veth0-trouble type veth peer name veth1-trouble
ip link set veth1-trouble netns ns-trouble
ip addr add 172.20.0.1/12 dev veth0-trouble
ip link set veth0-trouble up

ip link add veth2-trouble type veth peer name veth3-trouble
ip link set veth3-trouble netns ns-trouble
ip addr add 172.20.0.3/12 dev veth2-trouble
ip link set veth2-trouble up

ip netns exec ns-trouble ip addr add 172.20.0.2/12 dev veth1-trouble
ip netns exec ns-trouble ip link set veth1-trouble up

ip netns exec ns-trouble ip addr add 172.20.0.4/12 dev veth3-trouble
ip netns exec ns-trouble ip link set veth3-trouble up

ip netns exec ns-trouble ip route add default via 172.20.0.2

ip route add 172.20.0.10 via 172.20.0.1
ip route add 172.20.0.20 via 172.20.0.3

ip netns exec ns-trouble ip link add lossy0 type dummy
ip netns exec ns-trouble ip link set dev lossy0 up
ip netns exec ns-trouble ip addr add 172.20.0.10/12 dev lossy0
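# Emulate ~30% packet loss (with 25% correlation) plus 3ms ± 30ms of delay
# on egress from the host side of this veth pair.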
tc qdisc add dev veth0-trouble root netem loss 30% 25% delay 3ms 30ms

ip netns exec ns-trouble ip link add latency0 type dummy
ip netns exec ns-trouble ip link set dev latency0 up
ip netns exec ns-trouble ip addr add 172.20.0.20/12 dev latency0
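# Emulate 50ms ± 500ms of delay (heavy jitter) on egress from the host side.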
tc qdisc add dev veth2-trouble root netem delay 50ms 500ms

# To clean up, run:
#    sudo ip netns del ns-trouble
🐚 ~ $ ping -c 5 172.20.0.10
PING 172.20.0.10 (172.20.0.10) 56(84) bytes of data.
64 bytes from 172.20.0.10: icmp_seq=1 ttl=64 time=0.091 ms
64 bytes from 172.20.0.10: icmp_seq=4 ttl=64 time=0.047 ms

--- 172.20.0.10 ping statistics ---
5 packets transmitted, 2 received, 60% packet loss, time 88ms
rtt min/avg/max/mdev = 0.047/0.069/0.091/0.022 ms
🐚 ~ $ ping -c 5 172.20.0.20
PING 172.20.0.20 (172.20.0.20) 56(84) bytes of data.
64 bytes from 172.20.0.20: icmp_seq=1 ttl=64 time=647 ms
64 bytes from 172.20.0.20: icmp_seq=2 ttl=64 time=16.7 ms
64 bytes from 172.20.0.20: icmp_seq=3 ttl=64 time=0.060 ms
64 bytes from 172.20.0.20: icmp_seq=4 ttl=64 time=881 ms
64 bytes from 172.20.0.20: icmp_seq=5 ttl=64 time=0.051 ms

--- 172.20.0.20 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 16ms
rtt min/avg/max/mdev = 0.051/308.972/880.818/378.855 ms