paulgorman.org/technical

Docker

(July 2017, Feb 2018)

Docker is a container system. Docker containers are single-app rather than full-system containers. In fact, a Docker container more closely resembles a single, isolated process (albeit one that may spawn child processes) than a traditional virtual machine. A container might spin up only to handle one or a handful of requests before exiting. Docker best matches applications that either are stateless or keep their state in an external store, like a database outside Docker.

Docker containers are light-weight, and operate using standard Linux technologies like cgroups and namespaces. Beyond neatly bundling existing technologies, Docker adds a powerful API for container administration.

Docker has a Docker Server that hosts containers. One Docker server/daemon instance runs on a box, and manages multiple containers. The same binary provides both server and client.† The client sends commands to the server. Optionally, a third component, the Docker Registry, stores images and metadata. A registry may serve multiple Docker daemons on multiple hosts. There is a central public Docker registry, but Docker also supports private registries.

The Docker client speaks to the server in one of three ways: via Unix socket, via unencrypted TCP (port 2375), or encrypted TCP (port 2376). On systemd boxes, the default is a Unix socket, but one controlled by systemd socket activation [1] [2] (try systemctl status docker.socket).
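
For example, point the client at a remote daemon with the global -H flag or the DOCKER_HOST environment variable (a sketch; the hostname is made up, and unencrypted TCP should only be used on a trusted network):

$  docker -H tcp://dockerhost.example.com:2375 info
$  export DOCKER_HOST=tcp://dockerhost.example.com:2375
$  docker ps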

Docker images use a “union” file system that starts each container instance from a known state. I.e., changes that occur in a container spun up from a master image do not affect new containers that use the same base. When a running container modifies a file, the change affects only a read-write layer that sits on top of and masks the base image. A Docker image may be comprised of numerous stacked (mostly read-only) layers that nevertheless present a coherent image to an instance that uses it. To an instance, the file system appears writable, but changes don’t affect the underlying layers. The union file system uses copy-on-write for efficiency.

So these are the core Docker things:

- the Docker daemon/server, which hosts and supervises containers
- the Docker client, which sends commands to the daemon
- images, the read-only templates containers start from
- containers, the running instances
- registries, which store and distribute images

† Although one binary provides all functionality, various manual pages split the documentation. The dockerd(8) page covers daemon mode, for example, while the docker-run(1) page covers running containers.

Installation

The Docker project provides packages and good directions for several Linux distributions, including Debian and CentOS.

Verify the installation:

# docker run hello-world
# docker ps -a
# docker images -a

# docker pull fedora
# docker run -it fedora /bin/bash
[root@2ab31fa5597a /]# dnf update
[root@2ab31fa5597a /]# dnf install asterisk

# docker run -it alpine ash

# docker run -it --rm busybox:glibc

For development work, it’s handy to add your user account to the docker group:

#  usermod -a -G docker paulgorman

If the Docker install fails and journalctl -xe shows:

dockerd[18376]: Error starting daemon: Error initializing network controller: list bridge addresses failed: no available network

Fix this by adding a bridge for Docker:

#  ip link add name docker0 type bridge
#  ip addr add dev docker0 172.17.0.1/16
#  install docker-ce

The Docker daemon creates a number of iptables rules. It also sets the default policy on the FORWARD chain to DROP. If we, for example, want KVM guests to have unfettered access to br0, do something like:

#  iptables -I FORWARD -i br0 -o br0 -j ACCEPT

Tooling

A lot of auxiliary and third-party tooling is available, much of it focusing on orchestration and cluster management. The docker binary itself provides core container management tooling in two ways:

- the classic top-level subcommands, like docker ps, docker images, and docker run
- the newer grouped management commands (Docker 1.13 and later), like docker container ls and docker image ls
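
For example, these pairs are equivalent:

$  docker ps -a
$  docker container ls -a

$  docker images
$  docker image ls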

Networking

By default, Docker isolates containers on a bridge on the host called docker0. Containers on the same host can talk to each other, but not to the external/real network. Forward host ports into the bridge to connect containers to the outside.
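
A minimal sketch of publishing a port, using the stock nginx image from the public registry:

$  docker run -d --name web -p 8080:80 nginx
$  curl http://localhost:8080/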

See the --bridge flag in dockerd(8) to specify a non-default bridge, or to disable networking altogether. Disable inter-container communication with the --icc=false option.

Persistent Data

Transience and consistency are selling points of containers, certainly of Docker containers. However, sometimes we need to persist data beyond one run of the container, or to share changes between running instances.

Through the years, Docker has offered various solutions for persistence. Initially, these included injecting data into the container at launch or mounting volumes over NFS. Those solutions proved inadequate. Docker 1.8 and earlier advocated “data only containers” — barebones containers that do nothing besides exposing a data volume. Docker 1.9 and later shifted the recommendation to the “volume” API:

$ docker volume create --name my_data
$ docker run -d -v my_data:/container/path/for/volume container_image my_command
$ docker volume ls
$ docker volume inspect volume_name
$ docker volume ls -f dangling=true
$ docker volume rm my_unwanted_data

Data volumes avoid the copy-on-write mechanism of regular containers, expose the data for management by the host, and can be accessed by multiple instances. The data exists outside Docker’s union file system. Note that this may cause challenges, e.g., file locking.
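
For example (a sketch reusing the my_data volume created above; the container names and paths are arbitrary), two containers can mount the same volume:

$  docker run -d --name writer -v my_data:/data alpine sh -c 'date > /data/started; sleep 3600'
$  docker run --rm -v my_data:/data alpine cat /data/started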

Another aspect of persistence is application configuration data. Ideally, a Dockerized application gets all its configuration in the form of environment variables passed to it by the container.
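
For example (the variable names and image name are hypothetical):

$  docker run --rm -e GREETING=hello alpine env
$  docker run -d -e DB_HOST=db.example.com -e DB_PORT=5432 my_app_image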

Note that, because of the overlay filesystem, writes inside the container perform poorly, so extensive writes are discouraged, even if we don’t care about persisting them.

Logs are not written inside the container. See “Docker Logs” below.

Union File System

Docker containers use a union file system (like union mounts from Plan 9, or a bit like qcow2 sparse images). This makes new container instances very cheap to create — in terms of disk space, a new container might only add a few KB on top of the space used by the underlying image.

Changes to an image accumulate in thin layers, presenting a “union” of the layers as one coherent file system to the container.
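
docker history lists those layers for an image, e.g. for the fedora image pulled earlier:

$  docker history fedora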

Dockerfiles

A Dockerfile specifies how to build an image. See docker-build(1).

FROM fedora:23

MAINTAINER Paul Gorman <paul@example.com>

LABEL "thing"="important note" "another"="reference this later"

ENV astuser asterisk

RUN dnf -y update && dnf -y install asterisk && dnf clean all

ADD ./*.conf /etc/asterisk/

EXPOSE 5060-5061/tcp
EXPOSE 10000-20000/udp

USER $astuser

# -f keeps Asterisk in the foreground so the container doesn't exit
CMD ["/usr/sbin/asterisk", "-f"]

Assuming ‘Dockerfile’ is in our current directory, build the container with:

$  docker build --tag "my_build" .

By default, Docker runs processes in the container as “root”, unless changed with the “USER” instruction. Don’t run production containers as “root”. (Even though the container provides some isolation, the container still uses the host kernel, where we don’t want it mucking around as root!)

Because each command in the Dockerfile adds a layer to the union filesystem, it’s good to combine lines like the dnf commands above. If the setup instructions are extensive enough to create a lot of layers, consider pulling down a shell script with ADD and feeding it to RUN instead of including all the setup directly in the Dockerfile.
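
A sketch of that approach (setup.sh is a hypothetical script kept next to the Dockerfile):

ADD setup.sh /tmp/setup.sh
RUN sh /tmp/setup.sh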

The CMD instruction sets what gets run when the container starts. The Dockerfile only contains one CMD (or, anyhow, only the last one actually happens).

Run our newly-built container:

$  docker run -d -p 8080:8080 my_build

docker run is a convenience wrapper that masks two commands: first docker create to create the container from an image, and then docker start to start it.
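
For example, roughly the equivalent of the docker run above (a sketch):

$  docker create -p 8080:8080 --name mycontainer my_build
$  docker start mycontainer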

By default, Docker gives the container a name like peaceful_blackwell (i.e., adjective_famousname). Override the default like docker run --name "mycontainer".

“Tags” name image builds. “Names” name container instances. Container names must be unique per Docker host.

Labels let us apply arbitrary key/value metadata to images or individual containers. Add labels during image creation and/or at container runtime. See the labels on a container with docker inspect peaceful_blackwell. Search containers for labels like:

$  docker ps -a -f label=deployer=Paul

Building a Docker Image

To build an image, feed a Dockerfile to the docker tool with the build subcommand. Each instruction in the Dockerfile generates an additional layer on top of the image, so it’s easy to understand how Docker composes the image.

See docker-build(1).

Managing Images and Containers

Delete an unwanted container, then remove the image for that container:

# docker images -a
REPOSITORY               TAG                 IMAGE ID            CREATED             SIZE
rocketchat/rocket.chat   latest              2cd49c2c326d        8 days ago          769MB
busybox                  glibc               2bbf44aed9f8        7 weeks ago         4.4MB
busybox                  latest              5b0d59026729        7 weeks ago         1.15MB
alpine                   latest              3fd9065eaf02        2 months ago        4.14MB
hello-world              latest              f2a91732366c        3 months ago        1.85kB
# docker rmi rocketchat/rocket.chat
Error response from daemon: conflict: unable to remove repository reference "rocketchat/rocket.chat" (must force) - container df3f9b43f8c7 is using its referenced image 2cd49c2c326d
# docker rm df3f9b43f8c7
df3f9b43f8c7
# docker rmi rocketchat/rocket.chat
Untagged: rocketchat/rocket.chat:latest
Untagged: rocketchat/rocket.chat@sha256:061dcb056431eccc6f7dce1e7ea400ccd31278dea2181c558b9c891bf3f0e141
Deleted: sha256:2cd49c2c326d8361fb8333db65e9bd0c551fb36ae3b64e2d8e534da8f5a4aafd
Deleted: sha256:d2ae5a7ae8b0a9526e20fbd8a4956ceffe5396934306d2f2736dcb3706eb327b
Deleted: sha256:7f11c62ba8995a6b7692fb3ff3a501984b068d2b454e409ff564b390ffb81903
Deleted: sha256:04f266c56df2117c1ecfad0c32711484c540f6949dace10dbb1e7fe5e8040c71
Deleted: sha256:e0f2864d8ad8234bf233bd3848be65e7a7358f2cfb3cd7e2792ca2c4c6aefc6f
Deleted: sha256:4bcdffd70da292293d059d2435c7056711fab2655f8b74f48ad0abe042b63687

Docker Logs

Docker logs anything written to STDOUT or STDERR from inside a container. The logging method is configurable. By default, Docker logs to a per-container JSON file.

$  docker logs --since 1h 1bd4c783ad93
$  docker logs --follow 1bd4c783ad93

Docker saves the JSON files in /var/lib/docker/containers/mycontainer/. With long-running or chatty containers, the defaults may be inadequate. For example, Docker does not rotate logs by default, although docker run has the options --log-opt max-size and --log-opt max-file.
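
For example, cap the default json-file logs at three files of ten megabytes each (the image name is hypothetical):

$  docker run -d --log-opt max-size=10m --log-opt max-file=3 my_app_image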

Other supported logging mechanisms include syslog and journald. See --log-driver in docker-run(1).

Q & A

Where does Docker store stuff?

Mostly in /var/lib/docker/.

And see:

$  docker container ls

How is Docker itself configured? How do we set where it listens for client connections?

The daemon gets most of its config as command-line arguments (it can also read options from /etc/docker/daemon.json). On systemd boxes, the command invocation happens in the service file. If we want to customize the service (on Debian):

#  cp /lib/systemd/system/docker.service /etc/systemd/system/
#  vim /etc/systemd/system/docker.service
#  systemctl daemon-reload
#  systemctl restart docker.service

The default docker.service file only sets the method of client communication. With -H fd://, Docker expects the process that spawned it (i.e., systemd) to hand it an already-activated socket.
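
For example, to have the daemon also listen on unencrypted TCP, edit the ExecStart line in the copied service file (a sketch; keep whatever flags the stock unit already passes, and only expose TCP on a trusted network):

ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375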

How does a container know which DNS server to use, etc.?

When a container starts, Docker copies various files from the host (hostname, hosts, resolv.conf) to /var/lib/docker/containers/mycontainer/, then bind mounts them into the container. Override or augment this behavior with arguments to docker run: --hostname, --dns, --dns-search, --add-host.
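
For example (the addresses and hostnames are made up):

$  docker run --rm --dns 10.0.0.53 --dns-search example.com --add-host db.internal:10.0.0.12 alpine cat /etc/resolv.conf /etc/hosts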

How do we constrain resources used by a container?

Docker allocates CPU in terms of “shares”, a relative weight that defaults to 1024 per container. Under CPU contention, a container allocated 512 shares gets half as much CPU time as one with the default 1024, for example. Configure this with the --cpu-shares argument to docker run.

Constrain memory with -m, like docker run -m 1g …. By default this also allows a matching amount of swap; adjust the combined memory-plus-swap limit with --memory-swap.

Constrain IO like --blkio-weight=500. Use a value between 10 and 1000 (default 500).

These constraints are enforced by cgroups.

It’s possible to adjust the constraints of a running container. See docker-update(1).
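
A sketch combining these (the image name is hypothetical):

$  docker run -d -m 1g --cpu-shares 512 --blkio-weight 300 --name myapp my_app_image
$  docker update --cpu-shares 256 myapp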

Will a container automatically restart?

By default, no. Set --restart, like --restart="on-failure:3".

What if docker stop doesn’t end a misbehaving container?

$  docker kill 1bd4c783ad93

Just like the system kill, docker-kill can send other signals with --signal=HUP or whatever.

How do we get rid of unwanted containers and images?

docker rm or docker rmi.

What’s up with this container?

$  docker inspect 75625e1f51a0

But what’s going on right now? This is like top for running containers:

$  docker stats

How do we open a shell in a running container?

$  docker exec -t -i 75625e1f51a0 /bin/bash

It’s also possible to use nsenter to directly break into the container namespace from the host.
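
A sketch of the nsenter approach (the container name is hypothetical):

#  pid=$(docker inspect -f '{{.State.Pid}}' mycontainer)
#  nsenter --target $pid --mount --uts --ipc --net --pid /bin/sh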

How do we disconnect from a container’s shell without letting the container die?

Ctrl-p Ctrl-q

What if we want multiple containers to share a custom network namespace?

We might expect to create a namespace like ip netns add foo, then run the container like docker run --netns=foo. That doesn’t work.

The next-best thing is to create the network like docker network create foo, and then docker run --network=foo. However, ip netns list will not include foo. Why? ip netns list looks for files in /run/netns/, but Docker keeps its network namespace files under its own directory (/var/run/docker/netns/), so ip netns list isn’t aware of them. We could, if we had any reason to, expose a Docker network namespace by re-linking the container’s /proc/$PID/ns/net file into /run/netns/.
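
A sketch of that re-linking trick (the container name is hypothetical):

#  pid=$(docker inspect -f '{{.State.Pid}}' mycontainer)
#  mkdir -p /run/netns
#  ln -sf /proc/$pid/ns/net /run/netns/mycontainer
#  ip netns exec mycontainer ip addr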

It’s also possible to start a container with --network=none and afterwards attach it to a network namespace with a veth pair.

How do we update our existing container?

Use docker pull. Grab the new image version, tear down the old container, and spin up a new container with the new image.

$  docker pull theimage
$  docker stop mycontainer
$  docker rm mycontainer
$  docker run -d --restart unless-stopped --name mycontainer theimage

Atomic Hosts

The atomic host concept involves a light-weight container-supervisor OS — a minimal, immutable OS image. The host configuration comes from the network — e.g., by cloud-init and OSTree. To update the host, simply swap out that OS image atomically, and let the new instance pull down its config from the network again.

Project Atomic is a Red Hat-based atomic host project. CoreOS and RancherOS are similar.

http://www.projectatomic.io/

CentOS has Atomic Host builds available as an ISO for bare-metal installs, an Amazon AMI image, and a QCOW2 image for KVM.

Many technologies come together in Project Atomic:

- Docker for running containers
- Kubernetes for orchestration
- rpm-ostree/OSTree for atomic OS updates and rollbacks
- cloud-init for initial host configuration
- SELinux for isolation

cloud-init

Before spinning up our first Atomic host, we need cloud-init in place to handle early initialization of the instance. Cloud-init does things like:

- set the hostname
- add users and install their SSH keys
- write configuration files
- run commands on first boot

Cloud-init is not magic. Essentially, the config files get packed into an ISO image that gets attached to the booting Atomic Host virtual machine, and cloud-init inside the guest reads them at first boot.
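
A minimal sketch using cloud-init’s NoCloud data source, where user-data and meta-data are files we write by hand and the ISO must carry the volume label “cidata”:

$  genisoimage -output init.iso -volid cidata -joliet -rock user-data meta-data

Attach init.iso to the virtual machine as a CD-ROM, and cloud-init finds it by that volume label at first boot.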

See https://paulgorman.org/technical/cloud-init.txt.html.


UPDATE: In 2018, Red Hat acquired CoreOS, leaving it uncertain how CoreOS will be merged with Project Atomic.

I expect that over the next year or so, Fedora Atomic Host will be replaced by a new thing combining the best from Container Linux and Project Atomic. This new thing will be “Fedora CoreOS” and serve as the upstream to Red Hat CoreOS.

https://lwn.net/Articles/757878/

Project Atomic is an umbrella project consisting of two flavors of Atomic Host (Fedora and CentOS) as well as various other container-related projects. Project Atomic as a project name will be sunset by the end of 2018 with a stronger individual focus on its successful projects such as Buildah and Cockpit.

https://coreos.fedoraproject.org/