I decided it was time I learned about containers. My immediate goal is to run the Pi-Hole DNS server in a container, either via docker-pi-hole or by installing it manually under LXC. It’s a toy problem, but it’s the one in front of me at the moment, and really it’s an excuse to learn more about containers. I’m a newbie here and there’s a lot of depth I don’t have, but maybe my notes are helpful to someone.
What’s a container? It’s a way to run a program or collection of processes so that the service they implement is relatively isolated from the host operating system and other services running on the host. The isolation is mostly for system administration convenience, a little like we use virtualenv in Python or node_modules in Node to keep libraries separate. Only a lot more complicated than just a bundle of files. Containers package up files, running processes, network access, I/O, device access, etc.
A key reason people like containers is they provide a way to isolate a complex software package in a repeatable lightweight environment. You can hand a new employee a Docker image for their development environment without laborious instructions on how to get set up to work. Or a developer can spin up an environment on their dev machine that’s just like the one in production, run some tests, tear it down, and rebuild it frequently and fast. See this AWS Lambda emulator as an example.
Another key piece of containers is they are lightweight. A VM takes 10-60 seconds to start up (or more); a Docker container can take less than a second. The container doesn’t need much more memory than the processes running natively. And there’s a lot of tooling that makes it easy to build container images and manage them in a space-efficient manner.
Some overview reading:
To understand a thing, first understand what is not that thing.
Virtual machines are not containers. Well, a VM is a container in some sense, but it’s more heavyweight than what people usually call a container. In a container there’s not a whole kernel and guest OS running; the processes are running directly on the host kernel. But those processes are isolated in important ways. A container is more like the traditional Unix security model of running a service as a user with access limited to just some small set of files. Or like the venerable chroot, which dates back to v7 Unix in 1979. But the old Unix methods of isolating a service’s processes don’t work very well, so modern containers use a newer set of kernel interfaces to enforce isolation.
AppArmor is also not containers. AppArmor uses Linux Security Modules to grant certain programs a very narrow set of permissions. These are configured with files in /etc/apparmor.d; you can browse Ubuntu’s here. For instance chronyd is limited to things like setting the time, binding a network socket, and writing a few files like logs in specific chronyd directories. If chronyd starts trying to do anything else, AppArmor should stop it. AppArmor has some overlap with container tech (LSM is a bit like seccomp), and some containers (like Docker) use AppArmor as extra security. But AppArmor is intended as a security measure, not a devops tool. (Same goes for SELinux; it can be used to help secure a host running containers, but it’s not a core component of the container.)
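If you want to poke at this on your own machine, the AppArmor userspace tools will show you what’s loaded and being enforced. A minimal sketch (aa-status comes from Ubuntu’s apparmor-utils package):

```
# List the AppArmor profiles currently loaded, and whether each is in
# enforce or complain mode.
sudo aa-status

# The profiles themselves are plain text policy files you can read.
ls /etc/apparmor.d/
```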
Container technologies in the Linux kernel
Containers implement isolation with help from the kernel. Unix has always had a modest concept of processes being isolated from each other: memory isn’t shared by default, and user permissions enforce some isolation on who can access files and other resources. But it’s not very secure and it’s not very flexible, so modern containers rely on newer APIs developed in the last 10-15 years. Many of these are fairly Linux specific, although other Unixes were influential in the 90s and 00s with isolation technologies like BSD Jails and Solaris Zones.
Namespaces are a Linux kernel feature for completely isolating one set of processes from another. In normal Unix you can see another user’s processes even if you can’t do anything to them. Inside a PID namespace you can’t even see others’ PIDs to try to access them. Same goes for filesystem namespaces, network namespaces, etc. Namespaces are created with the unshare(1) command which uses the unshare(2) system call. See also namespaces(7). You can create separate namespaces for mounted filesystems (a bit like chroot), hostname, IPC, IP network, PIDs, cgroups, and users.
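You can play with namespaces directly from a shell, which makes the idea much more concrete. A small sketch using unshare(1), run as root:

```
# Start a shell in new PID and mount namespaces. --mount-proc remounts
# /proc inside the namespace so ps only sees the namespace's processes.
sudo unshare --pid --fork --mount-proc bash

# Inside: only this shell and ps are visible, and the shell is PID 1.
ps aux

# A fresh network namespace contains nothing but a down loopback device.
sudo unshare --net ip addr
```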
cgroups (“control groups”) are a way to manage resource usage for a group of processes. nice(1) is an example of a very simple form of CPU resource control. cgroups let you control not only priority (like nice) but also the absolute amount of CPU used in a period of time, memory usage, I/O, process creation, device access, etc. The API is quite complicated, I think mostly because it works on groups of processes (and threads). See also the cgroup v2 design doc.
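The interface is just files under /sys/fs/cgroup on a cgroup v2 system. Here’s a rough sketch, assuming the cpu and memory controllers are enabled for child groups (which they usually are on a systemd machine):

```
# Make a new control group.
sudo mkdir /sys/fs/cgroup/demo

# Cap memory at 256 MB, and allow at most 50ms of CPU per 100ms period
# (i.e. half of one CPU).
echo 256M | sudo tee /sys/fs/cgroup/demo/memory.max
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max

# Move the current shell into the group; everything it starts from now on
# inherits the limits.
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
```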
Seccomp-bpf is a way to control what system calls a process makes. System calls are the primary way a process affects the rest of the world; limiting system call access can greatly isolate a set of processes.
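You rarely write the BPF filter yourself; higher-level tools do it for you. systemd’s SystemCallFilter= setting is an easy way to see the idea in action, and Docker applies its own default seccomp profile to every container. A sketch:

```
# Run a one-off command as a transient systemd service with a seccomp
# filter: only the syscalls in the @system-service group are allowed,
# everything else fails with EPERM.
sudo systemd-run --wait --pipe \
    -p SystemCallFilter=@system-service \
    -p SystemCallErrorNumber=EPERM \
    /bin/echo "hello from a filtered process"

# Docker does the same kind of thing; you can swap in your own profile
# (or disable it) with --security-opt seccomp=<profile.json|unconfined>.
```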
capabilities are a different way to restrict what a process can do, based on abstract capability concepts instead of system calls. I mostly think of it as a way to avoid giving a process full root access just to do one simple thing. Instead you can give a process just the ability to bind privileged ports or set the system clock or whatever specific capability it needs.
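The classic example is binding a low port. A sketch (the server path and image name are hypothetical; setcap/getcap come from the libcap tools):

```
# Grant only CAP_NET_BIND_SERVICE to a server binary so it can bind
# ports below 1024 without running as root.
sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/myserver
getcap /usr/local/bin/myserver

# Docker's version of the same idea: drop every capability, then add back
# just the one the image actually needs.
docker run --rm --cap-drop=ALL --cap-add=NET_BIND_SERVICE some-image
```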
A word on security
I can’t figure out how secure containers are supposed to be. How hard should it be for a rogue process to break out of a container? The best answer I’ve gotten is “it depends on how secure you made your container”. (Which in practice seems to be “not very”.) FWIW I asked on Twitter and pretty much no one considers a Docker container to be a significant security measure.
I looked a bit closer into this with Docker in particular. Docker is so complicated and uses so many different Linux security mechanisms that exploitable bugs are inevitable. And Docker is designed to not be fully isolated or virtualized; most Docker containers deliberately have at least some access to the host system, a network port or part of the filesystem at least. The Docker folks do take security seriously, and when security holes are found they get the full CVE treatment. Still it seems no one considers Docker a full solution for running untrusted code safely.
One big thing I learned is if a process has root privileges inside Docker, then if it escapes the container it very likely will have root outside the container too. There is a way to map root-in-the-container to a non-privileged user outside the container, but it’s complicated and I’m not sure how common it is. Pretty sure it’s not the default.
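The mapping feature is Docker’s userns-remap mode; as far as I can tell you turn it on daemon-wide, and UID 0 in containers then maps to an unprivileged range from /etc/subuid on the host. A sketch (careful: this clobbers any existing daemon.json and restarts all containers):

```
# Tell dockerd to remap container users onto a subordinate UID/GID range
# owned by a "dockremap" user it creates for the purpose.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "userns-remap": "default"
}
EOF
sudo systemctl restart docker

# Root inside a container is now a high, unprivileged UID on the host.
```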
Bottom line: don’t assume a malicious program inside a container is really trapped inside the container. Too bad, I’d been hoping Docker would be a good way to mitigate the risks of daemons that run as root. It might yet be such a tool but it ain’t simple.
Container implementations and tools
So there’s a bunch of low level kernel support for isolating processes in Linux. How do you use them conveniently? A container implementation bundles all those mechanisms together into one easy-to-use tool. And miraculously, a tool that might even work on something other than Linux (perhaps with a VM to help things along). There are a bunch of container implementations out there, let’s look at a few of them.
Docker is the big player for containers and whole books are written about it. Docker wraps up a bunch of those isolation technologies and makes them easy for an end user to use. Docker is cross platform: you can run other OSes both as Docker guests and as Docker hosts, although often when you look there’s a VM involved in making that possible.
To use Docker you start by getting a Docker image from a repository like Docker Hub, or else create your own image. Docker images are created from a specification in a Dockerfile; the image is a collection of files and expected behavior. You then launch the image via Docker’s runC / containerd system, which creates a container. Inside that container environment is whatever the image specifies: maybe one simple command line tool, a persistent daemon, or a full-fledged operating system image with interactive shells. Docker takes care of managing file persistence so you can shut a container down and restart it again with the same files. Docker’s use of overlay file systems is particularly nice here: you can assemble a container out of a stock Ubuntu file system, an overlay for (say) a database server, and then an extra overlay for your actual database state that’s changing all the time. It’s all quite efficient.
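As a concrete sketch of that workflow (the image name and app.py are made up for illustration):

```
# A minimal Dockerfile: a stock base layer plus one layer of our own.
cat > Dockerfile <<'EOF'
FROM python:3.11-slim
COPY app.py /app/app.py
CMD ["python", "/app/app.py"]
EOF

# Build an image from it, then launch a container from that image,
# publishing the container's port 8000 on the host.
docker build -t my-app .
docker run --rm -p 8000:8000 my-app
```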
The key thing about Docker is that it’s a pretty solid tool, friendly for developer-users. Also Docker Hub is amazing, it’s a community managed library of operating system images with all sorts of complex and useful things preinstalled. My devops friends also like the way a developer can run a Docker container on their dev machine to test their product, then run basically that same container on production machines to serve live traffic.
Docker’s not the only game in town though. Linux Containers (aka LXC/LXD) are a nice smaller alternative. (Docker started life as an LXC wrapper although it has since evolved.) I liked this tutorial on running Pi-Hole in LXC because it’s so straightforward and simple compared to all the stuff surrounding Docker. OTOH it lacks all the conveniences the Docker ecosystem has built up.
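For comparison, the LXD command line is pleasantly terse; a sketch (the container name and image are just examples):

```
# Launch a fresh Ubuntu container from the public image server.
lxc launch ubuntu:22.04 pihole

# Get a shell inside it and install things by hand, like a tiny VM.
lxc exec pihole -- bash

# Containers are managed much like VMs.
lxc list
lxc stop pihole
lxc delete pihole
```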
Snap also has my attention because it’s built in to Ubuntu. It’s Canonical’s answer to Docker. I haven’t heard of many people really using Snap but there’s a few hundred Snap-packaged apps in the store (including some that baffle me; why package the command line tool jq as a Snap?) Most of what I’ve read about it is trying to use Snap to solve the “how do I distribute user applications on Linux?” problem. Canonical wants Snap to work across Linux distros. Also I’ve read a claim that Snap is somehow better suited for GUI apps than Docker would be. It’s interesting that Firefox and Slack and the like are available as snaps.
gVisor is an interesting alternative I mention briefly because it takes a different approach. Instead of a bunch of Linux isolation technologies it works by emulating most of the Linux kernel in a user space process. That makes it more like a VM, and also a little like the Windows Subsystem for Linux. I have no idea if anyone uses it, but it’s nice that it drops in to Docker as an alternative for runC.
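gVisor’s runtime is called runsc; once the Docker daemon knows about it, you pick it per container. A sketch, assuming runsc is already installed:

```
# Register runsc as an extra runtime in /etc/docker/daemon.json
# (gVisor ships a helper for this) and restart the daemon.
sudo runsc install
sudo systemctl restart docker

# Run an ordinary image, but with its syscalls handled by gVisor's
# user-space kernel instead of the host kernel.
docker run --rm --runtime=runsc alpine uname -a
```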
So there’s a bunch of ways to run a single container. In addition there are complex orchestration software layers out there to help you manage a machine cluster running lots of services in containers. Kubernetes is the one that gets most of the attention right now; there’s also Docker Swarm and Apache Mesos and a bunch of others. I haven’t looked into using any of these, but I can say back in the day Google Borg was The Shit, a key part of how that company could operate at the scale they do.
So what about Pi-Hole in a container?
So now that I’ve read several books on yak shaving… should I run Pi-Hole in a container at home? I think no. The main problem is that, at least with Docker, I still have to make changes on my host computer in order to use the containerized Pi-Hole. And there’s no real reason I can’t just run Pi-Hole directly.
Pi-Hole distributes a nice Docker image I got up and running. But it’s a bit tricky. First, Pi-Hole wants to own port 53/TCP and 53/UDP on your host, to provide DNS services. Fair enough, but by default on Ubuntu systemd-resolved has already laid claim to port 53 and gets in the way, because of the strange way Ubuntu machines do DNS. So you have to modify the host network config to make the port available, which sort of obviates the whole “I’m running Docker so I don’t have to make changes on my host machine” argument. Pi-Hole also wants 80/TCP and 443/TCP: not just for the admin page but also to redirect ad HTTP requests to something that returns an empty page. Again, same problem; I already have a web server.
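For the record, here’s roughly what getting it running looks like; the timezone and volume names are my choices, and you should check the docker-pi-hole README rather than trusting this sketch:

```
# Free port 53 by turning off systemd-resolved's stub listener.
# (This is exactly the host-level change I was complaining about.)
sudo sed -i 's/#DNSStubListener=yes/DNSStubListener=no/' /etc/systemd/resolved.conf
sudo systemctl restart systemd-resolved

# Run Pi-Hole, publishing DNS and the web UI on the host and keeping its
# state in named volumes so the container can be torn down and rebuilt.
docker run -d --name pihole \
  -p 53:53/tcp -p 53:53/udp \
  -p 80:80/tcp \
  -e TZ=America/Los_Angeles \
  -v pihole_etc:/etc/pihole \
  -v pihole_dnsmasq:/etc/dnsmasq.d \
  --restart unless-stopped \
  pihole/pihole
```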
Bottom line: the Pi-Hole service wants to do stuff in the host’s network namespace, and that’s all a bit awkward since the whole point of Docker is to contain things. I’m out of my depth here, but it’s a shame there’s not some way to just publish the services on a second IP address for the container, like the docker0 network device listening on 172.17.0.1. Actually that may be entirely possible, but then I’d have to convince my whole home network that it can find 172.17.0.1, and that’s more routing infrastructure than I have.
On top of it being awkward, Docker does introduce a fair amount of overhead. There’s the docker daemon itself (1 gigabyte of RAM mapped! Although only 100MB used.) Also containerd. Those are more daemons to fail, or have security holes, or just to have to know about.
Still, it all more or less works with Docker. And the overhead wouldn’t be such a big deal if I were using Docker for other things; installing various versions of PHP, say. And if I didn’t quite trust Pi-Hole to be well behaved, Docker would be a more appealing choice. I kind of like the idea of an OS where every user process runs in something like a container. We’re not quite there yet.
I’m glad I did all this learning about containers! They’re neat! But they mostly solve a complexity problem I don’t really have in my little world. I still get by OK hand-managing a single Linux server and making sure my stuff is all installed in separate directories as systemd services or whatever. I definitely see where Docker is a huge help though when coordinating with people or reusing stuff in a more composed way.