CentOS7 systemd not booting on LXD or docker containers on Ubuntu Impish
A few days after it came out, I upgraded from (K)Ubuntu 21.04 to 21.10. Yes, I am such a junkie.
It all seemed fine, at least desktop-wise, until I had to run some tests for the thing I develop at
$WORK. See, we use VM-like containers on top of both Docker and LXD. The images that these build
upon are some variation of CentOS7.x, because that's what we use as our product's OS. When I say
VM-like, I mean that inside runs a mostly full OS, from systemd up, not just a single process,
which is what Docker recommends. If you didn't guess from the title, the issue was that
the containers were not booting properly.
To complicate things, we actually have quite a deep stack, which we try to keep as stable as possible.
On top of it all, we have a Python+bash layer that allows us to describe things in terms of our product.
Right below this we have Terraform, and to talk to LXD we have, of course, terraform-provider-lxd.
LXD itself seems to be a mix of a CLI, an API server, and maybe LXC underneath? And since there are
no proper .debs anymore, you can only install it through snap, which also adds a layer of
complexity. This means two things: pinning to a version is not really possible because versions
disappear from the snap repositories, meaning that deploying new runner nodes becomes impossible once
this happens; and debugging can be difficult because the images that contain the software lack some
tools like strace. Honestly, we started pinning because lxd-4.10 broke something for us; sorry, I
can't remember what.
And then you have an old and deprecated OS like CentOS7, which we have to stick to because upgrading it to newer major versions is impossible; the only supported option is "reinstall". We know this very well: we spent oh-so-many man hours coming up with a script to automate the 6->7 upgrade, and yes, we should be facing the migration to 8 right now, but after what happened with CentOS post-8, we're even thinking of jumping ship. But I digress.
My first reaction was: LXD must have changed. But of course it didn't; snap obediently followed the
order of staying on 4.9. Next thing in my head was: it's a kernel issue. Unluckily I had purged the old
kernels, so I had to download them from the repos and install them, all by hand. It didn't matter: 21.10
refused to boot properly with the old kernel. In fact the system booted, but for some reason Xorg and the
wifi were not working, which meant my laptop became not only mostly useless, but also unfixable due to the
lack of internet to find causes and solutions.
The most annoying thing was that, from the way I was experiencing it, it mostly seemed like
terraform-provider-lxd couldn't properly talk to LXD: the provider was creating network resources
fine, but container resources never finished creating. Checking with lxc [5] I could see the containers
coming up, but it was as if the provider didn't see them there. It was not clear whether it couldn't talk
to LXD or something else was going on [6].
I spent some time trying to debug that, but as I said, neither LXD nor that provider had changed.
Decoupling the provider from terraform was impossible, so I was mostly debugging through two or three
layers, which feels like driving a car by giving instructions like "press the gas pedal" [1] to another
person by telephone. And then there's the issue of snap images not having tools, so at most you
can count on the server's own debuggability, which, in the case of LXD, seems fine for the API part,
but for the "I'm creating containers" part was frankly nonexistent. I really needed to see why the
containers were behaving differently now, because I got to the point where I was sure cloud-init was
not running, and then realized that not even systemd was running on them.
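That last check is quick, by the way; something along these lines, where c1 stands for whatever name terraform gave the container:

$ lxc list
$ lxc exec c1 -- ps -ef | head
# on a healthy VM-like container, PID 1 should be /usr/sbin/init (systemd),
# with journald, dbus, sshd and friends hanging from it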
So, how can we debug this? LXD is not that hard, but then you have the extra layer of snap. For that,
you will have to use these spells:
$ sudo snap stop lxd
$ sudo snap run --shell lxd
# cd $SNAP
# ./commands/lxd --debug
But the fact is, you won't get more information than if you run the client with --debug, because it
seems not to log any info about creating the containers themselves.
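In hindsight, another cheap thing to try before crawling into the snap is streaming the daemon's log and events from the client side; I can't say it shows more than --debug does, but it doesn't cost anything:

# stream the daemon's own log entries
$ lxc monitor --type=logging --pretty
# and the container lifecycle events (created, started, ...)
$ lxc monitor --type=lifecycle --pretty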
As for systemd, it's harder to debug because it can't assume there's anything there to support the logs,
so your only options are either a console-like device or its own internal journal. In this particular case
I tried the journal, but when I ran journalctl to see any messages, it complained about not finding any
log files and said the journal was empty. So, I used kmsg [2]:
CMD=[ "/usr/sbin/init", "--log-level=debug", "--log-target=kmsg" ]
(I'm cheating here, because this is when I was trying to use our other tool, the one based on docker, but it's good because it allowed me to provide those options. I'm not sure how to do the same with LXD).
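(If I had to guess, the LXD equivalent would be overriding the container's init command through the raw.lxc passthrough to LXC's lxc.init.cmd, something like the sketch below, with c1 again being a placeholder; I haven't verified it.)

# untested: pass the same debug options to the container's init via LXD
$ lxc config set c1 raw.lxc "lxc.init.cmd = /usr/sbin/init --log-level=debug --log-target=kmsg"
$ lxc restart c1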
This gave us the first glimpse of a solution:
[448984.949949] systemd[1]: Mounting cgroup to /sys/fs/cgroup/cpuset of type cgroup with options cpuset.
[448984.949963] systemd[1]: Failed to mount cgroup at /sys/fs/cgroup/cpuset: No such file or directory
systemd is complaining about some cgroup files missing. But checking my mount points I could clearly see
the cgroup2 filesystem [3] mounted:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
But systemd is right, those files are not there:
17:29 $ ls -l /sys/fs/cgroup/
-r--r--r-- 1 root root 0 Oct 22 15:43 cgroup.controllers
-rw-r--r-- 1 root root 0 Oct 22 15:43 cgroup.max.depth
-rw-r--r-- 1 root root 0 Oct 22 15:43 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Nov 4 12:21 cgroup.procs
-r--r--r-- 1 root root 0 Oct 22 15:43 cgroup.stat
-rw-r--r-- 1 root root 0 Nov 4 12:21 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Oct 22 15:43 cgroup.threads
-rw-r--r-- 1 root root 0 Oct 22 15:43 cpu.pressure
-r--r--r-- 1 root root 0 Oct 22 15:43 cpuset.cpus.effective
-r--r--r-- 1 root root 0 Oct 22 15:43 cpuset.mems.effective
-r--r--r-- 1 root root 0 Oct 22 15:43 cpu.stat
drwxr-xr-x 2 root root 0 Oct 22 15:43 dev-hugepages.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 dev-mqueue.mount
drwxr-xr-x 2 root root 0 Oct 22 16:04 init.scope
-rw-r--r-- 1 root root 0 Oct 22 15:43 io.cost.model
-rw-r--r-- 1 root root 0 Oct 22 15:43 io.cost.qos
-rw-r--r-- 1 root root 0 Oct 22 15:43 io.pressure
-r--r--r-- 1 root root 0 Oct 22 15:43 io.stat
drwxr-xr-x 2 root root 0 Oct 22 16:08 lxc.pivot
-r--r--r-- 1 root root 0 Oct 22 15:43 memory.numa_stat
-rw-r--r-- 1 root root 0 Oct 22 15:43 memory.pressure
-r--r--r-- 1 root root 0 Oct 22 15:43 memory.stat
-r--r--r-- 1 root root 0 Oct 22 15:43 misc.capacity
drwxr-xr-x 2 root root 0 Oct 22 16:04 -.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 proc-fs-nfsd.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 proc-sys-fs-binfmt_misc.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 sys-fs-fuse-connections.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 sys-kernel-config.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 sys-kernel-debug.mount
drwxr-xr-x 2 root root 0 Nov 3 13:21 sys-kernel-debug-tracing.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 sys-kernel-tracing.mount
drwxr-xr-x 154 root root 0 Nov 8 17:27 system.slice
drwxr-xr-x 3 root root 0 Oct 22 15:43 user.slice
"Maybe [an] incompaible kernel?" I asked my coleague. He suggested some options, but they were no longer relevant for the kernel version I was running. The next day (yes, this was a multi day endeavor!) I noticed our CI runners had different cgroups:
mdione@uk-sjohnson:~$ ls -l /sys/fs/cgroup/
dr-xr-xr-x 17 root root 0 Oct 28 23:24 blkio
lrwxrwxrwx 1 root root 11 Oct 28 23:24 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Oct 28 23:24 cpuacct -> cpu,cpuacct
dr-xr-xr-x 17 root root 0 Oct 28 23:24 cpu,cpuacct
dr-xr-xr-x 6 root root 0 Oct 28 23:24 cpuset
dr-xr-xr-x 17 root root 0 Oct 28 23:24 devices
dr-xr-xr-x 7 root root 0 Oct 28 23:24 freezer
dr-xr-xr-x 6 root root 0 Oct 28 23:24 hugetlb
dr-xr-xr-x 17 root root 0 Oct 28 23:24 memory
lrwxrwxrwx 1 root root 16 Oct 28 23:24 net_cls -> net_cls,net_prio
dr-xr-xr-x 6 root root 0 Oct 28 23:24 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Oct 28 23:24 net_prio -> net_cls,net_prio
dr-xr-xr-x 6 root root 0 Oct 28 23:24 perf_event
dr-xr-xr-x 17 root root 0 Oct 28 23:24 pids
dr-xr-xr-x 5 root root 0 Oct 28 23:24 rdma
dr-xr-xr-x 17 root root 0 Oct 28 23:24 systemd
dr-xr-xr-x 16 root root 0 Nov 4 13:50 unified
I checked the mount points on both systems and they both mentioned the same cgroup2 filesystem, but
looking again revealed the problem:
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
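A quicker way to tell the two setups apart than staring at mount tables is checking the filesystem type of the top directory:

# 21.10 laptop: prints cgroup2fs (pure cgroup v2, the "unified" hierarchy)
# 20.04 runner: prints tmpfs (v1 controllers mounted underneath, plus a "unified" subtree)
$ stat -fc %T /sys/fs/cgroup/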
21.10 is mounting cgroup2 but not any of the cgroup (v1) filesystems we could see on 20.04 (LTS), like
in the runner's mount listing above. It took me a while, but I found the explanation in the
release notes [4]:
systemd is being switched to the “unified” cgroup hierarchy (cgroup v2) by default. If for some reason you need to keep the legacy cgroup v1 hierarchy, you can select it via a kernel parameter at boot time: systemd.unified_cgroup_hierarchy=0
Which is exactly what my colleague had suggested. After one reboot, and some fixes due to having upgraded
most of our stack, both our LXD and docker based tools were booting old CentOS7's systemd without a
hitch.
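For reference, on an Ubuntu host booting with GRUB, applying the workaround boils down to something like this (the exact contents of GRUB_CMDLINE_LINUX_DEFAULT will of course differ per machine):

# in /etc/default/grub, append the parameter to the kernel command line, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
# then regenerate the GRUB config and reboot
$ sudo update-grub
$ sudo reboot
# after rebooting, /sys/fs/cgroup should be a tmpfs again, with the v1 controllers under it
$ stat -fc %T /sys/fs/cgroup/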
1. I wonder what we'll call it when and if most cars are electric.
2. Interestingly, the systemd-245 manpage I have on my brand new machine running KDE Neon has almost circular references for some options:
   --log-target= Set log target. See systemd.log_target above.
   systemd.log_target= Controls log output, with the same effect as the $SYSTEMD_LOG_TARGET environment variable.
   $SYSTEMD_LOG_TARGET systemd reads the log target from this environment variable. This can be overridden with --log-target=.
3. There's some foreshadowing here. Notice the difference between cgroup and cgroup2.
4. It's a shame that Ubuntu decided to move from their wiki to discourse for presenting this info: discourse does not provide a way to link to a section of the release notes, while the wiki does.
5. One can only wonder why they decided to give the CLI tool the same name as the underlying tech; that is, LXD vs lxc vs lxc-*, which is what LXC's CLI tools are called.
6. I think the problem is that the network interfaces are not configured, and maybe the provider tries to connect to them?