CentOS7 systemd not booting on LXD or docker containers on Ubuntu Impish
A few days after it came out, I upgraded from (K)Ubuntu 21.04 to 21.10. Yes, I am such a junkie.
It all seemed fine, at least desktop-wise, until I had to run some tests for the thing I develop at
$WORK. See, we use VM-like containers on top of both Docker and LXD. The images that these build
upon are some variation of CentOS7.x, because that's what we use as our product's OS. When I say
VM-like, I mean that inside runs a mostly full OS, from systemd up, not just a single process,
which is what Docker recommends. If you didn't guess from the title, the issue was that
the containers were not booting properly.
To complicate things, we actually have quite a deep stack, which we try to keep as stable as possible.
On top of it all, we have a Python+bash layer that allows us to describe things in terms of our product.
Right below this we have Terraform, and to talk to LXD we have, of course, terraform-provider-lxd.
LXD itself seems to be a mix of a CLI, an API server, and maybe LXC underneath? And since there are
no proper .debs anymore, you can only install it through snap, which also adds a layer of
complexity. This means two things: pinning to a version is not really possible because versions
disappear from the snap repositories, meaning that deploying new runner nodes becomes impossible once
this happens; and debugging can be difficult because the images that contain the software lack some
tools like strace. Honestly, we started pinning because lxd-4.10 broke something for us; sorry, I
can't remember what.
And then you have an old and deprecated OS like CentOS7, which we have to stick to because upgrading it to newer major versions is impossible; the only supported option is "reinstall". We know this very well: we spent oh-so-many man hours coming up with a script to automate the 6->7 upgrade, and yes, we should be facing the migration to 8 right now, but after what happened with CentOS post-8, we're even thinking of jumping ship. But I digress.
My first reaction was: LXD must have changed. But of course it didn't; snap obediently followed the
order of staying on 4.9. Next thing in my head was: it's a kernel issue. Unluckily I had purged the old
kernels, so I had to download them from the repos and install them, all by hand. It didn't matter: 21.10
refused to boot properly with the old kernel. In fact the system booted, but for some reason Xorg and the
wifi were not working, which meant my laptop became not only mostly useless, but also unfixable due to the
lack of internet to find causes and solutions.
The most annoying thing was that, from the way I was experiencing it, it mostly seemed like
terraform-provider-lxd couldn't properly talk to LXD: the provider was creating network resources
fine, but container resources never finished creating. Checking with lxc [5] I could see the containers
coming up, but it was as if the provider didn't see them there. It was not clear whether it couldn't talk
to LXD or something else was going on [6].
I spent some time trying to debug that, but as I said, neither LXD nor that provider had changed.
Decoupling the provider from terraform was impossible, so I was mostly debugging through two or three
layers, which feels like driving a car by giving instructions like "press the gas pedal" [1] to another
person by telephone. And then there's the issue of snap images not having tools, so at most you
can count on the server's own debuggability, which, in the case of LXD, seems fine for the API part,
but for the "I'm creating containers" part was frankly nonexistent. I really needed to see why the
containers were behaving differently now, because I got to the point where I was sure cloud-init was
not running, and then realized that not even systemd was running on them.
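That last check is quick, by the way; something along these lines, where c1 stands for whatever name terraform gave the container:

$ lxc list
$ lxc exec c1 -- ps -ef | head
# on a healthy VM-like container, PID 1 should be /usr/sbin/init (systemd),
# with journald, dbus, sshd and friends hanging from it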
So, how can we debug this? LXD is not that hard, but then you have the extra layer of snap. For that,
you will have to use these spells:
$ sudo snap stop lxd
$ sudo snap run --shell lxd
# cd $SNAP
# ./commands/lxd --debug
But the fact is, you won't get more information than if you run the client with --debug, because it
seems not to log any info about creating the containers themselves.
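In hindsight, another cheap thing to try before crawling into the snap is streaming the daemon's log and events from the client side; I can't say it shows more than --debug does, but it doesn't cost anything:

# stream the daemon's own log entries
$ lxc monitor --type=logging --pretty
# and the container lifecycle events (created, started, ...)
$ lxc monitor --type=lifecycle --pretty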
As for systemd, it's harder to debug because it can't assume there's anything there to support the logs,
so your only options are either a console-like device or its own internal journal. In this particular case
I tried the journal, but when I ran journalctl to see any messages, it complained about not finding any
log files and said the journal was empty. So, I used kmsg [2]:
CMD=[ "/usr/sbin/init", "--log-level=debug", "--log-target=kmsg" ]
(I'm cheating here, because this is when I was trying to use our other tool, the one based on docker, but it's good because it allowed me to provide those options. I'm not sure how to do the same with LXD).
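(If I had to guess, the LXD equivalent would be overriding the container's init command through the raw.lxc passthrough to LXC's lxc.init.cmd, something like the sketch below, with c1 again being a placeholder; I haven't verified it.)

# untested: pass the same debug options to the container's init via LXD
$ lxc config set c1 raw.lxc "lxc.init.cmd = /usr/sbin/init --log-level=debug --log-target=kmsg"
$ lxc restart c1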
This gave us the first glimpse of a solution:
[448984.949949] systemd[1]: Mounting cgroup to /sys/fs/cgroup/cpuset of type cgroup with options cpuset.
[448984.949963] systemd[1]: Failed to mount cgroup at /sys/fs/cgroup/cpuset: No such file or directory
systemd is complaining about some cgroup files missing. But checking my mount points I could clearly see
the cgroup2 filesystem [3] mounted:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
But systemd is right, those files are not there:
17:29 $ ls -l /sys/fs/cgroup/
-r--r--r-- 1 root root 0 Oct 22 15:43 cgroup.controllers
-rw-r--r-- 1 root root 0 Oct 22 15:43 cgroup.max.depth
-rw-r--r-- 1 root root 0 Oct 22 15:43 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Nov 4 12:21 cgroup.procs
-r--r--r-- 1 root root 0 Oct 22 15:43 cgroup.stat
-rw-r--r-- 1 root root 0 Nov 4 12:21 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Oct 22 15:43 cgroup.threads
-rw-r--r-- 1 root root 0 Oct 22 15:43 cpu.pressure
-r--r--r-- 1 root root 0 Oct 22 15:43 cpuset.cpus.effective
-r--r--r-- 1 root root 0 Oct 22 15:43 cpuset.mems.effective
-r--r--r-- 1 root root 0 Oct 22 15:43 cpu.stat
drwxr-xr-x 2 root root 0 Oct 22 15:43 dev-hugepages.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 dev-mqueue.mount
drwxr-xr-x 2 root root 0 Oct 22 16:04 init.scope
-rw-r--r-- 1 root root 0 Oct 22 15:43 io.cost.model
-rw-r--r-- 1 root root 0 Oct 22 15:43 io.cost.qos
-rw-r--r-- 1 root root 0 Oct 22 15:43 io.pressure
-r--r--r-- 1 root root 0 Oct 22 15:43 io.stat
drwxr-xr-x 2 root root 0 Oct 22 16:08 lxc.pivot
-r--r--r-- 1 root root 0 Oct 22 15:43 memory.numa_stat
-rw-r--r-- 1 root root 0 Oct 22 15:43 memory.pressure
-r--r--r-- 1 root root 0 Oct 22 15:43 memory.stat
-r--r--r-- 1 root root 0 Oct 22 15:43 misc.capacity
drwxr-xr-x 2 root root 0 Oct 22 16:04 -.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 proc-fs-nfsd.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 proc-sys-fs-binfmt_misc.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 sys-fs-fuse-connections.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 sys-kernel-config.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 sys-kernel-debug.mount
drwxr-xr-x 2 root root 0 Nov 3 13:21 sys-kernel-debug-tracing.mount
drwxr-xr-x 2 root root 0 Oct 22 15:43 sys-kernel-tracing.mount
drwxr-xr-x 154 root root 0 Nov 8 17:27 system.slice
drwxr-xr-x 3 root root 0 Oct 22 15:43 user.slice
"Maybe [an] incompaible kernel?" I asked my coleague. He suggested some options, but they were no longer relevant for the kernel version I was running. The next day (yes, this was a multi day endeavor!) I noticed our CI runners had different cgroups:
mdione@uk-sjohnson:~$ ls -l /sys/fs/cgroup/
dr-xr-xr-x 17 root root 0 Oct 28 23:24 blkio
lrwxrwxrwx 1 root root 11 Oct 28 23:24 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Oct 28 23:24 cpuacct -> cpu,cpuacct
dr-xr-xr-x 17 root root 0 Oct 28 23:24 cpu,cpuacct
dr-xr-xr-x 6 root root 0 Oct 28 23:24 cpuset
dr-xr-xr-x 17 root root 0 Oct 28 23:24 devices
dr-xr-xr-x 7 root root 0 Oct 28 23:24 freezer
dr-xr-xr-x 6 root root 0 Oct 28 23:24 hugetlb
dr-xr-xr-x 17 root root 0 Oct 28 23:24 memory
lrwxrwxrwx 1 root root 16 Oct 28 23:24 net_cls -> net_cls,net_prio
dr-xr-xr-x 6 root root 0 Oct 28 23:24 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Oct 28 23:24 net_prio -> net_cls,net_prio
dr-xr-xr-x 6 root root 0 Oct 28 23:24 perf_event
dr-xr-xr-x 17 root root 0 Oct 28 23:24 pids
dr-xr-xr-x 5 root root 0 Oct 28 23:24 rdma
dr-xr-xr-x 17 root root 0 Oct 28 23:24 systemd
dr-xr-xr-x 16 root root 0 Nov 4 13:50 unified
I checked the mount points on both systems and they both mentioned the same cgroup2 filesystem, but
looking again revealed the problem:
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
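A quicker way to tell the two setups apart than staring at mount tables is checking the filesystem type of the top directory:

# 21.10 laptop: prints cgroup2fs (pure cgroup v2, the "unified" hierarchy)
# 20.04 runner: prints tmpfs (v1 controllers mounted underneath, plus a "unified" subtree)
$ stat -fc %T /sys/fs/cgroup/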
21.10 is mounting cgroup2 but not any of the cgroup (v1) filesystems we could see on 20.04 (LTS), like
in the runner's mount listing above. It took me a while, but I found the explanation in the
release notes [4]:
systemd is being switched to the “unified” cgroup hierarchy (cgroup v2) by default. If for some reason you need to keep the legacy cgroup v1 hierarchy, you can select it via a kernel parameter at boot time: systemd.unified_cgroup_hierarchy=0
Which is exactly what my colleague had suggested. After one reboot, and some fixes due to having upgraded
most of our stack, both our LXD and docker based tools were booting old CentOS7's systemd without a
hitch.
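For reference, on an Ubuntu host booting with GRUB, applying the workaround boils down to something like this (the exact contents of GRUB_CMDLINE_LINUX_DEFAULT will of course differ per machine):

# in /etc/default/grub, append the parameter to the kernel command line, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
# then regenerate the GRUB config and reboot
$ sudo update-grub
$ sudo reboot
# after rebooting, /sys/fs/cgroup should be a tmpfs again, with the v1 controllers under it
$ stat -fc %T /sys/fs/cgroup/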
1. I wonder what we'll call it when and if most cars are electric.
2. Interestingly, the systemd-245 manpage I have on my brand new machine running KDE Neon has almost circular references for some options:
   --log-target= Set log target. See systemd.log_target above.
   systemd.log_target= Controls log output, with the same effect as the $SYSTEMD_LOG_TARGET environment variable.
   $SYSTEMD_LOG_TARGET systemd reads the log target from this environment variable. This can be overridden with --log-target=.
3. There's some foreshadowing here. Notice the difference between cgroup and cgroup2.
4. It's a shame that Ubuntu decided to move from their wiki to discourse for presenting this info: discourse does not provide a way to link to a section of the release notes, while the wiki does.
5. One can only wonder why they decided to give the CLI tool the same name as the underlying tech; that is, LXD vs lxc vs lxc-*, which is what LXC's CLI tools are called.
6. I think the problem is that the network interfaces are not configured, and maybe the provider tries to connect to them?