Notes From Non Booting Ubuntu Server
By hernil
Update 2024-03-21: This was probably due to this bug being triggered, effectively rendering the bpool dataset unusable by grub after snapshotting. The top response here might be a good starting point for working around this with ZFSBootMenu, which is what I ended up doing as well.
Original post:
Yesterday my server did not come back up. After a scheduled downtime due to a power outage in the building, the server was supposed to boot back up but never came online. Looking into it, it turned out it was stuck in the grub rescue menu, claiming it could not find the kernel listed. This was a 6.5 kernel that I had spotted being installed some time ago as part of the server being enrolled in the HWE kernel lifecycle.
My root setup was a “traditional” Ubuntu server installation with a ZFS root. This means we have an rpool zpool and a separate bpool that has a limited subset of ZFS features that grub is able to understand and boot from.
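For reference, the stock layout looks roughly like the (trimmed) listing below. This is from memory rather than from this exact box, and the ubuntu_XXXXXX suffix is generated at install time, so yours will differ.
# Illustrative only - trimmed output, dataset names from memory
zfs list -o name,mountpoint -r bpool rpool
# NAME                        MOUNTPOINT
# bpool                       /boot
# bpool/BOOT/ubuntu_XXXXXX    /boot
# rpool                       /
# rpool/ROOT/ubuntu_XXXXXX    /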
So I had a non-booting server unable to find the 6.5 kernels that grub thought were supposed to be there. A few reflections and things to note:
- I had removed the zsys packages that Ubuntu uses to manage ZFS snapshots and tie them into grub
  - This is because I use Sanoid to manage my snapshots and zsys was honestly getting in the way
- I had a “ZFS on root” setup, but without the ability to recover a broken system from a snapshot
- I did not know why grub couldn’t find the required kernel
  - Were they actually missing?
  - Were they there but not registered by grub - usually something wrong with the update-grub step?
  - Was there a problem with bpool so that grub could not access the kernels although everything else was “good”? i.e. had the limited feature set of bpool somehow been upgraded?
I had recently set up a new Ubuntu 22.04 server with ZFSBootMenu to properly enjoy the rollback flexibility of ZFS for exactly these scenarios. So with that in mind:
- What was the quickest way to get this back up and running?
- What is my long term plan to avoid this kind of thing again?
- Did the answers to those two questions overlap at all?
ZFSBootMenu take 1
I created a USB drive with a portable image of ZFSBootMenu as outlined in the docs and got that booted. ZBM could see my rpool, as it looks for datasets with a mountpoint of / (root), but could not boot it as /boot was on another zpool - a configuration explicitly not supported by ZBM. At least I could see that everything seemed fine and healthy for my pools.
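For the curious, creating the stick boils down to something like the following. The download URL and device name are what I recall from the docs, so treat this as a sketch and verify against the current ZBM documentation; /dev/sdX1 is a placeholder.
# Fetch the prebuilt ZBM EFI executable and drop it on a FAT32 EFI partition on the stick.
# Double-check the device before formatting anything!
curl -L -o zfsbootmenu.EFI https://get.zfsbootmenu.org/efi
sudo mkfs.vfat -F 32 /dev/sdX1
sudo mount /dev/sdX1 /mnt
sudo mkdir -p /mnt/EFI/BOOT
sudo cp zfsbootmenu.EFI /mnt/EFI/BOOT/BOOTX64.EFI
sudo umount /mnt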
Ubuntu ISO booting
I then booted an installation ISO of Ubuntu server to see if I could chroot and fix things from there. chroot with a ZFS install is slightly different from doing it with traditional file systems, but this guide here does a great job of walking you through it. I did run into a few discrepancies, especially with regards to mounting bpool into /boot correctly: I ended up with a nested /boot/boot folder structure. This was probably the reason why I couldn’t recover properly using this method, although I did try reinstalling the kernel and messing around with grub a bit.
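For reference, the procedure boils down to something like this. Dataset names match my install; this is a from-memory sketch of the guide, not a verified recipe.
# From the live ISO: import both pools under /mnt without auto-mounting,
# mount the root and boot datasets, bind the virtual filesystems, then chroot.
sudo zpool import -N -R /mnt rpool
sudo zpool import -N -R /mnt bpool
sudo zfs mount rpool/ROOT/ubuntu_dsjah      # lands on /mnt
sudo zfs mount bpool/BOOT/ubuntu_dsjah      # should land on /mnt/boot, not /mnt/boot/boot
for d in proc sys dev; do sudo mount --rbind /$d /mnt/$d; done
sudo chroot /mnt /bin/bash
# inside the chroot: reinstall the kernel, run update-grub, etc.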
Sorting out the mounting would probably have worked just fine in the end. But honestly I never really liked grub, as the whole “post-install hook to generate config” step on kernel upgrades seems brittle to me when running a somewhat unconventional setup. Also, the grub rescue console has never really helped me much beyond confirming that I could not boot. I probably should have taken the time to learn grub better, but booting a server is honestly not something I want to think about more than the strict minimum.
So after validating that there did not seem to be any 6.5 kernels in my bpool, although a couple of 6.2 kernels were floating around, I decided to double down on ZBM. At least it seemed to confirm that something had happened in the “upgrade kernel and hook the grub config into that” step - although I would have expected there to be newer kernels available that grub was not aware of, not older kernels while grub expected newer ones.
ZFSBootMenu take 2
I decided that since I seemed to have kernels available - just not the ones grub expected - ZBM might actually bypass the whole problem for me if those kernels were just available in rpool. Therefore I mounted both pools and simply copied the contents of bpool into a new /boot folder in the rpool/ROOT/ubuntu_dsjah dataset.
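Roughly like this, again from the live ISO, and again a sketch rather than a verified recipe; the /oldboot mountpoint is just a throwaway choice to keep the old boot dataset out of the way during the copy.
sudo zpool import -N -R /mnt rpool
sudo zpool import -N -R /mnt bpool
sudo zfs mount rpool/ROOT/ubuntu_dsjah                     # root dataset on /mnt
sudo zfs set mountpoint=/oldboot bpool/BOOT/ubuntu_dsjah   # park the old boot dataset elsewhere
sudo zfs mount bpool/BOOT/ubuntu_dsjah                     # ends up on /mnt/oldboot
sudo mkdir -p /mnt/boot
sudo rsync -a /mnt/oldboot/ /mnt/boot/                     # copy kernels/initrds into rpool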
Then I set the following:
zfs set canmount=off bpool
zfs set canmount=off bpool/BOOT/ubuntu_dsjah
zfs set mountpoint=none bpool
zfs set mountpoint=none bpool/BOOT/ubuntu_dsjah
This pretty much disables the bpool altogether: it will no longer mount, and nothing will attempt to mount it at /boot.
This I did while still booted to the Ubuntu ISO.
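A quick sanity check that the properties actually took (illustrative):
zfs get -r canmount,mountpoint bpool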
Then I rebooted to the ZBM flash drive; it found my rpool again and let me attempt to boot one of the 6.2 kernels I had observed earlier.
Ubuntu rescue
This time I was thrown into an Ubuntu rescue shell. Turns out this was because my /etc/fstab contained references to mounting the old bpool to /boot, which, due to the disabling above, was impossible. Commenting out the entries in fstab and rebooting did the trick.
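From memory, the offending entry looked something like the line below; the exact dataset name and options on your system will differ.
# /etc/fstab - commented out so the boot stops trying to mount the now-disabled bpool
# bpool/BOOT/ubuntu_dsjah  /boot  zfs  defaults  0  0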
Docker issues
I got back into the system, which is mainly a file server and Docker host, and found that Docker did not start. A few attempts to reinstall, restart, etc. yielded no progress. Turns out it complained about something related to iptables - a Linux firewall / routing module. The problem turned out to be that the iptables kernel modules were nowhere to be found. This might be due to how they are initially loaded from vmlinuz images at boot or something - I honestly don’t know for sure.
A few attempts to get them back ended in a failed boot due to not finding the ZFS kernel module. Although this time I could just roll back to a working ZFS snapshot thanks to ZBM. In the end, running this one-liner did the trick.
sudo dpkg -l | awk '{print $2}' | grep linux-modules | grep 6.2 | xargs sudo apt-get install --reinstall
RUN AT OWN RISK. Loosely, it finds every linux-modules package dpkg expects to be there, filters for the 6.2 kernel as that is what I run now and care about, and reinstalls them.
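Afterwards, a quick check that the modules are actually loadable again before blaming Docker for anything else (module and service names are the usual Ubuntu ones):
sudo modprobe ip_tables            # should now succeed instead of complaining about a missing module
lsmod | grep ip_tables
sudo systemctl restart docker
sudo docker ps                     # containers back up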
Status
In the end we got everything back up after a prolonged downtime. The install is now a franken ubuntu-on-zfs/ubuntu-on-zbm install which I wouldn’t exactly trust to upgrade very well in the future. So I’ll definitely schedule a migration to a proper ZBM setup one of these days.
I’m also not quite sure what went wrong initially. But this, together with mismatches between the ZFS userspace tools and kernel versions, has somewhat put me off the HWE kernel stack unless I really need it. Some time this summer I’ll get my systems up on 24.04 and stay there without any HWE unless there are some very compelling reasons.
Reply via email!