Atomic write works by first writing a temporary file, then syncing that
temporary file to ensure it is fully on disk before the program can
continue, and in the last step renaming the temporary file to the
target. The middle step was missing, which is likely to lead to a
truncated target file being present after power loss. Add this step.
Furthermore, even with this fix, atomicity is not fully guaranteed,
because FAT32 can become corrupted after power loss due to its design
shortcomings. Even though we cannot really do anything about this case,
adjust the comment to at least acknowledge the situation.
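The three steps described above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code; the function name `atomic_write` and the `.tmp` suffix are assumptions made for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `contents` to `target` via a temporary sibling file.
/// `atomic_write` is a hypothetical helper, not the project's API.
pub fn atomic_write(target: &Path, contents: &[u8]) -> io::Result<()> {
    // Keep the temporary file next to the target so that the rename
    // stays within one file system.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    // Step 1: write the temporary file.
    let mut file = File::create(&tmp)?;
    file.write_all(contents)?;

    // Step 2: ask the OS to flush data and metadata to disk before
    // continuing. This is the step that was missing.
    file.sync_all()?;
    drop(file);

    // Step 3: rename the temporary file to the target.
    fs::rename(&tmp, target)
}
```

Even with the sync in place, the caveat about FAT32 still applies: the sketch narrows the window for a truncated target, it does not remove every failure mode.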
Since most files (stubs, kernels and initrds) on the ESP are properly
input-addressed or content-addressed now, there is no point in
overwriting them any more. Hence we detect what generations are already
properly installed, and don't reinstall them any more.
This approach leads to two distinct improvements:
* Rollbacks are more reliable, because initrd secrets and stubs do not
change any more for existing generations (with the necessary exception
of stubs in case of signature key rotation). In particular, the risk
of a newer stub breaking (for example, because of bad interactions
with certain firmware) old and previously working generations is
avoided.
* Kernels and initrds that are not going to be (re)installed anyway are
not read and hashed any more. This significantly reduces the I/O and
CPU time required for the installation process, particularly when
there is a large number of generations.
The following drawbacks are noted:
* The first time installation is performed after these changes, most of
the ESP is re-written at a different path; as a result, disk usage
roughly doubles until garbage collection is performed.
* If multiple generations share a bare initrd, but have different
secrets scripts, the final initrds will now be separated, leading to
increased disk usage. However, this situation should be rare, and the
previous behavior was arguably incorrect anyway.
* If the files on the ESP are corrupted, running the installation again
will not overwrite them with the correct versions. Since the files are
written atomically, this situation should not happen except in case of
file system corruption, and it is questionable whether overwriting
really fixes the problem in this case.
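As a rough illustration of the detection step, the sketch below treats a generation as installed when all of its expected files exist on the ESP. The type `GenerationFiles` and both function names are hypothetical, and the real check may look at more than file existence.

```rust
use std::path::PathBuf;

/// Files that one generation expects to find on the ESP.
/// The type and its fields are illustrative only.
pub struct GenerationFiles {
    pub generation: u64,
    pub paths: Vec<PathBuf>,
}

/// A generation needs (re)installation if any of its files is missing.
pub fn needs_install(files: &GenerationFiles) -> bool {
    !files.paths.iter().all(|path| path.exists())
}

/// Keep only the generations that still have to be installed.
pub fn pending(all: Vec<GenerationFiles>) -> Vec<GenerationFiles> {
    all.into_iter().filter(needs_install).collect()
}
```

Because the check never opens the files, it also reflects the third drawback: a corrupted file that is still present is not noticed.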
The stubs on the ESP are now input-addressed, where the inputs are the
system toplevel and the public key used for signature. This way, it is
guaranteed that any stub at a given path will boot the desired system,
even in the presence of one of the two edge-cases where it was not
previously guaranteed:
* The latest generation was deleted at one point, and its generation
number was reused by a different system configuration. This is
detected because the toplevel will change.
* The secure boot signing key was rotated, so old stubs would not boot
at all any more. This is detected because the public key will change.
Avoiding these two cases makes it possible to skip reinstalling stubs
that are already in place at the correct path.
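To make the idea of input-addressing concrete, here is a small sketch that derives a stub path from the two inputs named above. It is not the project's implementation: FNV-1a is used only as a dependency-free stand-in for a cryptographic hash, and the function name, the `EFI/Linux` directory and the `nixos-` prefix are assumptions for the example.

```rust
use std::path::{Path, PathBuf};

/// FNV-1a, a stand-in for a real cryptographic hash in this sketch.
fn fnv1a(state: u64, bytes: &[u8]) -> u64 {
    bytes.iter().fold(state, |hash, byte| {
        (hash ^ u64::from(*byte)).wrapping_mul(0x0000_0100_0000_01b3)
    })
}

/// Derive the stub path from the system toplevel and the public key.
/// Name and directory layout are hypothetical.
pub fn stub_path(esp: &Path, toplevel: &Path, public_key: &[u8]) -> PathBuf {
    let toplevel = toplevel.to_string_lossy();
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    // Prefix the first input with its length so that the boundary
    // between the two inputs is unambiguous.
    hash = fnv1a(hash, &(toplevel.len() as u64).to_le_bytes());
    hash = fnv1a(hash, toplevel.as_bytes());
    hash = fnv1a(hash, public_key);
    esp.join("EFI/Linux").join(format!("nixos-{hash:016x}.efi"))
}
```

If either the toplevel or the public key changes, the path changes with it, so a stub found at the expected path was built from the expected inputs (up to hash collisions, which is why a real implementation would want a cryptographic hash).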
Kernels and initrds on the ESP are now content-addressed. By definition,
it is impossible for two different kernels or initrds to ever end up at
the same place, even in the presence of changing initrd secrets or other
unreproducibility.
The basic advantage of this is that installing the kernel or initrd for
a generation can never break another generation. In turn, this enables
the following two improvements:
* All generations can be installed independently. In particular, the
installation can be performed in one pass, one generation at a time.
As a result, the code is significantly simplified, and memory usage
(due to the temporary files) does not grow with the number of
generations any more.
* Generations that already have their files in place on the ESP do not
need to be reinstalled. This will be taken advantage of in a
subsequent commit.
Architecture is now a generic structure that can be specialized
via an "external" trait for generating the paths you care about
depending on your target bootloader.
systemd-boot is now installed once for all generations rather than once
per generation.
This means it is not really possible to manage different systems on the
same "machine", which is a very obscure use case: theoretically
possible, but not yet encountered.
We fail hard if we encounter different architectures in the bootspecs.
This should still be compatible with cross-compiling systems in the future.
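The following sketch shows one way such a split could look: an architecture type that knows nothing about bootloaders, a separate trait that adds systemd-boot file names, and a check that rejects mixed architectures. All names here (`Architecture`, `SystemdBootPaths`, `single_architecture`) are illustrative assumptions and not taken from the code base.

```rust
use std::path::PathBuf;

/// Bootloader-independent architecture description (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Architecture {
    X86,
    X64,
    AArch64,
}

impl Architecture {
    /// Suffix used in UEFI file names.
    pub fn efi_suffix(self) -> &'static str {
        match self {
            Self::X86 => "ia32",
            Self::X64 => "x64",
            Self::AArch64 => "aa64",
        }
    }

    pub fn from_system(system: &str) -> Option<Self> {
        match system {
            "i686-linux" => Some(Self::X86),
            "x86_64-linux" => Some(Self::X64),
            "aarch64-linux" => Some(Self::AArch64),
            _ => None,
        }
    }
}

/// "External" trait that adds the paths one bootloader cares about.
pub trait SystemdBootPaths {
    fn systemd_boot_filename(&self) -> PathBuf;
    fn fallback_filename(&self) -> PathBuf;
}

impl SystemdBootPaths for Architecture {
    fn systemd_boot_filename(&self) -> PathBuf {
        PathBuf::from(format!("systemd-boot{}.efi", self.efi_suffix()))
    }

    fn fallback_filename(&self) -> PathBuf {
        PathBuf::from(format!("BOOT{}.EFI", self.efi_suffix().to_uppercase()))
    }
}

/// Fail hard if the generations do not all share one architecture.
pub fn single_architecture(systems: &[&str]) -> Result<Architecture, String> {
    let mut found: Option<Architecture> = None;
    for system in systems {
        let arch = Architecture::from_system(system)
            .ok_or_else(|| format!("unsupported system: {system}"))?;
        match found {
            Some(previous) if previous != arch => {
                return Err(format!("mixed architectures: {previous:?} and {arch:?}"));
            }
            _ => found = Some(arch),
        }
    }
    found.ok_or_else(|| "no generations given".to_string())
}
```

Another backend could implement its own trait for the same `Architecture` type without touching the systemd-boot one.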
This generates an `lzbt-systemd` binary instead of `lzbt`, which uses a
special systemd-specific entrypoint.
This is part of the effort to enable multiple backends.
Bootspec has a mechanism called synthesis where you can synthesize
bootspecs if they are not present based on the generation link only.
This is useful for "vanilla bootspec" which does not contain any
extensions, as this is what we do right now.
If we need extensions, we can also implement our own synthesis mechanism
on top of it.
Enabling synthesis gives us the superpower to support non-bootspec
users. :-)
The message about malformed generations should semantically be a
warning. However, since users might have hundreds of old and thus
malformed generations and can do little about it, this should remain a
debug message. This way the user is not spammed with no-op warnings,
while debugging remains possible.
lzbt currently happily nukes all boot entries if it can't parse any
bootspecs. With the upcoming incompatible bootspec change, this might
be a problem that's worth avoiding. :)
I changed lzbt to fail hard in case it can't generate any boot
items.
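A minimal sketch of such a guard is shown below. The function name `collect_bootable` and the error text are assumptions; the point is only that an empty result becomes an error before anything on the ESP is touched.

```rust
/// Keep the successfully parsed entries and refuse to continue if none
/// are left. `collect_bootable` is a hypothetical helper.
pub fn collect_bootable<T, E>(parsed: Vec<Result<T, E>>) -> Result<Vec<T>, String> {
    let entries: Vec<T> = parsed.into_iter().filter_map(Result::ok).collect();
    if entries.is_empty() {
        // Failing here leaves the existing boot entries in place.
        Err("no bootable generations found, refusing to remove existing boot entries".to_string())
    } else {
        Ok(entries)
    }
}
```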
Due to the use of hash maps, the order of file installation was not
deterministic. I've changed the code to use BTreeMaps instead, which
makes this deterministic. While I was here, I tried to simplify the
code a bit.
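The effect can be illustrated with a small sketch. A `BTreeMap` iterates in key order, so keying the map by destination path yields the same installation order on every run and drops duplicate destinations as a side effect. The function name `plan_installs` and the choice of key are assumptions for the example.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

/// Turn `(destination, source)` pairs into a deterministic install plan.
/// For a destination that occurs more than once, the last source wins.
pub fn plan_installs(
    pairs: impl IntoIterator<Item = (PathBuf, PathBuf)>,
) -> Vec<(PathBuf, PathBuf)> {
    // BTreeMap keeps its keys sorted, unlike HashMap, whose iteration
    // order can differ between runs.
    let plan: BTreeMap<PathBuf, PathBuf> = pairs.into_iter().collect();
    plan.into_iter().collect()
}
```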
To minimize writes to the ESP but still find necessary changes, compare
the hashes of the files on the ESP with the "expected" hashes. Only copy
and overwrite already existing files if the hashes don't match. This
ensures a working-as-expected state on the ESP as opposed to previously
where already existing files were just ignored.
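A sketch of the comparison, under the assumption that both hashes are computed in the same run: the standard library's `DefaultHasher` stands in for a cryptographic hash here, and `install_if_changed` is a hypothetical name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::Path;

/// Hash a file's contents. DefaultHasher is only a stand-in for a
/// cryptographic hash and is only compared within a single run.
fn file_hash(path: &Path) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    hasher.write(&fs::read(path)?);
    Ok(hasher.finish())
}

/// Copy `source` to `dest` unless `dest` already has the expected hash.
/// Returns whether a write to the ESP was necessary.
pub fn install_if_changed(source: &Path, dest: &Path) -> io::Result<bool> {
    if dest.exists() && file_hash(source)? == file_hash(dest)? {
        return Ok(false);
    }
    fs::copy(source, dest)?;
    Ok(true)
}
```

The return value makes it easy to log or count how many files were actually written.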
Previously, generations were installed one after another. Now all
artifacts (kernels, initrd etc.) are first collected and then installed.
This way the writes to the ESP are reduced as duplicate paths are
already removed in the collection phase.
Using random names for tempfiles makes handling them easier. It reduces
the amount of noise in the code because no custom name needs to be
provided for each tempfile. The names were not really useful in any
case.
It also does not burden the developer with ensuring uniqueness of names.
This is relevant when files for multiple generations need to be stored
in the same directory (e.g. because they need to be accessed after
handling one generation).
Out of an abundance of caution, 32 random alphanumeric characters are
chosen for each filename. The tempfile crate, in comparison, only
chooses 8. 32 characters should be enough to avoid collisions, even
if the PRNG is not of cryptographic quality.
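For illustration, a 32-character alphanumeric name can be generated with the standard library alone, as sketched below. This is not how the project does it: the seed comes from `RandomState`, which the standard library initializes with random keys, and a plain xorshift generator stretches it. Both are assumptions chosen to keep the example free of external crates, and neither is of cryptographic quality.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

const ALPHANUMERIC: &[u8] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

/// Generate a file name of 32 random alphanumeric characters.
/// `random_filename` is a hypothetical helper.
pub fn random_filename() -> String {
    // Every RandomState carries fresh keys, so hashing nothing already
    // yields an unpredictable seed. `| 1` avoids the all-zero state
    // that xorshift cannot leave.
    let mut state = RandomState::new().build_hasher().finish() | 1;
    (0..32)
        .map(|_| {
            // xorshift64 step.
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            // The modulo introduces a slight bias, which does not
            // matter for collision avoidance.
            ALPHANUMERIC[(state % ALPHANUMERIC.len() as u64) as usize] as char
        })
        .collect()
}
```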
Leverage the bootspec `label` field in its intended way. The VERSION_ID
of the os-release in the stub now only contains the generation number
and the build time. This makes a correct PRETTY_NAME entirely dependent
on correct information in the bootspec `label` field.
Read the build time from generation symlinks in /nix/var/nix/profiles
instead of from the underlying derivation. The derivation build time
will always be a UNIX epoch of 0 because of the `nix-build` sandbox,
which is useless for identifying when a generation was created.
Malicious boot loader specification entries could be used to make a
signed kernel load arbitrary unprotected initrds. Since we do not want
this, do not sign the kernel. This way, the only things allowed to boot
are our UKI stubs, which do verify the initrd.
To minimize the number of arguments passed to `lzbt`, the loader config
is assembled outside `lzbt` and passed as a single argument.
Instead of reimplementing `consoleMode` under the `lanzaboote`
namespace, `config.loader.systemd-boot.consoleMode` is reused as is.
To minimize the potential for unrecoverable errors, only atomic writes
to the ESP are performed. This is implemented by first copying the file
to the destination with a `.tmp` suffix and then renaming it to the
final destination. This is atomic because the rename operation is atomic
on POSIX platforms.
Specifically, this means that even if the system crashes during the
operation, the final destination path will most likely be intact if it
exists at all. There are some nuances to this, however: it **cannot**
actually be guaranteed that the operation was performed at the
filesystem level. Still, this is the best we can do for now.
For reference:
- POSIX rename(2): https://pubs.opengroup.org/onlinepubs/9699919799/
- Rust fs::rename corresponds to rename(2) on Unix: https://doc.rust-lang.org/std/fs/fn.rename.html
- Rust fs::rename is implemented using libc's rename: https://github.com/rust-lang/rust/blob/master/library/std/src/sys/unix/fs.rs#L1397
- Renaming in libc is atomic: https://www.gnu.org/software/libc/manual/html_node/Renaming-Files.html
To make handling systemd versions more robust, they are parsed into a
u32 tuple instead of an f32. Additionally, some unit tests for correct
parsing and comparing of versions are added.
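The sketch below shows why an integer pair is more robust than a float: read as an `f32`, "252.10" becomes 252.1 and sorts before 252.9, while as a tuple it sorts after it. The type name, the accepted input format (`major` or `major.minor`) and the treatment of a missing minor version as 0 are assumptions for the example.

```rust
/// A systemd version as two integers. The derived ordering compares
/// `major` first and `minor` second.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SystemdVersion {
    pub major: u32,
    pub minor: u32,
}

impl SystemdVersion {
    /// Parse "252" or "252.10". Returns None for anything else.
    pub fn parse(input: &str) -> Option<Self> {
        let mut parts = input.trim().splitn(2, '.');
        let major = parts.next()?.parse().ok()?;
        let minor = match parts.next() {
            Some(minor) => minor.parse().ok()?,
            // Assumption: a missing minor version counts as 0.
            None => 0,
        };
        Some(Self { major, minor })
    }
}
```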
The process of installing systemd-boot is "smarter" because it now
considers a few conditions instead of doing nothing if there is a file
at the destination path. systemd-boot is now forcibly installed (i.e.
overwriting any file at the destination) if (1) there is no file at the
destination, OR (2) a newer version of systemd-boot is available, OR (3)
the signature of the file at the destination could not be verified.
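The three conditions can be written down as a single predicate. In this sketch the inputs are assumed to be already known: `installed` is the version found at the destination (or `None` if there is no file), `available` is the version to be installed, and `signature_valid` is the result of the signature check. The function name and parameter shapes are hypothetical.

```rust
/// Decide whether systemd-boot should be (re)installed.
pub fn should_install(
    installed: Option<(u32, u32)>,
    available: (u32, u32),
    signature_valid: bool,
) -> bool {
    match installed {
        // (1) There is no file at the destination.
        None => true,
        // (2) A newer version is available, or
        // (3) the existing file's signature could not be verified.
        Some(current) => available > current || !signature_valid,
    }
}
```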
To access paths on the ESP before or after installing generations, split
EspPaths into general EspPaths that only depend on the path to the ESP
and EspGenerationPaths which additionally depend on generation specific
information (e.g. version number and initrd filename).
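A possible shape for this split is sketched below. The field names and the directory layout (`EFI/nixos`, `EFI/Linux`, the stub file name) are assumptions made for the example and may not match the real structs.

```rust
use std::path::{Path, PathBuf};

/// Paths that depend only on where the ESP is mounted.
pub struct EspPaths {
    pub esp: PathBuf,
    pub nixos: PathBuf,
    pub linux: PathBuf,
}

impl EspPaths {
    pub fn new(esp: impl AsRef<Path>) -> Self {
        let esp = esp.as_ref().to_path_buf();
        let efi = esp.join("EFI");
        Self {
            nixos: efi.join("nixos"),
            linux: efi.join("Linux"),
            esp,
        }
    }
}

/// Paths that additionally depend on one generation.
pub struct EspGenerationPaths {
    pub initrd: PathBuf,
    pub stub: PathBuf,
}

impl EspGenerationPaths {
    pub fn new(esp_paths: &EspPaths, version: u64, initrd_filename: &str) -> Self {
        Self {
            initrd: esp_paths.nixos.join(initrd_filename),
            stub: esp_paths
                .linux
                .join(format!("nixos-generation-{version}.efi")),
        }
    }
}
```

With this split, code that runs before or after the per-generation loop only needs an `EspPaths` value.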
Add an extension to TempDir that allows creating secure tempfiles. This
way, everything related to creating secure tempfiles is bundled in a
single place and can easily be reused.