rosetta/doc/boot-process.txt
2024-11-02 15:09:10 +00:00
================================================================================
|                           Rosetta Operating System                           |
|                           ~~~~~~~~~~~~~~~~~~~~~~~~                           |
|                               The Boot Process                               |
================================================================================
1 Bootloader
------------
The bootloader loads the kernel executable and an initrd.
2 Kernel
--------
The kernel initialises itself, extracts the bootstrap program from the initrd
and executes it.
The initrd is an EC3 image containing (in most cases) two key items:
1) A bootstrap executable.
2) A volume containing the boot filesystem.
This data is stored in several 'tags' within the container:
* VOLU, CTAB, STAB, and XATR for the boot filesystem volume (ignored by
the kernel)
* EXEC for the bootstrap program.
(Technically speaking, the only hard requirement as far as the kernel is
concerned is the EXEC tag. The initrd could contain any number of other
volumes or tags, including none at all.)
The boot filesystem is ignored by the kernel. It is up to the bootstrap
program to make use of it.
The bootstrap program is a static ELF binary in an EXEC tag with an
identifier of 0x555345524C414E44 ("USERLAND" in ASCII).
The key feature of the EXEC tag in an EC3 image is that, for static and flat
binaries, it stores the information needed to run the executable in a
special, easily-parsed data structure. This allows the reader (the
kernel in this case) to load and run the executable without having to
implement an ELF parser.
Such information includes:
* The offset and size of the read-write (.data, .bss) and read-exec
(.text, .rodata) segments both in the file (source) and in virtual
memory (destination).
* The entry point address.
The following structure can be found at the beginning of the EXEC tag.
Any *_faddr variables are offsets relative to the beginning of the tag.
struct ec3_exec_aux {
   uint8_t e_type;           // EXEC_ELF, EXEC_FLAT, etc
   union {
      struct {
         uintptr_t rx_segment_faddr, rx_segment_vaddr;
         size_t rx_segment_fsize, rx_segment_vsize;
         uintptr_t rw_segment_faddr, rw_segment_vaddr;
         size_t rw_segment_fsize, rw_segment_vsize;
         uintptr_t entry;
      } i_elf;
      struct {
         uintptr_t base;
         uintptr_t entry;
      } i_flat;
   } e_info;
};
As long as you aren't reading any volumes, the EC3 image format is simple
enough that finding the EXEC tag and reading its contents is a trivial
operation. This minimises the amount of code needed in the kernel to find
the bootstrap program.
The auxiliary information in the EXEC tag is enough for the kernel to copy
the executable into memory, set the appropriate memory permissions, and
jump to the entry point.
3 Userland Bootstrap
--------------------
The userland bootstrap program (or "userboot") is responsible for making
available the boot filesystem and starting the system management task.
All kernel tasks have negative task IDs, and the userland bootstrap task
will always be given a task ID of zero. Therefore, the first task spawned by
userboot will always have a task ID of 1.
Once the system management process is started, userboot can (but doesn't HAVE
to) exit. The system management task will automatically become the root of
the task tree.
If userboot exits without spawning any other tasks, the action taken will
depend on the command-line arguments given to the kernel.
Some options include:
* Shut the system down
* Restart the system
* Trigger a kernel panic
In most cases, userboot will remain running, providing the system management
task with access to the boot filesystem until other drivers are online, at
which point the bootstrap program will exit.
In more specialised cases, userboot can remain running for the life of the
system. It can wait for the task it spawns to exit before taking some action.
This is useful for automated testing. The bootstrap program can run a program
that will run the test suite (or could itself be a test suite program), wait
for the tests to finish, and then shut down the system.
4 System Management Task
------------------------
The system management task will be in charge of the system for the entire
time the system is up. It is responsible for starting device drivers and
setting up an environment for the system to carry out its intended purpose
(i.e. handling interactive user sessions).
Of course, the system management task can (and certainly should) delegate
these tasks to other system services.
On Rosetta-based systems, system management duties are handled by the systemd
daemon. systemd fulfils a few important roles, including:
1) managing system services, and restarting them if they fail.
2) loading and launching executables.
3) managing the system namespace.
userboot sends commands to systemd to bring up the rest of the userland
environment. During this process, systemd maintains a connection to userboot
to load files from the boot filesystem. You might think that having two tasks
communicate with each other (violating the strict one-way client-server
message flow) would result in deadlocks, but a few key design choices in
userboot and systemd avoid this.
Technically, there is nothing wrong with two tasks waiting on each other, as
long as two THREADS within those tasks don't end up (directly or indirectly)
waiting on each other.
Therefore, to ensure that this principle is not violated:
1) systemd performs all process-launching activities and request-handling
activities on separate threads that never wait on each other. When a
request is received to launch a new process, systemd's request-handler
thread dispatches the request (and the responsibility to respond to the
client) to a separate loader thread. This allows systemd to continue
servicing other requests (including filesystem requests from its own
loader threads).
2) userboot performs all system startup activities (including sending
commands to systemd) and filesystem request-handling activities on
separate threads that never wait on each other.
Because of this, despite the circular communications between userboot and
systemd, messages between the two tasks still technically only travel in a
single direction when you consider their individual threads:
   userboot[init] ----------------> systemd[req-handler]
         |                                   :
   ══ NO COMMUNICATION ══                    : (async task dispatch)
         |                                   v
   userboot[fs-handler] <---------- systemd[launcher]

   Key:
      task-name[thread-name]
      --->  Request/reply exchange (the arrow points toward the request
            recipient)
      ...>  Non-blocking action (e.g. scheduling another thread to run)
Technically, systemd[req-handler] schedules systemd[launcher] to run and
doesn't wait on it. Therefore, if userboot[init] sends a request to
systemd[req-handler] to launch a server, it will receive a reply from
systemd[launcher].
Because of the fixed order in which userboot and systemd are started, and
the deterministic assignment of task IDs mentioned in the USERLAND BOOTSTRAP
section, the channels that the two tasks use to communicate with each other
have well-defined locations:
* userboot always has TID 0, and always hosts the boot filesystem on its
first channel, giving a tuple of (nd:0, tid:0, chid:0).
* systemd always has TID 1, and always hosts its system management
interface on its first channel, giving a tuple of (nd:0, tid:1, chid:0).
5 From Userboot to the Root Filesystem
--------------------------------------
Now that we are familiar with the inner workings of these two critical tasks,
let's go through the steps taken to bring up the full userland environment:
1) When userboot starts, it is given (by the kernel) a handle to a pagebuf
object containing the initrd. userboot maps this pagebuf into its
address space and mounts the initrd[1].
2) userboot creates a new task to run the system management service.
userboot contains just enough ELF-related code to do one of the
following:
* if the system management executable is statically-linked, simply copy
the relevant ELF segments into the new task's address space and
create a thread that will start running at the executable's entry
point.
* if the system management executable is dynamically-linked (the more
likely scenario), load the dynamic linker[2] into the new task's
address space and create a new thread that will start running at the
dynamic linker's entry point.
3) systemd initialises the system namespace and mounts the boot filesystem
provided by userboot at '/', temporarily making it the root filesystem.
4) systemd starts the device manager service, emdevd, and instructs it
to scan the system devices. This blocks systemd until the scan is
complete.
5) In response to a scan command, emdevd uses whatever drivers are
available in the current root filesystem to find and initialise as many
devices as possible. Because the boot filesystem only contains the
drivers needed to mount the root filesystem, this scan will be
far from complete, but it will be repeated once the real root
filesystem is available.
6) Eventually the scan will complete, and emdevd will return control
back to systemd. At this point, the storage device containing the
root filesystem has been found and brought online.
7) emdevd provides a devfs-like interface to all the devices on the
system. systemd mounts this pseudo-filesystem at '/dev' in the
system namespace.
8) systemd starts an instance of the filesystem server, fsd, and provides
it with three parameters:
* the path to the device node containing the root filesystem (e.g.
'/dev/disk0s1')
* the name of the filesystem format to be mounted (e.g. 'ext2')
* the mount flags (the root filesystem is always mounted read-only
during boot. once /etc/fstab is accessible, the root filesystem
is re-mounted with the flags it specifies)
9) fsd will load the necessary filesystem driver (e.g. for ext2
filesystems, fsd will load fs-ext2.so) and mount the filesystem
on the provided device.
10) systemd mounts the filesystem provided by fsd to the root of
the system namespace. At this point, the root filesystem is now
available (albeit read-only for now).
Notes:
[1] In this case, mounting doesn't involve the system namespace (until
systemd starts up, there *is* no system namespace), but rather
userboot creating any data structures it needs to be able to privately
locate and read files within the boot image.
[2] Despite being a .so file, the dynamic linker is designed to be a
self-contained position-independent executable with no external
dependencies, in order to avoid a chicken-and-egg situation where the
dynamic linker itself requires a dynamic linker to load. The only
functionality required to load it (beyond copying its code and data
into memory) is finding and iterating through the DYNAMIC segment and
processing any relocation entries contained within.
6 Runlevels
-----------
The state of the system, and what functionality the system has, depends on
which services are running. For example:
* without emdevd or fsd, no filesystems are available.
* without lockdownd, user authentication and authorisation is not
available.
* without airportd, network connectivity is not available.
* without seatd, multiplexing of peripherals between multiple user
sessions is not available.
* without sessiond, user sessions are not available.
... and so on.
Different sets of services can be brought online to tailor the available
functionality. Under systemd, these sets of services are called runlevels.
Runlevels are hierarchical, with higher runlevels building upon the
functionality provided by lower runlevels. As the runlevel increases, the
number of system services running on the machine increases.
6.1 Pre-defined Runlevels
~~~~~~~~~~~~~~~~~~~~~~~~~
Rosetta has a range of pre-defined runlevels:
* Off:
- Instructing systemd to move to this runlevel will shut the system down.
* Minimal:
- Only the root filesystem is available, and is read-only.
- All device drivers are loaded, and all devices are visible.
- All network interfaces are down, and no socket I/O is possible.
- The security service is offline, so no authentication or authorisation
checks can be performed, and the interactive user is effectively root.
- Neither the session nor seat managers are online, so only one session
is supported.
- A basic console and shell are started to allow the user to interact
with the system.
* Single-User: Same as Minimal, except:
- All filesystem mounts prescribed by /etc/fstab are performed.
* Multi-User: Same as Single-User, except:
- The security service is running, allowing user authentication.
- System security and permissions are now enforced.
- The seat and session manager services are running, allowing multiple
user sessions to be running simultaneously.
- Instead of dropping straight into a shell, the interactive user is
presented with a text-based login prompt before their shell is
launched.
* Networking: Same as Multi-User, except:
- The networking service is running, and all network interfaces are
brought up and configured according to system configuration.
* Full Mode: Same as Networking, except:
- The system's display manager is running, allowing for logging in
and interacting with the system via a graphical user interface.
In most circumstances, the system will be running in one of the runlevels
based on Multi-User. Not only does this enable most of the "usual" system
functionality, but it also enforces user authentication and authorisation.
The lower runlevels are mostly used for system administration and
troubleshooting when there is a problem preventing the system from reaching
a higher runlevel.
6.2 How Runlevels Affect Security Enforcement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
User authentication and authorisation depend on the system security service
(lockdownd). Without it, no users can log on to the system, and no permission
checks can be performed. So, how does a system behave when lockdownd isn't
running?
There are a few circumstances where lockdownd may be offline, some
intentional and some not. The system may intentionally be booted in Minimal
or Single-User mode; these runlevels don't start lockdownd, as the
interactive user is root by default. Alternatively, lockdownd may crash
while running on a multi-user system.
So if you are an application or service running on a Rosetta system, and your
attempt to connect to the security service fails because the service has
stopped working, or was never running in the first place, what do you do?
The system management service keeps track of what runlevel the system is
currently running at, and anyone can contact the service to query this
information. So, you can take action depending on the system runlevel:
* If the runlevel is Single-User or below, you know that system security
is not being enforced, so there is no need to contact the security
service.
* If the runlevel is Multi-User or higher, you know that system security
is (or should be) enforced. If the security service cannot be reached
in this case, you should wait for the system management service to
(re)start it. In the worst case scenario, where the security service
cannot be started, all authentication and authorisation actions should
be presumed to fail, so that there is never a lapse in security.
7 From the Root Filesystem to User Interaction
----------------------------------------------
Now that the root filesystem is available, we can start bringing other
system components online. This process culminates in an interactive user
session.
1) systemd instructs emdevd to perform another scan of the system devices.
With a wider range of drivers now available, (hopefully) all devices
will now be detected and initialised.
2) systemd will now start working towards reaching a target runlevel.
Right now, the system is running at the Minimal runlevel. For the
purposes of this document, let's assume that the target runlevel is
Networking, and the system will move through the Single-User and Multi-
User runlevels to get there.
3) In order to reach the Single-User runlevel, the filesystem mounts
specified in /etc/fstab must be performed. The Single-User runlevel
defines a script for systemd to execute, which performs the necessary
mount operations.
4) The Multi-User runlevel is more complex and will require starting a
range of services.
5) First, the security service, lockdownd, is brought online. This is the
pivotal service that converts the system from single-user to multi-user.
vim: shiftwidth=3 expandtab