add existing documentation
375
doc/boot-process.txt
Executable file
@@ -0,0 +1,375 @@
================================================================================
|                           Rosetta Operating System                           |
|                           ~~~~~~~~~~~~~~~~~~~~~~~~                           |
|                               The Boot Process                               |
================================================================================

1 Bootloader
------------

The bootloader loads the kernel executable and an initrd.


2 Kernel
--------

The kernel initialises itself, extracts the bootstrap program from the initrd,
and executes it.

The initrd is an EC3 image containing (in most cases) two key items:
   1) A bootstrap executable.
   2) A volume containing the boot filesystem.

This data is stored in several 'tags' within the container:
   * VOLU, CTAB, STAB, and XATR for the boot filesystem volume (ignored by
     the kernel)
   * EXEC for the bootstrap program.

(Technically speaking, the only hard requirement as far as the kernel is
concerned is the EXEC tag. The initrd could contain any number of other
volumes or tags, including none at all.)

The boot filesystem is ignored by the kernel. It is up to the bootstrap
program to make use of it.

The bootstrap program is a static ELF binary in an EXEC tag with an
identifier of 0x555345524C414E44 ("USERLAND" in ASCII).

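As a quick check of the identifier above, the sketch below unpacks the 64-bit
value one byte at a time, most significant byte first, and yields the ASCII
string. The macro and function names are made up for this example and are not
part of any EC3 interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag identifier quoted in the text above. */
#define EC3_EXEC_ID_USERLAND UINT64_C(0x555345524C414E44)

/*
 * Unpack a 64-bit tag identifier into 8 characters plus a terminator,
 * most significant byte first. `out` must have room for 9 bytes.
 * (ec3_ident_to_string is a hypothetical helper, not an EC3 API.)
 */
static void ec3_ident_to_string(uint64_t ident, char out[9])
{
   assert(out != NULL);
   for (int i = 0; i < 8; i++) {
      /* Shift the i-th most significant byte down into the low 8 bits. */
      out[i] = (char)((ident >> (56 - 8 * i)) & 0xFF);
   }
   out[8] = '\0';
}
```

Calling ec3_ident_to_string(EC3_EXEC_ID_USERLAND, buf) leaves "USERLAND" in
buf.
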
The key feature of the EXEC tag in an EC3 image is that, for static and flat
binaries, it extracts the information needed to run the executable and stores
it in a special data structure for easy parsing. This allows the reader (the
kernel in this case) to load and run the executable without having to
implement an ELF parser.

Such information includes:
   * The offset and size of the read-write (.data, .bss) and read-exec
     (.text, .rodata) segments both in the file (source) and in virtual
     memory (destination).
   * The entry point address.

The following structure can be found at the beginning of the EXEC tag.
Any *_faddr variables are offsets relative to the beginning of the tag.

   struct ec3_exec_aux {
      uint8_t e_type; // EXEC_ELF, EXEC_FLAT, etc

      union {
         struct {
            uintptr_t rx_segment_faddr, rx_segment_vaddr;
            size_t rx_segment_fsize, rx_segment_vsize;

            uintptr_t rw_segment_faddr, rw_segment_vaddr;
            size_t rw_segment_fsize, rw_segment_vsize;

            uintptr_t entry;
         } i_elf;

         struct {
            uintptr_t base;
            uintptr_t entry;
         } i_flat;
      } e_info;
   };

As long as you aren't reading any volumes, the EC3 image format is simple
enough that finding the EXEC tag and reading its contents is a trivial
operation. This minimises the amount of code needed in the kernel to find
the bootstrap program.

The auxiliary information in the EXEC tag is enough for the kernel to copy
the executable into memory, set the appropriate memory permissions, and
jump to the entry point.

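The copy step can be sketched in portable C as follows. This is only an
illustration under stated assumptions: struct segment_desc and load_segment
are hypothetical names, a plain byte buffer stands in for the new task's
address space, and setting page permissions and jumping to the entry point
are left out because they are kernel- and architecture-specific. The four
fields mirror one rx_/rw_ group of struct ec3_exec_aux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * One loadable segment, mirroring the rx_/rw_ field groups in
 * struct ec3_exec_aux. (Hypothetical type used only for this sketch.)
 */
struct segment_desc {
   uintptr_t faddr; /* offset of the segment data from the start of the tag */
   uintptr_t vaddr; /* destination virtual address */
   size_t fsize;    /* bytes present in the tag */
   size_t vsize;    /* bytes occupied in memory (>= fsize) */
};

/*
 * Copy one segment out of the EXEC tag into `mem`, which stands in for the
 * new task's address space starting at virtual address `mem_base`.
 * Bytes beyond fsize (e.g. .bss) are zero-filled.
 * Returns 0 on success, -1 if the descriptor does not fit.
 */
static int load_segment(const uint8_t *tag, size_t tag_size,
                        uint8_t *mem, uintptr_t mem_base, size_t mem_size,
                        const struct segment_desc *seg)
{
   assert(tag != NULL && mem != NULL && seg != NULL);

   /* Reject descriptors that point outside the tag or the target window. */
   if (seg->fsize > seg->vsize) return -1;
   if (seg->faddr > tag_size || seg->fsize > tag_size - seg->faddr) return -1;
   if (seg->vaddr < mem_base) return -1;
   size_t off = seg->vaddr - mem_base;
   if (off > mem_size || seg->vsize > mem_size - off) return -1;

   /* Copy the file-backed part, then zero the remainder of the segment. */
   memcpy(mem + off, tag + seg->faddr, seg->fsize);
   memset(mem + off + seg->fsize, 0, seg->vsize - seg->fsize);
   return 0;
}
```

A loader built along these lines would call load_segment once for the
read-exec segment and once for the read-write segment before starting a
thread at the recorded entry point.
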

3 Userland Bootstrap
--------------------

The userland bootstrap program (or "userboot") is responsible for making
the boot filesystem available and starting the system management task.

Kernel tasks always have a negative task ID, and the userland bootstrap task
will always be given a task ID of zero. Therefore, the first task spawned by
userboot will always have a task ID of 1.

Once the system management task is started, userboot can (but doesn't HAVE
to) exit. The system management task will automatically become the root of
the task tree.

If userboot exits without spawning any other tasks, the action taken will
depend on the command-line arguments given to the kernel.

Some options include:
   * Shut the system down
   * Restart the system
   * Trigger a kernel panic

In most cases, userboot will remain running, providing the system management
task with access to the boot filesystem until other drivers are online, at
which point the bootstrap program will exit.

In more specialised cases, userboot can remain running for the life of the
system. It can wait for the task it spawns to exit before taking some action.

This is useful for automated testing. The bootstrap program can run a program
that will run the test suite (or could itself be a test suite program), wait
for the tests to finish, and then shut down the system.


4 System Management Task
------------------------

The system management task will be in charge of the system for the entire
time the system is up. It is responsible for starting device drivers and
setting up an environment for the system to carry out its intended purpose
(e.g. handling interactive user sessions).

Of course, the system management task can (and certainly should) delegate
these tasks to other system services.

On Rosetta-based systems, system management duties are handled by the systemd
daemon. systemd fulfils a few important roles, including:
   1) managing system services, and restarting them if they fail.
   2) loading and launching executables.
   3) managing the system namespace.

userboot sends commands to systemd to bring up the rest of the userland
environment. During this process, systemd maintains a connection to userboot
to load files from the boot filesystem. You might think that having two tasks
communicate with each other (violating the strict one-way client-server
message flow) would result in deadlocks, but a few key design choices in
userboot and systemd avoid this.

Technically, there is nothing wrong with two tasks waiting on each other, as
long as two THREADS within those tasks don't end up (directly or indirectly)
waiting on each other.

Therefore, to ensure that this principle is not violated:
   1) systemd performs all process-launching activities and request-handling
      activities on separate threads that never wait on each other. When a
      request is received to launch a new process, systemd's request-handler
      thread dispatches the request (and the responsibility to respond to the
      client) to a separate loader thread. This allows systemd to continue
      servicing other requests (including filesystem requests from its own
      loader threads).
   2) userboot performs all system startup activities (including sending
      commands to systemd) and filesystem request-handling activities on
      separate threads that never wait on each other.

Because of this, despite the circular communications between userboot and
systemd, messages between the two tasks still technically only travel in a
single direction when you consider their individual threads:

        userboot[init]        ----->   systemd[req-handler]
              |                                 :
   ═════NO═COMMUNICATION═════                   :  (async task dispatch)
              |                                 v
     userboot[fs-handler]     <-----    systemd[launcher]

   key:
      task-name[thread-name]
      --->  Request/reply exchange (the arrow points toward the request
            recipient)
      ...>  Non-blocking action (e.g. scheduling another thread to run)

Technically, systemd[req-handler] schedules systemd[launcher] to run and
doesn't wait on it. Therefore, if userboot[init] sends a request to
systemd[req-handler] to launch a server, it will receive a reply from
systemd[launcher].

Because of the fixed order in which userboot and systemd are started, and
the deterministic assignment of task IDs mentioned in the USERLAND BOOTSTRAP
section, the channels that the two tasks use to communicate with each other
have well-defined locations:

   * userboot always has TID 0, and always hosts the boot filesystem on its
     first channel, giving a tuple of (nd:0, tid:0, chid:0).
   * systemd always has TID 1, and always hosts its system management
     interface on its first channel, giving a tuple of (nd:0, tid:1, chid:0).

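Because these two tuples never change, a client could hard-code them. The
sketch below shows one way to write them down in C. The struct layout, the
field widths, the reading of "nd" as a node identifier, and every name in the
block are assumptions made for illustration; none of them come from Rosetta's
headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A channel address as a (node, task, channel) tuple.
 * Layout and names are hypothetical.
 */
struct chan_addr {
   int32_t nd;    /* node ID (assumed meaning of "nd") */
   int32_t tid;   /* task ID (kernel tasks are negative, userboot is 0) */
   int32_t chid;  /* channel ID within the task */
};

/* Well-known endpoints described in the text. */
static const struct chan_addr BOOTFS_ADDR  = { .nd = 0, .tid = 0, .chid = 0 };
static const struct chan_addr SYSMGMT_ADDR = { .nd = 0, .tid = 1, .chid = 0 };

/* Compare two channel addresses field by field. */
static bool chan_addr_equal(const struct chan_addr *a,
                            const struct chan_addr *b)
{
   assert(a != NULL && b != NULL);
   return a->nd == b->nd && a->tid == b->tid && a->chid == b->chid;
}
```
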

5 From Userboot to the Root Filesystem
--------------------------------------

Now that we are familiar with the inner workings of these two critical tasks,
let's go through the steps taken to bring up the full userland environment:

   1) When userboot starts, it is given (by the kernel) a handle to a pagebuf
      object containing the initrd. userboot maps this pagebuf into its
      address space and mounts the initrd[1].
   2) userboot creates a new task to run the system management service.
      userboot contains just enough ELF-related code to do one of the
      following:
       * If the system management executable is statically-linked, simply copy
         the relevant ELF segments into the new task's address space and
         create a thread that will start running at the executable's entry
         point.
       * If the system management executable is dynamically-linked (the more
         likely scenario), load the dynamic linker[2] into the new task's
         address space and create a new thread that will start running at the
         dynamic linker's entry point.
   3) systemd initialises the system namespace and mounts the boot filesystem
      provided by userboot at '/', temporarily making it the root filesystem.
   4) systemd starts the device manager service, emdevd, and instructs it
      to scan the system devices. This blocks systemd until the scan is
      complete.
   5) In response to a scan command, emdevd uses whatever drivers are
      available in the current root filesystem to find and initialise as many
      devices as possible. Because the boot filesystem only contains the
      drivers needed to mount the root filesystem, this scan will be
      far from complete, but it will be repeated once the real root
      filesystem is available.
   6) Eventually the scan will complete, and emdevd will return control
      back to systemd. At this point, the storage device containing the
      root filesystem has been found and brought online.
   7) emdevd provides a devfs-like interface to all the devices on the
      system. systemd mounts this pseudo-filesystem at '/dev' in the
      system namespace.
   8) systemd starts an instance of the filesystem server, fsd, and provides
      it with three parameters:
       * the path to the device node containing the root filesystem (e.g.
         '/dev/disk0s1')
       * the name of the filesystem format to be mounted (e.g. 'ext2')
       * the mount flags (the root filesystem is always mounted read-only
         during boot. Once /etc/fstab is accessible, the root filesystem
         is re-mounted with the flags it specifies)
   9) fsd will load the necessary filesystem driver (e.g. for ext2
      filesystems, fsd will load fs-ext2.so) and mount the filesystem
      on the provided device.
  10) systemd mounts the filesystem provided by fsd to the root of
      the system namespace. At this point, the root filesystem is
      available (albeit read-only for now).

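Steps 8 and 9 can be illustrated with a small sketch: a record holding the
three parameters handed to fsd, and a helper that derives the driver file
name from the format name. The "fs-<format>.so" pattern is inferred from the
single ext2 example in step 9, and the struct, flag value and function name
are all hypothetical. The real lookup in fsd may work differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define FSD_MOUNT_RDONLY 0x1u   /* hypothetical flag value */

/* The three parameters systemd gives to fsd (hypothetical layout). */
struct fsd_mount_params {
   const char *device;   /* e.g. "/dev/disk0s1" */
   const char *format;   /* e.g. "ext2" */
   unsigned flags;       /* e.g. FSD_MOUNT_RDONLY during boot */
};

/*
 * Build the driver file name for the requested filesystem format.
 * Returns 0 on success, -1 if the format is empty or the name
 * does not fit in `out`.
 */
static int fsd_driver_name(const struct fsd_mount_params *p,
                           char *out, size_t out_size)
{
   assert(p != NULL && p->format != NULL && out != NULL);
   if (p->format[0] == '\0') return -1;

   /* snprintf reports the length it wanted to write, so truncation
      shows up as a return value >= out_size. */
   int n = snprintf(out, out_size, "fs-%s.so", p->format);
   if (n < 0 || (size_t)n >= out_size) return -1;
   return 0;
}
```

With format set to "ext2" and a large enough buffer, the helper produces
"fs-ext2.so", matching the example in step 9.
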
Notes:
   [1] In this case, mounting doesn't involve the system namespace (until
       systemd starts up, there *is* no system namespace), but rather
       userboot creating any data structures it needs to be able to privately
       locate and read files within the boot image.
   [2] Despite being a .so file, the dynamic linker is designed to be a
       self-contained position-independent executable with no external
       dependencies, in order to avoid a chicken-and-egg situation where the
       dynamic linker itself requires a dynamic linker to load. The only
       functionality required to load it (beyond copying its code and data
       into memory) is finding and iterating through the DYNAMIC segment,
       processing any relocation entries contained within.

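The relocation pass described in note [2] might look roughly like the sketch
below. It rests on several assumptions that the text does not confirm: a
64-bit ELF image linked at base address 0 (so addresses in the DYNAMIC array
double as offsets into the loaded image), RELA-style relocation records, and
the x86-64 R_X86_64_RELATIVE relocation type as the only kind present. The
struct and function names are invented for the example. The numeric constants
are the standard ELF values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal ELF64 definitions (standard values from the ELF and psABI specs). */
#define DT_NULL 0
#define DT_RELA 7
#define DT_RELASZ 8
#define R_X86_64_RELATIVE 8

struct dyn_entry { int64_t tag; uint64_t val; };            /* Elf64_Dyn */
struct rela_entry { uint64_t offset, info; int64_t addend; }; /* Elf64_Rela */

/*
 * Apply the RELATIVE relocations listed in a DYNAMIC array to an image
 * that has been copied into `image` and will run at address `load_base`.
 * Returns the number of relocations applied, or -1 on malformed input.
 */
static long apply_relative_relocs(uint8_t *image, size_t image_size,
                                  uint64_t load_base,
                                  const struct dyn_entry *dynamic)
{
   assert(image != NULL && dynamic != NULL);
   uint64_t rela_off = 0, rela_size = 0;

   /* Walk the DYNAMIC array until the DT_NULL terminator. */
   for (const struct dyn_entry *d = dynamic; d->tag != DT_NULL; d++) {
      if (d->tag == DT_RELA) rela_off = d->val;
      else if (d->tag == DT_RELASZ) rela_size = d->val;
   }
   if (rela_off > image_size || rela_size > image_size - rela_off) return -1;

   long applied = 0;
   for (uint64_t pos = 0; pos + sizeof(struct rela_entry) <= rela_size;
        pos += sizeof(struct rela_entry)) {
      struct rela_entry r;
      memcpy(&r, image + rela_off + pos, sizeof r);

      /* The low 32 bits of r_info hold the relocation type. */
      if ((r.info & 0xFFFFFFFFu) != R_X86_64_RELATIVE) continue;
      if (r.offset > image_size ||
          image_size - r.offset < sizeof(uint64_t)) return -1;

      /* RELATIVE: store (load base + addend) at the target offset. */
      uint64_t value = load_base + (uint64_t)r.addend;
      memcpy(image + r.offset, &value, sizeof value);
      applied++;
   }
   return applied;
}
```
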

6 Runlevels
-----------

The state of the system, and what functionality the system has, depends on
which services are running. For example:
   * without deviced or fsd, no filesystems are available.
   * without lockdownd, user authentication and authorisation are not
     available.
   * without airportd, network connectivity is not available.
   * without seatd, multiplexing of peripherals between multiple user
     sessions is not available.
   * without sessiond, user sessions are not available.
   ... and so on.

Different sets of services can be brought online to tailor the available
functionality. Under systemd, these sets of services are called runlevels.
Runlevels are hierarchical, with higher runlevels building upon the
functionality provided by lower runlevels. As the runlevel increases, the
number of system services running on the machine increases.


6.1 Pre-defined Runlevels
~~~~~~~~~~~~~~~~~~~~~~~~~

Rosetta has a range of pre-defined runlevels:
   * Off:
      - Instructing systemd to move to this runlevel will shut the system down.
   * Minimal:
      - Only the root filesystem is available, and is read-only.
      - All device drivers are loaded, and all devices are visible.
      - All network interfaces are down, and no socket I/O is possible.
      - The security service is offline, so no authentication or authorisation
        checks can be performed, and the interactive user is effectively root.
      - Neither the session nor seat managers are online, so only one session
        is supported.
      - A basic console and shell are started to allow the user to interact
        with the system.
   * Single-User: Same as Minimal, except:
      - All filesystem mounts prescribed by /etc/fstab are performed.
   * Multi-User: Same as Single-User, except:
      - The security service is running, allowing user authentication.
      - System security and permissions are now enforced.
      - The seat and session manager services are running, allowing multiple
        user sessions to be running simultaneously.
      - Instead of dropping straight into a shell, the interactive user is
        presented with a text-based login prompt before their shell is
        launched.
   * Networking: Same as Multi-User, except:
      - The networking service is running, and all network interfaces are
        brought up and configured according to system configuration.
   * Full Mode: Same as Networking, except:
      - The system's display manager is running, allowing for logging in
        and interacting with the system via a graphical user interface.

In most circumstances, the system will be running in one of the runlevels
based on Multi-User. Not only does this enable most of the "usual" system
functionality, but it also enforces user authentication and authorisation.
The lower runlevels are mostly used for system administration and
troubleshooting when there is a problem preventing the system from reaching
a higher runlevel.


6.2 How Runlevels Affect Security Enforcement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

User authentication and authorisation depend on the system security service
(lockdownd). Without it, no users can log on to the system, and no permission
checks can be performed. So, how does a system behave when lockdownd isn't
running?

There are a few circumstances where lockdownd may be offline, some
intentional and some unintentional. The system may be booted in Minimal or
Single-User mode. These runlevels don't start lockdownd, as the interactive
user is root by default. Alternatively, lockdownd may crash while running on
a multi-user system.

So if you are an application or service running on a Rosetta system, and your
attempt to connect to the security service fails because the service has
stopped working, or was never running in the first place, what do you do?

The system management service keeps track of what runlevel the system is
currently running at, and anyone can contact the service to query this
information. So, you can take action depending on the system runlevel:
   * If the runlevel is Single-User or below, you know that system security
     is not being enforced, so there is no need to contact the security
     service.
   * If the runlevel is Multi-User or higher, you know that system security
     is (or should be) enforced. If the security service cannot be reached
     in this case, you should wait for the system management service to
     (re)start it. In the worst case scenario, where the security service
     cannot be started, all authentication and authorisation actions should
     be presumed to fail, so that there is never a lapse in security.

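The decision rule in the two bullets above can be written as a small pure
function. This is a sketch only: the enum names, the function name, and the
idea of passing "reachable" and "gave up" as booleans are assumptions for the
example. How a client actually learns the runlevel or the service state is
not specified here.

```c
#include <assert.h>
#include <stdbool.h>

/* Pre-defined runlevels in ascending order (names are illustrative). */
enum runlevel {
   RL_OFF, RL_MINIMAL, RL_SINGLE_USER, RL_MULTI_USER, RL_NETWORKING, RL_FULL
};

/* What a client should do with an authentication/authorisation request. */
enum auth_action {
   AUTH_SKIP,         /* security is not enforced at this runlevel */
   AUTH_ASK_SERVICE,  /* forward the request to the security service */
   AUTH_WAIT,         /* service unreachable: wait for it to be (re)started */
   AUTH_DENY          /* service cannot be started: fail closed */
};

static enum auth_action auth_decide(enum runlevel current,
                                    bool service_reachable,
                                    bool restart_gave_up)
{
   assert((int)current >= RL_OFF && (int)current <= RL_FULL);

   /* Single-User or below: nothing to enforce, no need to ask. */
   if (current < RL_MULTI_USER) return AUTH_SKIP;

   /* Multi-User or higher: security is (or should be) enforced. */
   if (service_reachable) return AUTH_ASK_SERVICE;
   return restart_gave_up ? AUTH_DENY : AUTH_WAIT;
}
```

The important property is the last line: once the runlevel says security
should be enforced, an unreachable service never results in access being
granted.
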

7 From the Root Filesystem to User Interaction
----------------------------------------------

Now that the root filesystem is available, we can start bringing other
system components online. This process culminates in an interactive user
session.

   1) systemd instructs emdevd to perform another scan of the system devices.
      With a wider range of drivers now available, (hopefully) all devices
      will now be detected and initialised.
   2) systemd will now start working towards reaching a target runlevel.
      Right now, the system is running at the Minimal runlevel. For the
      purposes of this document, let's assume that the target runlevel is
      Networking, and the system will move through the Single-User and
      Multi-User runlevels to get there.
   3) In order to reach the Single-User runlevel, the filesystem mounts
      specified in /etc/fstab must be performed. The Single-User runlevel
      defines a script for systemd to execute, which performs the necessary
      mount operations.
   4) The Multi-User runlevel is more complex and will require starting a
      range of services.
   5) First, the security service, lockdownd, is brought online. This is the
      pivotal service that converts the system from single-user to multi-user.


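Step 2 describes the system passing through each intermediate runlevel on the
way to its target. A minimal sketch of that walk is shown below, assuming the
runlevels can be modelled as an ordered enum. The enum and function names are
hypothetical and are not taken from systemd's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* Pre-defined runlevels in ascending order (names are illustrative). */
enum runlevel {
   RL_OFF, RL_MINIMAL, RL_SINGLE_USER, RL_MULTI_USER, RL_NETWORKING, RL_FULL
};

/*
 * List the runlevels that must be entered, in order, to raise the system
 * from `current` to `target`. Writes at most `cap` entries to `out` and
 * returns the number written (0 if target is not above current).
 */
static size_t runlevel_path(enum runlevel current, enum runlevel target,
                            enum runlevel *out, size_t cap)
{
   assert(out != NULL || cap == 0);
   size_t n = 0;
   for (int next = (int)current + 1; next <= (int)target && n < cap; next++) {
      out[n++] = (enum runlevel)next;
   }
   return n;
}
```

For the example in step 2, going from Minimal to Networking yields three
entries: Single-User, Multi-User, then Networking.
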
vim: shiftwidth=3 expandtab