================================================================================
|                           Rosetta Operating System                           |
|                           ~~~~~~~~~~~~~~~~~~~~~~~~                           |
|                               The Boot Process                               |
================================================================================

1 Bootloader
------------

The bootloader loads the kernel executable and an initrd.

2 Kernel
--------

The kernel initialises itself, extracts the bootstrap program from the
initrd, and executes it.

The initrd is an EC3 image containing (in most cases) two key items:

   1) A bootstrap executable.
   2) A volume containing the boot filesystem.

This data is stored in several 'tags' within the container:

   * VOLU, CTAB, STAB, and XATR for the boot filesystem volume (ignored by
     the kernel)
   * EXEC for the bootstrap program.

(Technically speaking, the only hard requirement as far as the kernel is
concerned is the EXEC tag. The initrd could contain any number of other
volumes or tags, including none at all.)

The boot filesystem is ignored by the kernel; it is up to the bootstrap
program to make use of it.

The bootstrap program is a static ELF binary in an EXEC tag with an
identifier of 0x555345524C414E44 ("USERLAND" in ASCII).

The key feature of the EXEC tag in an EC3 image is that, for static and flat
binaries, it extracts the information needed to run the executable and
stores it in a special data structure for easy parsing. This allows the
reader (the kernel, in this case) to load and run the executable without
having to implement an ELF parser. Such information includes:

   * The offset and size of the read-write (.data, .bss) and read-exec
     (.text, .rodata) segments, both in the file (source) and in virtual
     memory (destination).
   * The entry point address.

The following structure can be found at the beginning of the EXEC tag. Any
*_faddr fields are offsets relative to the beginning of the tag.
   struct ec3_exec_aux {
      uint8_t e_type;   // EXEC_ELF, EXEC_FLAT, etc.

      union {
         struct {
            uintptr_t rx_segment_faddr, rx_segment_vaddr;
            size_t    rx_segment_fsize, rx_segment_vsize;

            uintptr_t rw_segment_faddr, rw_segment_vaddr;
            size_t    rw_segment_fsize, rw_segment_vsize;

            uintptr_t entry;
         } i_elf;

         struct {
            uintptr_t base;
            uintptr_t entry;
         } i_flat;
      } e_info;
   };

As long as you aren't reading any volumes, the EC3 image format is simple
enough that finding the EXEC tag and reading its contents is a trivial
operation. This minimises the amount of code needed in the kernel to find
the bootstrap program. The auxiliary information in the EXEC tag is enough
for the kernel to copy the executable into memory, set the appropriate
memory permissions, and jump to the entry point.

3 Userland Bootstrap
--------------------

The userland bootstrap program (or "userboot") is responsible for making the
boot filesystem available and for starting the system management task.

Kernel tasks have negative task IDs, and the userland bootstrap task is
always given a task ID of zero. Therefore, the first task spawned by
userboot will always have a task ID of 1.

Once the system management process is started, userboot can (but doesn't
HAVE to) exit. The system management task will automatically become the root
of the task tree.

If userboot exits without spawning any other tasks, the action taken will
depend on the command-line arguments given to the kernel. Some options
include:

   * Shut the system down
   * Restart the system
   * Trigger a kernel panic

In most cases, userboot will remain running, providing the system management
task with access to the boot filesystem until other drivers are online, at
which point the bootstrap program will exit.

In more specialised cases, userboot can remain running for the life of the
system. It can wait for the task it spawns to exit before taking some
action. This is useful for automated testing.
The bootstrap program can run a program that will run the test suite (or
could itself be a test-suite program), wait for the tests to finish, and
then shut down the system.

4 System Management Task
------------------------

The system management task is in charge of the system for the entire time
the system is up. It is responsible for starting device drivers and setting
up an environment for the system to carry out its intended purpose (e.g.
handling interactive user sessions). Of course, the system management task
can (and certainly should) delegate these tasks to other system services.

On Rosetta-based systems, system management duties are handled by the
systemd daemon. systemd fulfils a few important roles, including:

   1) Managing system services, and restarting them if they fail.
   2) Loading and launching executables.
   3) Managing the system namespace.

userboot sends commands to systemd to bring up the rest of the userland
environment. During this process, systemd maintains a connection to userboot
to load files from the boot filesystem.

You might think that having two tasks communicate with each other (violating
the strict one-way client-server message flow) would result in deadlocks,
but a few key design choices in userboot and systemd avoid this.
Technically, there is nothing wrong with two tasks waiting on each other, as
long as two THREADS within those tasks don't end up (directly or indirectly)
waiting on each other. Therefore, to ensure that this principle is not
violated:

   1) systemd performs all process-launching activities and request-handling
      activities on separate threads that never wait on each other. When a
      request is received to launch a new process, systemd's request-handler
      thread dispatches the request (and the responsibility to respond to
      the client) to a separate loader thread. This allows systemd to
      continue servicing other requests (including filesystem requests from
      its own loader threads).
   2) userboot performs all system startup activities (including sending
      commands to systemd) and filesystem request-handling activities on
      separate threads that never wait on each other.

Because of this, despite the circular communications between userboot and
systemd, messages between the two tasks still technically only travel in a
single direction when you consider their individual threads:

   userboot[init] -----> systemd[req-handler]
               |                   :
   ═════NO═COMMUNICATION═════      : (async task dispatch)
               |                   v
   userboot[fs-handler] <----- systemd[launcher]

   key:
      task-name[thread-name]
      --->   Request/reply exchange (the arrow points toward the request
             recipient)
      ...>   Non-blocking action (e.g. scheduling another thread to run)

Technically, systemd[req-handler] schedules systemd[launcher] to run and
doesn't wait on it. Therefore, if userboot[init] sends a request to
systemd[req-handler] to launch a server, it will receive a reply from
systemd[launcher].

Because of the fixed order in which userboot and systemd are started, and
the deterministic assignment of task IDs mentioned in the USERLAND BOOTSTRAP
section, the channels that the two tasks use to communicate with each other
have well-defined locations:

   * userboot always has TID 0, and always hosts the boot filesystem on its
     first channel, giving a tuple of (nd:0, tid:0, chid:0).
   * systemd always has TID 1, and always hosts its system management
     interface on its first channel, giving a tuple of (nd:0, tid:1,
     chid:0).

5 From Userboot to the Root Filesystem
--------------------------------------

Now that we are familiar with the inner workings of these two critical
tasks, let's go through the steps taken to bring up the full userland
environment:

   1) When userboot starts, it is given (by the kernel) a handle to a
      pagebuf object containing the initrd. userboot maps this pagebuf into
      its address space and mounts the initrd[1].

   2) userboot creates a new task to run the system management service.
      userboot contains just enough ELF-related code to do one of the
      following:

      * If the system management executable is statically linked, simply
        copy the relevant ELF segments into the new task's address space and
        create a thread that will start running at the executable's entry
        point.
      * If the system management executable is dynamically linked (the more
        likely scenario), load the dynamic linker[2] into the new task's
        address space and create a new thread that will start running at the
        dynamic linker's entry point.

   3) systemd initialises the system namespace and mounts the boot
      filesystem provided by userboot at '/', temporarily making it the
      root filesystem.

   4) systemd starts the device manager service, emdevd, and instructs it
      to scan the system devices. This blocks systemd until the scan is
      complete.

   5) In response to a scan command, emdevd uses whatever drivers are
      available in the current root filesystem to find and initialise as
      many devices as possible. Because the boot filesystem only contains
      the drivers needed to mount the root filesystem, this scan will be
      far from complete, but it will be repeated once the real root
      filesystem is available.

   6) Eventually the scan will complete, and emdevd will return control
      back to systemd. At this point, the storage device containing the
      root filesystem has been found and brought online.

   7) emdevd provides a devfs-like interface to all the devices on the
      system. systemd mounts this pseudo-filesystem at '/dev' in the system
      namespace.

   8) systemd starts an instance of the filesystem server, fsd, and
      provides it with three parameters:

      * The path to the device node containing the root filesystem (e.g.
        '/dev/disk0s1')
      * The name of the filesystem format to be mounted (e.g. 'ext2')
      * The mount flags (the root filesystem is always mounted read-only
        during boot; once /etc/fstab is accessible, the root filesystem is
        re-mounted with the flags it specifies)

   9) fsd will load the necessary filesystem driver (e.g.
      for ext2 filesystems, fsd will load fs-ext2.so) and mount the
      filesystem on the provided device.

  10) systemd mounts the filesystem provided by fsd to the root of the
      system namespace. At this point, the root filesystem is available
      (albeit read-only for now).

Notes:

[1] In this case, mounting doesn't involve the system namespace (until
    systemd starts up, there *is* no system namespace), but rather userboot
    creating any data structures it needs to be able to privately locate
    and read files within the boot image.

[2] Despite being a .so file, the dynamic linker is designed to be a
    self-contained position-independent executable with no external
    dependencies, in order to avoid a chicken-and-egg situation where the
    dynamic linker itself requires a dynamic linker to load. The only
    functionality required to load it (beyond copying its code and data
    into memory) is finding and iterating through the DYNAMIC segment,
    processing any relocation entries contained within.

6 Runlevels
-----------

The state of the system, and what functionality the system has, depend on
which services are running. For example:

   * Without emdevd or fsd, no filesystems are available.
   * Without lockdownd, user authentication and authorisation are not
     available.
   * Without airportd, network connectivity is not available.
   * Without seatd, multiplexing of peripherals between multiple user
     sessions is not available.
   * Without sessiond, user sessions are not available.

...and so on. Different sets of services can be brought online to tailor
the available functionality. Under systemd, these sets of services are
called runlevels. Runlevels are hierarchical, with higher runlevels
building upon the functionality provided by lower runlevels. As the
runlevel increases, so does the number of system services running on the
machine.

6.1 Pre-defined Runlevels
~~~~~~~~~~~~~~~~~~~~~~~~~

Rosetta has a range of pre-defined runlevels:

   * Off:
     - Instructing systemd to move to this runlevel will shut the system
       down.
   * Minimal:
     - Only the root filesystem is available, and it is read-only.
     - All device drivers are loaded, and all devices are visible.
     - All network interfaces are down, and no socket I/O is possible.
     - The security service is offline, so no authentication or
       authorisation checks can be performed, and the interactive user is
       effectively root.
     - Neither the session nor seat managers are online, so only one
       session is supported.
     - A basic console and shell are started to allow the user to interact
       with the system.

   * Single-User: Same as Minimal, except:
     - All filesystem mounts prescribed by /etc/fstab are performed.

   * Multi-User: Same as Single-User, except:
     - The security service is running, allowing user authentication.
     - System security and permissions are now enforced.
     - The seat and session manager services are running, allowing multiple
       user sessions to run simultaneously.
     - Instead of dropping straight into a shell, the interactive user is
       presented with a text-based login prompt before their shell is
       launched.

   * Networking: Same as Multi-User, except:
     - The networking service is running, and all network interfaces are
       brought up and configured according to the system configuration.

   * Full Mode: Same as Networking, except:
     - The system's display manager is running, allowing the user to log in
       and interact with the system via a graphical user interface.

In most circumstances, the system will be running in one of the runlevels
based on Multi-User. Not only does this enable most of the "usual" system
functionality, but it also enforces user authentication and authorisation.
The lower runlevels are mostly used for system administration and
troubleshooting when there is a problem preventing the system from reaching
a higher runlevel.

6.2 How Runlevels Affect Security Enforcement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

User authentication and authorisation depend on the system security service
(lockdownd).
Without it, no users can log on to the system, and no permission checks can
be performed. So how does the system behave when lockdownd isn't running?

There are a few circumstances where lockdownd may be offline, some
intentional and some unintentional. The system may be booted in Minimal or
Single-User mode; these runlevels don't start lockdownd, as the interactive
user is root by default. However, lockdownd may also crash while running on
a multi-user system.

So, if you are an application or service running on a Rosetta system, and
your attempt to connect to the security service fails because the service
has stopped working (or was never running in the first place), what do you
do?

The system management service keeps track of the runlevel the system is
currently running at, and anyone can contact the service to query this
information. You can therefore take action depending on the system runlevel:

   * If the runlevel is Single-User or below, you know that system security
     is not being enforced, so there is no need to contact the security
     service.
   * If the runlevel is Multi-User or higher, you know that system security
     is (or should be) enforced. If the security service cannot be reached
     in this case, you should wait for the system management service to
     (re)start it.

In the worst-case scenario, where the security service cannot be started,
all authentication and authorisation actions should be presumed to fail, so
that there is never a lapse in security.

7 From the Root Filesystem to User Interaction
----------------------------------------------

Now that the root filesystem is available, we can start bringing other
system components online. This process culminates in an interactive user
session.

   1) systemd instructs emdevd to perform another scan of the system
      devices. With a wider range of drivers now available, (hopefully) all
      devices will now be detected and initialised.

   2) systemd will now start working towards reaching a target runlevel.
      Right now, the system is running at the Minimal runlevel. For the
      purposes of this document, let's assume that the target runlevel is
      Networking, and that the system will move through the Single-User and
      Multi-User runlevels to get there.

   3) In order to reach the Single-User runlevel, the filesystem mounts
      specified in /etc/fstab must be performed. The Single-User runlevel
      defines a script for systemd to execute, which performs the necessary
      mount operations.

   4) The Multi-User runlevel is more complex and requires starting a range
      of services.

   5) First, the security service, lockdownd, is brought online. This is
      the pivotal service that converts the system from single-user to
      multi-user.

vim: shiftwidth=3 expandtab