add existing documentation

2024-11-02 15:09:10 +00:00
parent 1c98ae8856
commit 60130ccd55
14 changed files with 1973 additions and 0 deletions
--- a/doc/boot-process.txt
+++ b/doc/boot-process.txt
@@ -0,0 +1,375 @@
+================================================================================
+|                           Rosetta Operating System                           |
+|                           ~~~~~~~~~~~~~~~~~~~~~~~~                           |
+|                               The Boot Process                               |
+================================================================================
+
+1 Bootloader
+------------
+
+   The bootloader loads the kernel executable and an initrd.
+
+
+2 Kernel
+--------
+
+   The kernel initialises itself, extracts the bootstrap program from the initrd
+   and executes it.
+   
+   The initrd is an EC3 image containing (in most cases) two key items:
+      1) A bootstrap executable.
+      2) A volume containing the boot filesystem.
+   
+   This data is stored in several 'tags' within the container:
+      * VOLU, CTAB, STAB, and XATR for the boot filesystem volume (ignored by
+        the kernel)
+      * EXEC for the bootstrap program.
+   
+   (Technically speaking, the only hard requirement as far as the kernel is
+   concerned is the EXEC tag. The initrd could contain any number of other
+   volumes or tags, including none at all.)
+   
+   The boot filesystem is ignored by the kernel. It is up to the bootstrap
+   program to make use of it.
+   
+   The bootstrap program is a static ELF binary in an EXEC tag with an
+   identifier of 0x555345524C414E44 ("USERLAND" in ASCII).
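+
+   As a quick check of the identifier above, the following sketch unpacks the
+   64-bit value one byte at a time, most significant byte first. The helper
+   names are hypothetical, and the byte order used to store the identifier
+   on disk is not specified here, so this only decodes the numeric value.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unpack a 64-bit tag identifier into 8 ASCII bytes,
 * most significant byte first, followed by a terminating NUL. */
void tag_ident_to_str(uint64_t ident, char out[9])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        out[i] = (char)((ident >> (56 - 8 * i)) & 0xFF);
    }
    out[8] = '\0';
}

/* Hypothetical helper: true if the identifier spells `name` in ASCII. */
bool tag_ident_is(uint64_t ident, const char *name)
{
    char buf[9];
    tag_ident_to_str(ident, buf);
    return strcmp(buf, name) == 0;
}
```

+
+   With these helpers, tag_ident_is(0x555345524C414E44, "USERLAND") is true.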
+   
+   The key feature of the EXEC tag in an EC3 image is that, for static and flat
+   binaries, it extracts the information needed to run the executable and stores
+   it in a special data structure for easy parsing. This allows the reader (the
+   kernel in this case) to load and run the executable without having to
+   implement an ELF parser.
+   
+   Such information includes:
+      * The offset and size of the read-write (.data, .bss) and read-exec
+        (.text, .rodata) segments both in the file (source) and in virtual
+        memory (destination).
+      * The entry point address.
+   
+   The following structure can be found at the beginning of the EXEC tag.
+   Any *_faddr variables are offsets relative to the beginning of the tag.
+
+      struct ec3_exec_aux {
+         uint8_t e_type; // EXEC_ELF, EXEC_FLAT, etc
+
+         union {
+            struct {
+               uintptr_t rx_segment_faddr, rx_segment_vaddr;
+               size_t rx_segment_fsize, rx_segment_vsize;
+
+               uintptr_t rw_segment_faddr, rw_segment_vaddr;
+               size_t rw_segment_fsize, rw_segment_vsize;
+
+               uintptr_t entry;
+            } i_elf;
+
+            struct {
+               uintptr_t base;
+               uintptr_t entry;
+            } i_flat;
+         } e_info;
+      };
+
+   As long as you aren't reading any volumes, the EC3 image format is simple
+   enough that finding the EXEC tag and reading its contents is a trivial
+   operation. This minimises the amount of code needed in the kernel to find
+   the bootstrap program.
+
+   The auxiliary information in the EXEC tag is enough for the kernel to copy
+   the executable into memory, set the appropriate memory permissions, and
+   jump to the entry point.
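+
+   The copy step can be sketched roughly as follows. This is an illustration
+   under stated assumptions, not the kernel's actual loader: the struct
+   repeats only the i_elf fields, the function names are hypothetical, and a
+   plain byte array stands in for the new address space (so each *_vaddr is
+   treated as an offset into that array). It also assumes that a segment
+   whose vsize exceeds its fsize has the remainder zero-filled, as is usual
+   for .bss.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The i_elf fields from struct ec3_exec_aux, repeated for this sketch. */
struct elf_aux {
    uintptr_t rx_segment_faddr, rx_segment_vaddr;
    size_t    rx_segment_fsize, rx_segment_vsize;
    uintptr_t rw_segment_faddr, rw_segment_vaddr;
    size_t    rw_segment_fsize, rw_segment_vsize;
    uintptr_t entry;
};

/* Copy one segment out of the EXEC tag into `mem`, zero-filling the part
 * of the segment that has no backing bytes in the file.
 * Returns 0 on success, -1 if any range is out of bounds. */
int load_segment(const uint8_t *tag, size_t tag_size,
                 uint8_t *mem, size_t mem_size,
                 uintptr_t faddr, size_t fsize,
                 uintptr_t vaddr, size_t vsize)
{
    if (fsize > vsize)
        return -1;
    if (faddr > tag_size || fsize > tag_size - faddr)
        return -1;
    if (vaddr > mem_size || vsize > mem_size - vaddr)
        return -1;

    memcpy(mem + vaddr, tag + faddr, fsize);
    memset(mem + vaddr + fsize, 0, vsize - fsize);
    return 0;
}

/* Load both segments and report the entry point through `entry_out`. */
int load_exec_elf(const uint8_t *tag, size_t tag_size,
                  const struct elf_aux *aux,
                  uint8_t *mem, size_t mem_size,
                  uintptr_t *entry_out)
{
    if (load_segment(tag, tag_size, mem, mem_size,
                     aux->rx_segment_faddr, aux->rx_segment_fsize,
                     aux->rx_segment_vaddr, aux->rx_segment_vsize) != 0)
        return -1;
    if (load_segment(tag, tag_size, mem, mem_size,
                     aux->rw_segment_faddr, aux->rw_segment_fsize,
                     aux->rw_segment_vaddr, aux->rw_segment_vsize) != 0)
        return -1;

    /* A real kernel would set read-exec and read-write permissions on the
     * two ranges here, before jumping to the entry point. */
    *entry_out = aux->entry;
    return 0;
}
```

+
+   Note that no ELF headers are consulted at any point; everything the loader
+   needs comes from the auxiliary structure.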
+
+
+3 Userland Bootstrap
+--------------------
+
+   The userland bootstrap program (or "userboot") is responsible for making
+   available the boot filesystem and starting the system management task.
+   
+   Any kernel tasks have a negative task ID, and the userland bootstrap task 
+   will always be given a task ID of zero. Therefore, the first task spawned by
+   userboot will always have a task ID of 1.
+   
+   Once the system management process is started, userboot can (but doesn't HAVE
+   to) exit. The system management task will automatically become the root of
+   the task tree.
+   
+   If userboot exits without spawning any other tasks, the action taken will
+   depend on the command-line arguments given to the kernel.
+
+   Some options include:
+      * Shut the system down
+      * Restart the system
+      * Trigger a kernel panic
+   
+   In most cases, userboot will remain running, providing the system management
+   task with access to the boot filesystem until other drivers are online, at
+   which point the bootstrap program will exit.
+   
+   In more specialised cases, userboot can remain running for the life of the
+   system. It can wait for the task it spawns to exit before taking some action.
+   
+   This is useful for automated testing. The bootstrap program can run a program
+   that will run the test suite (or could itself be a test suite program), wait
+   for the tests to finish, and then shut down the system.
+
+
+4 System Management Task
+------------------------
+
+   The system management task will be in charge of the system for the entire 
+   time the system is up. It is responsible for starting device drivers and
+   setting up an environment for the system to carry out its intended purpose
+   (e.g. handling interactive user sessions).
+   
+   Of course, the system management task can (and certainly should) delegate
+   these tasks to other system services.
+   
+   On Rosetta-based systems, system management duties are handled by the systemd
+   daemon. systemd fulfills a few important roles, including:
+      1) managing system services, and restarting them if they fail.
+      2) loading and launching executables.
+      3) managing the system namespace.
+   
+   userboot sends commands to systemd to bring up the rest of the userland
+   environment. During this process, systemd maintains a connection to userboot
+   to load files from the boot filesystem. You might think that having two tasks
+   communicate with each other (violating the strict one-way client-server
+   message flow) would result in deadlocks, but a few key design choices in
+   userboot and systemd avoid this.
+   
+   Technically, there is nothing wrong with two tasks waiting on each other, as
+   long as two THREADS within those tasks don't end up (directly or indirectly)
+   waiting on each other.
+   
+   Therefore, to ensure that this principle is not violated:
+      1) systemd performs all process-launching activities and request-handling
+         activities on separate threads that never wait on each other. When a
+         request is received to launch a new process, systemd's request-handler
+         thread dispatches the request (and the responsibility to respond to the
+         client) to a separate loader thread. This allows systemd to continue
+         servicing other requests (including filesystem requests from its own
+         loader threads).
+      2) userboot performs all system startup activities (including sending
+         commands to systemd) and filesystem request-handling activities on
+         separate threads that never wait on each other.
+   
+   Because of this, despite the circular communications between userboot and
+   systemd, messages between the two tasks still technically only travel in a
+   single direction when you consider their individual threads:
+   
+                      userboot[init]    ----->  systemd[req-handler]
+                             |                         :
+                ═════NO═COMMUNICATION═════             : (async task dispatch)
+                             |                         v
+                  userboot[fs-handler]  <-----   systemd[launcher]
+                           
+      key:
+         task-name[thread-name]
+         ---> Request/reply exchange (the arrow points toward the request
+              recipient)
+         ...> Non-blocking action (e.g. scheduling another thread to run)
+   
+   Technically, systemd[req-handler] schedules systemd[launcher] to run and
+   doesn't wait on it. Therefore, if userboot[init] sends a request to
+   systemd[req-handler] to launch a server, it will receive its reply from
+   systemd[launcher].
+
+   Because of the fixed order in which userboot and systemd are started, and
+   the deterministic assignment of task IDs mentioned in the USERLAND BOOTSTRAP
+   section, the channels that the two tasks use to communicate with each other
+   have well-defined locations:
+
+      * userboot always has TID 0, and always hosts the boot filesystem on its
+        first channel, giving a tuple of (nd:0, tid:0, chid:0).
+      * systemd always has TID 1, and always hosts its system management
+        interface on its first channel, giving a tuple of (nd:0, tid:1, chid:0).
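+
+   These well-known locations could be written down as constants, as in the
+   sketch below. The struct, its field widths, and the function names are
+   all hypothetical; only the (nd, tid, chid) values come from the list
+   above. The tid field is signed because kernel tasks have negative IDs.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation of an (nd, tid, chid) tuple. */
struct chan_addr {
    uint32_t nd;
    int32_t  tid;   /* signed: kernel tasks have negative task IDs */
    uint32_t chid;
};

/* Boot filesystem: userboot is TID 0 and serves it on its first channel. */
struct chan_addr userboot_bootfs_chan(void)
{
    struct chan_addr a = { 0, 0, 0 };
    return a;
}

/* System management interface: systemd is TID 1, first channel. */
struct chan_addr systemd_mgmt_chan(void)
{
    struct chan_addr a = { 0, 1, 0 };
    return a;
}

/* Field-by-field comparison of two channel addresses. */
bool chan_addr_eq(struct chan_addr a, struct chan_addr b)
{
    return a.nd == b.nd && a.tid == b.tid && a.chid == b.chid;
}
```
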
+   
+
+5 From Userboot to the Root Filesystem
+--------------------------------------
+
+   Now that we are familiar with the inner workings of these two critical tasks,
+   let's go through the steps taken to bring up the full userland environment:
+   
+      1) when userboot starts, it is given (by the kernel) a handle to a pagebuf
+         object containing the initrd. userboot maps this pagebuf into its
+         address space and mounts the initrd[1].
+      2) userboot creates a new task to run the system management service.
+         userboot contains just enough ELF-related code to do one of the
+         following:
+         * if the system management executable is statically-linked, simply copy
+           the relevant ELF segments into the new task's address space and 
+           create a thread that will start running at the executable's entry
+           point.
+         * if the system management executable is dynamically-linked (the more
+           likely scenario), load the dynamic linker[2] into the new task's
+           address space and create a new thread that will start running at the
+           dynamic linker's entry point.
+      3) systemd initialises the system namespace and mounts the boot filesystem
+         provided by userboot at '/', temporarily making it the root filesystem.
+      4) systemd starts the device manager service, emdevd, and instructs it
+         to scan the system devices. this blocks systemd until the scan is
+         complete.
+      5) in response to a scan command, emdevd uses whatever drivers are
+         available in the current root filesystem to find and initialise as many
+         devices as possible. because the boot filesystem only contains the
+         drivers needed to mount the root filesystem, this scan will be
+         far from complete, but it will be repeated once the real root
+         filesystem is available.
+      6) eventually the scan will complete, and emdevd will return control
+         back to systemd. at this point, the storage device containing the
+         root filesystem has been found and brought online.
+      7) emdevd provides a devfs-like interface to all the devices on the
+         system. systemd mounts this pseudo-filesystem at '/dev' in the
+         system namespace.
+      8) systemd starts an instance of the filesystem server, fsd, and provides
+         it with three parameters:
+         * the path to the device node containing the root filesystem (e.g.
+           '/dev/disk0s1')
+         * the name of the filesystem format to be mounted (e.g. 'ext2')
+         * the mount flags (the root filesystem is always mounted read-only
+           during boot. once /etc/fstab is accessible, the root filesystem
+           is re-mounted with the flags it specifies)
+      9) fsd will load the necessary filesystem driver (e.g. for ext2
+         filesystems, fsd will load fs-ext2.so) and mount the filesystem
+         on the provided device.
+     10) systemd mounts the filesystem provided by fsd to the root of
+         the system namespace. at this point, the root filesystem is now
+         available (albeit read-only for now).
+
+   Notes:
+     [1] In this case, mounting doesn't involve the system namespace (until
+         systemd starts up, there *is* no system namespace), but rather
+         userboot creating any data structures it needs to be able to privately
+         locate and read files within the boot image.
+     [2] Despite being a .so file, the dynamic linker is designed to be a
+         self-contained position-independent executable with no external
+         dependencies, in order to avoid a chicken-and-egg situation where the
+         dynamic linker itself requires a dynamic linker to load. The only
+         functionality required to load it (beyond copying its code and data
+         into memory) is finding and iterating through the DYNAMIC segment,
+         processing any relocation entries contained within.
+
+
+6 Runlevels
+-----------
+
+   The state of the system, and what functionality the system has, depends on
+   which services are running. For example:
+      * without emdevd or fsd, no filesystems are available.
+      * without lockdownd, user authentication and authorisation are not
+        available.
+      * without airportd, network connectivity is not available.
+      * without seatd, multiplexing of peripherals between multiple user
+        sessions is not available.
+      * without sessiond, user sessions are not available.
+      ... and so on.
+
+   Different sets of services can be brought online to tailor the available
+   functionality. Under systemd, these sets of services are called runlevels.
+   Runlevels are hierarchical, with higher runlevels building upon the
+   functionality provided by lower runlevels. As the runlevel increases, the
+   number of system services running on the machine increases.
+
+
+   6.1 Pre-defined Runlevels
+   ~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   Rosetta has a range of pre-defined runlevels:
+      * Off:
+        - Instructing systemd to move to this runlevel will shut the system down.
+      * Minimal:
+        - Only the root filesystem is available, and is read-only.
+        - All device drivers are loaded, and all devices are visible.
+        - All network interfaces are down, and no socket I/O is possible.
+        - The security service is offline, so no authentication or authorisation
+          checks can be performed, and the interactive user is effectively root.
+        - Neither the session nor seat managers are online, so only one session
+          is supported.
+        - A basic console and shell are started to allow the user to interact
+          with the system.
+      * Single-User: Same as Minimal, except:
+        - All filesystem mounts prescribed by /etc/fstab are performed.
+      * Multi-User: Same as Single-User, except:
+        - The security service is running, allowing user authentication.
+        - System security and permissions are now enforced.
+        - The seat and session manager services are running, allowing multiple
+          user sessions to be running simultaneously.
+        - Instead of dropping straight into a shell, the interactive user is
+          presented with a text-based login prompt before their shell is
+          launched.
+      * Networking: Same as Multi-User, except:
+        - The networking service is running, and all network interfaces are
+          brought up and configured according to system configuration.
+      * Full Mode: Same as Networking, except:
+        - The system's display manager is running, allowing for logging in
+          and interacting with the system via a graphical user interface.
+
+   In most circumstances, the system will be running in one of the runlevels
+   based on Multi-User. Not only does this enable most of the "usual" system
+   functionality, but it also enforces user authentication and authorisation.
+   The lower runlevels are mostly used for system administration and
+   troubleshooting when there is a problem preventing the system from reaching
+   a higher runlevel.
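+
+   Because runlevels are hierarchical, one way to model them is as an
+   ordered enumeration, with a move to a higher runlevel visiting each
+   intermediate level in turn. The sketch below makes that assumption; the
+   enum constants and the function are hypothetical names, and only upward
+   moves are modelled.
+

```c
#include <assert.h>
#include <stddef.h>

/* Pre-defined runlevels, lowest first (constant names are hypothetical). */
enum runlevel {
    RL_OFF,
    RL_MINIMAL,
    RL_SINGLE_USER,
    RL_MULTI_USER,
    RL_NETWORKING,
    RL_FULL
};

/* Fill `out` with the runlevels to pass through when moving up from
 * `current` to `target`: `current` is excluded, `target` is included.
 * Returns the number of entries written (0 if target <= current). */
size_t runlevel_path(enum runlevel current, enum runlevel target,
                     enum runlevel out[], size_t cap)
{
    size_t n = 0;
    for (int rl = (int)current + 1; rl <= (int)target && n < cap; rl++) {
        out[n++] = (enum runlevel)rl;
    }
    return n;
}
```

+
+   For example, moving from Minimal to Networking yields Single-User,
+   Multi-User, and Networking, in that order.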
+
+
+   6.2 How Runlevels Affect Security Enforcement
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   User authentication and authorisation depend on the system security service
+   (lockdownd). Without it, no users can log on to the system, and no permission
+   checks can be performed. So, how does a system behave when lockdownd isn't
+   running?
+
+   There are a few circumstances where lockdownd may be offline, some
+   intentional and some unintentional. The system may be booted in Minimal or
+   Single-User mode; these runlevels don't start lockdownd, as the interactive
+   user is root by default. Alternatively, lockdownd may crash while running
+   on a multi-user system.
+
+   So if you are an application or service running on a Rosetta system, and your
+   attempt to connect to the security service fails because the service has
+   stopped working, or was never running in the first place, what do you do?
+
+   The system management service keeps track of what runlevel the system is
+   currently running at, and anyone can contact the service to query this
+   information. So, you can take action depending on the system runlevel:
+      * If the runlevel is Single-User or below, you know that system security
+        is not being enforced, so there is no need to contact the security
+        service.
+      * If the runlevel is Multi-User or higher, you know that system security
+        is (or should be) enforced. If the security service cannot be reached
+        in this case, you should wait for the system management service to
+        (re)start it. In the worst case scenario, where the security service
+        cannot be started, all authentication and authorisation actions should
+        be presumed to fail, so that there is never a lapse in security.
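+
+   The decision described above can be summarised as a small function. This
+   is a sketch only: the enum constants, the function, and the two boolean
+   inputs are hypothetical. In particular, how a client learns that the
+   security service cannot be started (e.g. a timeout, or a status reported
+   by the system management service) is left as an assumption.
+

```c
#include <assert.h>
#include <stdbool.h>

/* Pre-defined runlevels, lowest first (constant names are hypothetical). */
enum runlevel {
    RL_OFF,
    RL_MINIMAL,
    RL_SINGLE_USER,
    RL_MULTI_USER,
    RL_NETWORKING,
    RL_FULL
};

enum auth_action {
    AUTH_NOT_ENFORCED,   /* security is off; no need to contact the service */
    AUTH_ASK_SERVICE,    /* contact the security service as normal */
    AUTH_WAIT_RESTART,   /* wait for the service to be (re)started */
    AUTH_FAIL            /* presume the check fails */
};

/* Decide what a client should do for an authentication or authorisation
 * check, given the current runlevel and the state of the security service. */
enum auth_action auth_decide(enum runlevel rl, bool service_reachable,
                             bool service_can_start)
{
    if (rl <= RL_SINGLE_USER)
        return AUTH_NOT_ENFORCED;
    if (service_reachable)
        return AUTH_ASK_SERVICE;
    if (service_can_start)
        return AUTH_WAIT_RESTART;
    /* Worst case: fail closed so there is never a lapse in security. */
    return AUTH_FAIL;
}
```
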
+   
+
+7 From the Root Filesystem to User Interaction
+----------------------------------------------
+
+   Now that the root filesystem is available, we can start bringing other
+   system components online. This process culminates in an interactive user
+   session.
+
+      1) systemd instructs emdevd to perform another scan of the system devices.
+         With a wider range of drivers now available, (hopefully) all devices
+         will now be detected and initialised.
+      2) systemd will now start working towards reaching a target runlevel.
+         Right now, the system is running at the Minimal runlevel. For the
+         purposes of this document, let's assume that the target runlevel is
+         Networking, and the system will move through the Single-User and
+         Multi-User runlevels to get there.
+      3) In order to reach the Single-User runlevel, the filesystem mounts 
+         specified in /etc/fstab must be performed. The Single-User runlevel
+         defines a script for systemd to execute, which performs the necessary
+         mount operations.
+      4) The Multi-User runlevel is more complex and will require starting a
+         range of services.
+      5) First, the security service, lockdownd, is brought online. This is the
+         pivotal service that converts the system from single-user to multi-user.
+
+
+vim: shiftwidth=3 expandtab