================================================================================
|                           Rosetta Operating System                           |
|                           ~~~~~~~~~~~~~~~~~~~~~~~~                           |
|                               The Boot Process                               |
================================================================================


1 Bootloader
------------

The bootloader loads the kernel executable and an initrd.


2 Kernel
--------

The kernel initialises itself, extracts the bootstrap program from the initrd
and executes it.

The initrd is an EC3 image containing (in most cases) two key items:
 1) A bootstrap executable.
 2) A volume containing the boot filesystem.

This data is stored in several 'tags' within the container:
 * VOLU, CTAB, STAB, and XATR for the boot filesystem volume (ignored by
   the kernel)
 * EXEC for the bootstrap program.

(Technically speaking, the only hard requirement as far as the kernel is
concerned is the EXEC tag. The initrd could contain any number of other
volumes or tags, including none at all.)

The boot filesystem is ignored by the kernel. It is up to the bootstrap
program to make use of it.

The bootstrap program is a static ELF binary in an EXEC tag with an
identifier of 0x555345524C414E44 ("USERLAND" in ASCII).

The key feature of the EXEC tag in an EC3 image is that, for static and flat
binaries, it extracts the information needed to run the executable and stores
it in a special data structure for easy parsing. This allows the reader (the
kernel in this case) to load and run the executable without having to
implement an ELF parser.

Such information includes:
 * The offset and size of the read-write (.data, .bss) and read-exec
   (.text, .rodata) segments both in the file (source) and in virtual
   memory (destination).
 * The entry point address.

The following structure can be found at the beginning of the EXEC tag.
Any *_faddr variables are offsets relative to the beginning of the tag.

   struct ec3_exec_aux {
      uint8_t e_type; // EXEC_ELF, EXEC_FLAT, etc

      union {
         struct {
            uintptr_t rx_segment_faddr, rx_segment_vaddr;
            size_t rx_segment_fsize, rx_segment_vsize;

            uintptr_t rw_segment_faddr, rw_segment_vaddr;
            size_t rw_segment_fsize, rw_segment_vsize;

            uintptr_t entry;
         } i_elf;

         struct {
            uintptr_t base;
            uintptr_t entry;
         } i_flat;
      } e_info;
   };

As long as you aren't reading any volumes, the EC3 image format is simple
enough that finding the EXEC tag and reading its contents is a trivial
operation. This minimises the amount of code needed in the kernel to find
the bootstrap program.

The auxiliary information in the EXEC tag is enough for the kernel to copy
the executable into memory, set the appropriate memory permissions, and
jump to the entry point.


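As a sketch of how a loader might consume the auxiliary structure, the helper
below copies one segment from the tag into destination memory, zero-filling
the tail where vsize exceeds fsize (which is what initialises .bss). The
function name is illustrative, and `dest` stands in for whatever writable
mapping the kernel has set up at the segment's vaddr:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy one segment described by the EXEC tag's auxiliary structure from
 * the tag (file image) into destination memory. The bytes between fsize
 * and vsize have no backing data in the file and must be zeroed. */
static void load_segment(const uint8_t *tag_base, uintptr_t faddr,
                         uint8_t *dest, size_t fsize, size_t vsize)
{
   memcpy(dest, tag_base + faddr, fsize);
   memset(dest + fsize, 0, vsize - fsize);
}
```

The kernel would call this once for the read-exec segment and once for the
read-write segment, set the final page permissions, and then jump to
e_info.i_elf.entry.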
3 Userland Bootstrap
--------------------

The userland bootstrap program (or "userboot") is responsible for making
available the boot filesystem and starting the system management task.

Kernel tasks always have a negative task ID, and the userland bootstrap task
will always be given a task ID of zero. Therefore, the first task spawned by
userboot will always have a task ID of 1.

Once the system management process is started, userboot can (but doesn't HAVE
to) exit. The system management task will automatically become the root of
the task tree.

If userboot exits without spawning any other tasks, the action taken will
depend on the command-line arguments given to the kernel.

Some options include:
 * Shut the system down
 * Restart the system
 * Trigger a kernel panic

In most cases, userboot will remain running, providing the system management
task with access to the boot filesystem until other drivers are online, at
which point the bootstrap program will exit.

In more specialised cases, userboot can remain running for the life of the
system. It can wait for the task it spawns to exit before taking some action.

This is useful for automated testing. The bootstrap program can run a program
that will run the test suite (or could itself be a test suite program), wait
for the tests to finish, and then shut down the system.


4 System Management Task
------------------------

The system management task will be in charge of the system for the entire
time the system is up. It is responsible for starting device drivers and
setting up an environment for the system to carry out its intended purpose
(i.e. handling interactive user sessions).

Of course, the system management task can (and certainly should) delegate
these duties to other system services.

On Rosetta-based systems, system management duties are handled by the systemd
daemon. systemd fulfills a few important roles, including:
 1) managing system services, and restarting them if they fail.
 2) loading and launching executables.
 3) managing the system namespace.

userboot sends commands to systemd to bring up the rest of the userland
environment. During this process, systemd maintains a connection to userboot
to load files from the boot filesystem. You might think that having two tasks
communicate with each other (violating the strict one-way client-server
message flow) would result in deadlocks, but a few key design choices in
userboot and systemd avoid this.

Technically, there is nothing wrong with two tasks waiting on each other, as
long as two THREADS within those tasks don't end up (directly or indirectly)
waiting on each other.

Therefore, to ensure that this principle is not violated:
 1) systemd performs all process-launching activities and request-handling
    activities on separate threads that never wait on each other. When a
    request is received to launch a new process, systemd's request-handler
    thread dispatches the request (and the responsibility to respond to the
    client) to a separate loader thread. This allows systemd to continue
    servicing other requests (including filesystem requests from its own
    loader threads).
 2) userboot performs all system startup activities (including sending
    commands to systemd) and filesystem request-handling activities on
    separate threads that never wait on each other.

Because of this, despite the circular communications between userboot and
systemd, messages between the two tasks still technically only travel in a
single direction when you consider their individual threads:

   userboot[init] -------------------> systemd[req-handler]
          |                                     :
   ═════NO═COMMUNICATION═════                   :  (async task dispatch)
          |                                     v
   userboot[fs-handler] <------------- systemd[launcher]

   key:
      task-name[thread-name]
      --->  Request/reply exchange (the arrow points toward the request
            recipient)
      ...>  Non-blocking action (e.g. scheduling another thread to run)

Technically, systemd[req-handler] schedules systemd[launcher] to run and
doesn't wait on it. Therefore, if userboot[init] sends a request to
systemd[req-handler] to launch a server, it will receive a reply from
systemd[launcher].

Because of the fixed order in which userboot and systemd are started, and
the deterministic assignment of task IDs mentioned in the USERLAND BOOTSTRAP
section, the channels that the two tasks use to communicate with each other
have well-defined locations:

 * userboot always has TID 0, and always hosts the boot filesystem on its
   first channel, giving a tuple of (nd:0, tid:0, chid:0).
 * systemd always has TID 1, and always hosts its system management
   interface on its first channel, giving a tuple of (nd:0, tid:1, chid:0).


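Expressed as code, the two well-known addresses could look like the
following (the struct layout, field widths, and constant names are
assumptions for illustration):

```c
#include <stdint.h>

/* A channel location as a (node, task, channel) tuple. */
struct chan_addr {
   uint32_t nd;   /* node ID */
   uint32_t tid;  /* task ID */
   uint32_t chid; /* channel ID */
};

/* userboot's boot filesystem service */
static const struct chan_addr BOOTFS_ADDR  = { 0, 0, 0 };

/* systemd's system management interface */
static const struct chan_addr SYSMGMT_ADDR = { 0, 1, 0 };
```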
5 From Userboot to the Root Filesystem
--------------------------------------

Now that we are familiar with the inner workings of these two critical tasks,
let's go through the steps taken to bring up the full userland environment:

 1) When userboot starts, it is given (by the kernel) a handle to a pagebuf
    object containing the initrd. userboot maps this pagebuf into its
    address space and mounts the initrd[1].
 2) userboot creates a new task to run the system management service.
    userboot contains just enough ELF-related code to do one of the
    following:
     * if the system management executable is statically-linked, simply copy
       the relevant ELF segments into the new task's address space and
       create a thread that will start running at the executable's entry
       point.
     * if the system management executable is dynamically-linked (the more
       likely scenario), load the dynamic linker[2] into the new task's
       address space and create a new thread that will start running at the
       dynamic linker's entry point.
 3) systemd initialises the system namespace and mounts the boot filesystem
    provided by userboot at '/', temporarily making it the root filesystem.
 4) systemd starts the device manager service, emdevd, and instructs it
    to scan the system devices. This blocks systemd until the scan is
    complete.
 5) In response to a scan command, emdevd uses whatever drivers are
    available in the current root filesystem to find and initialise as many
    devices as possible. Because the boot filesystem only contains the
    drivers needed to mount the root filesystem, this scan will be
    far from complete, but it will be repeated once the real root
    filesystem is available.
 6) Eventually the scan will complete, and emdevd will return control
    back to systemd. At this point, the storage device containing the
    root filesystem has been found and brought online.
 7) emdevd provides a devfs-like interface to all the devices on the
    system. systemd mounts this pseudo-filesystem at '/dev' in the
    system namespace.
 8) systemd starts an instance of the filesystem server, fsd, and provides
    it with three parameters:
     * the path to the device node containing the root filesystem (e.g.
       '/dev/disk0s1')
     * the name of the filesystem format to be mounted (e.g. 'ext2')
     * the mount flags (the root filesystem is always mounted read-only
       during boot; once /etc/fstab is accessible, the root filesystem
       is re-mounted with the flags it specifies)
 9) fsd will load the necessary filesystem driver (e.g. for ext2
    filesystems, fsd will load fs-ext2.so) and mount the filesystem
    on the provided device.
10) systemd mounts the filesystem provided by fsd to the root of
    the system namespace. At this point, the root filesystem is now
    available (albeit read-only for now).

Notes:
 [1] In this case, mounting doesn't involve the system namespace (until
     systemd starts up, there *is* no system namespace), but rather
     userboot creating any data structures it needs to be able to privately
     locate and read files within the boot image.
 [2] Despite being a .so file, the dynamic linker is designed to be a
     self-contained position-independent executable with no external
     dependencies, in order to avoid a chicken-and-egg situation where the
     dynamic linker itself requires a dynamic linker to load. The only
     functionality required to load it (beyond copying its code and data
     into memory) is finding and iterating through the DYNAMIC segment,
     processing any relocation entries contained within.


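The self-relocation described in note [2] can be sketched as follows. A
dependency-free linker only needs RELATIVE-style relocations on itself
(value = load base + addend); the struct below mirrors the ELF64 Elf64_Rela
layout, and the relocation type value is architecture-specific (8 happens to
be R_X86_64_RELATIVE):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-in for an ELF64 Elf64_Rela relocation entry. */
struct rela {
   uint64_t r_offset; /* where to apply, relative to the load base */
   uint64_t r_info;   /* low 32 bits: relocation type */
   int64_t  r_addend;
};

#define R_RELATIVE 8 /* e.g. R_X86_64_RELATIVE; arch-specific */

/* Apply base+addend ("relative") relocations to a loaded image. This is
 * essentially all the processing a self-contained dynamic linker needs
 * to perform on itself before its own C code can run. */
static void apply_relative_relocs(uint8_t *base,
                                  const struct rela *r, size_t count)
{
   for (size_t i = 0; i < count; i++) {
      if ((r[i].r_info & 0xffffffff) == R_RELATIVE) {
         uint64_t *where = (uint64_t *)(base + r[i].r_offset);
         *where = (uint64_t)(uintptr_t)base + (uint64_t)r[i].r_addend;
      }
   }
}
```

In the real linker, the table's location and length would come from the
DT_RELA and DT_RELASZ entries found by walking the DYNAMIC segment.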
6 Runlevels
-----------

The state of the system, and what functionality the system has, depends on
which services are running. For example:
 * without emdevd or fsd, no filesystems are available.
 * without lockdownd, user authentication and authorisation is not
   available.
 * without airportd, network connectivity is not available.
 * without seatd, multiplexing of peripherals between multiple user
   sessions is not available.
 * without sessiond, user sessions are not available.
... and so on.

Different sets of services can be brought online to tailor the available
functionality. Under systemd, these sets of services are called runlevels.
Runlevels are hierarchical, with higher runlevels building upon the
functionality provided by lower runlevels. As the runlevel increases, the
number of system services running on the machine increases.


6.1 Pre-defined Runlevels
~~~~~~~~~~~~~~~~~~~~~~~~~

Rosetta has a range of pre-defined runlevels:
 * Off:
    - Instructing systemd to move to this runlevel will shut the system down.
 * Minimal:
    - Only the root filesystem is available, and it is read-only.
    - All device drivers are loaded, and all devices are visible.
    - All network interfaces are down, and no socket I/O is possible.
    - The security service is offline, so no authentication or authorisation
      checks can be performed, and the interactive user is effectively root.
    - Neither the session nor seat managers are online, so only one session
      is supported.
    - A basic console and shell are started to allow the user to interact
      with the system.
 * Single-User: Same as Minimal, except:
    - All filesystem mounts prescribed by /etc/fstab are performed.
 * Multi-User: Same as Single-User, except:
    - The security service is running, allowing user authentication.
    - System security and permissions are now enforced.
    - The seat and session manager services are running, allowing multiple
      user sessions to be running simultaneously.
    - Instead of dropping straight into a shell, the interactive user is
      presented with a text-based login prompt before their shell is
      launched.
 * Networking: Same as Multi-User, except:
    - The networking service is running, and all network interfaces are
      brought up and configured according to system configuration.
 * Full Mode: Same as Networking, except:
    - The system's display manager is running, allowing for logging in
      and interacting with the system via a graphical user interface.

In most circumstances, the system will be running in one of the runlevels
based on Multi-User. Not only does this enable most of the "usual" system
functionality, but it also enforces user authentication and authorisation.
The lower runlevels are mostly used for system administration and
troubleshooting when there is a problem preventing the system from reaching
a higher runlevel.


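Because the runlevels are hierarchical, they can be modelled as an ordered
enumeration, with the service set at a given level being the accumulation of
every level below it. The encoding and table below are a sketch (the enum
values and table layout are assumptions; only services named in this
document appear):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical ordered encoding of the pre-defined runlevels. */
enum runlevel { RL_OFF, RL_MINIMAL, RL_SINGLE_USER, RL_MULTI_USER,
                RL_NETWORKING, RL_FULL, RL_COUNT };

/* Services each runlevel adds on top of the one below it (illustrative,
 * not exhaustive -- e.g. Single-User adds mounts, not services). */
static const char *const added[RL_COUNT][4] = {
   [RL_MINIMAL]    = { "emdevd", "fsd" },
   [RL_MULTI_USER] = { "lockdownd", "seatd", "sessiond" },
   [RL_NETWORKING] = { "airportd" },
};

/* A service runs at the target runlevel if any level at or below the
 * target lists it. */
static bool service_runs_at(enum runlevel target, const char *name)
{
   for (int lvl = 0; lvl <= (int)target; lvl++)
      for (int i = 0; i < 4 && added[lvl][i]; i++)
         if (strcmp(added[lvl][i], name) == 0)
            return true;
   return false;
}
```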
6.2 How Runlevels Affect Security Enforcement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

User authentication and authorisation depend on the system security service
(lockdownd). Without it, no users can log on to the system, and no permission
checks can be performed. So, how does a system behave when lockdownd isn't
running?

There are a few circumstances where lockdownd may be offline, some
intentional and some unintentional. The system may be booted in Minimal or
Single-User mode; these runlevels don't start lockdownd, as the interactive
user is root by default. Alternatively, lockdownd may crash while running on
a multi-user system.

So if you are an application or service running on a Rosetta system, and your
attempt to connect to the security service fails because the service has
stopped working, or was never running in the first place, what do you do?

The system management service keeps track of what runlevel the system is
currently running at, and anyone can contact the service to query this
information. So, you can take action depending on the system runlevel:
 * If the runlevel is Single-User or below, you know that system security
   is not being enforced, so there is no need to contact the security
   service.
 * If the runlevel is Multi-User or higher, you know that system security
   is (or should be) enforced. If the security service cannot be reached
   in this case, you should wait for the system management service to
   (re)start it. In the worst case scenario, where the security service
   cannot be started, all authentication and authorisation actions should
   be presumed to fail, so that there is never a lapse in security.


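The decision rule above boils down to a fail-closed check against the
current runlevel. A minimal sketch, assuming the same ordered runlevel
encoding (the enum values and function name are illustrative):

```c
#include <stdbool.h>

/* Hypothetical ordered encoding of the pre-defined runlevels. */
enum runlevel { RL_OFF, RL_MINIMAL, RL_SINGLE_USER, RL_MULTI_USER,
                RL_NETWORKING, RL_FULL };

/* What to do when the security service is unreachable: below Multi-User,
 * security is deliberately not enforced, so proceed; at Multi-User and
 * above, fail closed -- deny everything rather than allow a lapse in
 * security while lockdownd is being (re)started. */
static bool permit_without_lockdownd(enum runlevel current)
{
   return current <= RL_SINGLE_USER;
}
```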
7 From the Root Filesystem to User Interaction
----------------------------------------------

Now that the root filesystem is available, we can start bringing other
system components online. This process culminates in an interactive user
session.

 1) systemd instructs emdevd to perform another scan of the system devices.
    With a wider range of drivers now available, (hopefully) all devices
    will now be detected and initialised.
 2) systemd will now start working towards reaching a target runlevel.
    Right now, the system is running at the Minimal runlevel. For the
    purposes of this document, let's assume that the target runlevel is
    Networking, and the system will move through the Single-User and Multi-
    User runlevels to get there.
 3) In order to reach the Single-User runlevel, the filesystem mounts
    specified in /etc/fstab must be performed. The Single-User runlevel
    defines a script for systemd to execute, which performs the necessary
    mount operations.
 4) The Multi-User runlevel is more complex and will require starting a
    range of services.
 5) First, the security service, lockdownd, is brought online. This is the
    pivotal service that converts the system from single-user to multi-user.


vim: shiftwidth=3 expandtab