================================================================================
|                           Rosetta Operating System                           |
|                           ~~~~~~~~~~~~~~~~~~~~~~~~                           |
|                               The Boot Process                               |
================================================================================


1 Bootloader
------------

The bootloader loads the kernel executable and an initrd.


2 Kernel
--------

The kernel initialises itself, extracts the bootstrap program from the initrd
and executes it.

The initrd is an EC3 image containing (in most cases) two key items:
 1) A bootstrap executable.
 2) A volume containing the boot filesystem.

This data is stored in several 'tags' within the container:
 * VOLU, CTAB, STAB, and XATR for the boot filesystem volume (ignored by
   the kernel)
 * EXEC for the bootstrap program.

(Technically speaking, the only hard requirement as far as the kernel is
concerned is the EXEC tag. The initrd could contain any number of other
volumes or tags, including none at all.)

The boot filesystem is ignored by the kernel. It is up to the bootstrap
program to make use of it.

The bootstrap program is a static ELF binary in an EXEC tag with an
identifier of 0x555345524C414E44 ("USERLAND" in ASCII).

The key feature of the EXEC tag in an EC3 image is that, for static and flat
binaries, it extracts the information needed to run the executable and stores
it in a special data structure for easy parsing. This allows the reader (the
kernel in this case) to load and run the executable without having to
implement an ELF parser.

Such information includes:
 * The offset and size of the read-write (.data, .bss) and read-exec
   (.text, .rodata) segments both in the file (source) and in virtual
   memory (destination).
 * The entry point address.

The following structure can be found at the beginning of the EXEC tag.
Any *_faddr variables are offsets relative to the beginning of the tag.

   struct ec3_exec_aux {
      uint8_t e_type; // EXEC_ELF, EXEC_FLAT, etc

      union {
         struct {
            uintptr_t rx_segment_faddr, rx_segment_vaddr;
            size_t rx_segment_fsize, rx_segment_vsize;

            uintptr_t rw_segment_faddr, rw_segment_vaddr;
            size_t rw_segment_fsize, rw_segment_vsize;

            uintptr_t entry;
         } i_elf;

         struct {
            uintptr_t base;
            uintptr_t entry;
         } i_flat;
      } e_info;
   };

As long as you aren't reading any volumes, the EC3 image format is simple
enough that finding the EXEC tag and reading its contents is a trivial
operation. This minimises the amount of code needed in the kernel to find
the bootstrap program.

The auxiliary information in the EXEC tag is enough for the kernel to copy
the executable into memory, set the appropriate memory permissions, and
jump to the entry point.


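As a sketch of how a loader might consume the auxiliary structure, the helper
below copies one segment from the tag into destination memory, zero-filling
the tail where vsize exceeds fsize (which is what initialises .bss). The
function name is illustrative, and `dest` stands in for whatever writable
mapping the kernel has set up at the segment's vaddr:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy one segment described by the EXEC tag's auxiliary structure from
 * the tag (file image) into destination memory. The bytes between fsize
 * and vsize have no backing data in the file and must be zeroed. */
static void load_segment(const uint8_t *tag_base, uintptr_t faddr,
                         uint8_t *dest, size_t fsize, size_t vsize)
{
   memcpy(dest, tag_base + faddr, fsize);
   memset(dest + fsize, 0, vsize - fsize);
}
```

The kernel would call this once for the read-exec segment and once for the
read-write segment, set the final page permissions, and then jump to
e_info.i_elf.entry.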
3 Userland Bootstrap
--------------------

The userland bootstrap program (or "userboot") is responsible for making
available the boot filesystem and starting the system management task.

Kernel tasks always have a negative task ID, and the userland bootstrap task
will always be given a task ID of zero. Therefore, the first task spawned by
userboot will always have a task ID of 1.

Once the system management process is started, userboot can (but doesn't HAVE
to) exit. The system management task will automatically become the root of
the task tree.

If userboot exits without spawning any other tasks, the action taken will
depend on the command-line arguments given to the kernel.

Some options include:
 * Shut the system down
 * Restart the system
 * Trigger a kernel panic

In most cases, userboot will remain running, providing the system management
task with access to the boot filesystem until other drivers are online, at
which point the bootstrap program will exit.

In more specialised cases, userboot can remain running for the life of the
system. It can wait for the task it spawns to exit before taking some action.

This is useful for automated testing. The bootstrap program can run a program
that will run the test suite (or could itself be a test suite program), wait
for the tests to finish, and then shut down the system.


4 System Management Task
------------------------

The system management task will be in charge of the system for the entire
time the system is up. It is responsible for starting device drivers and
setting up an environment for the system to carry out its intended purpose
(i.e. handling interactive user sessions).

Of course, the system management task can (and certainly should) delegate
these duties to other system services.

On Rosetta-based systems, system management duties are handled by the systemd
daemon. systemd fulfills a few important roles, including:
 1) managing system services, and restarting them if they fail.
 2) loading and launching executables.
 3) managing the system namespace.

userboot sends commands to systemd to bring up the rest of the userland
environment. During this process, systemd maintains a connection to userboot
to load files from the boot filesystem. You might think that having two tasks
communicate with each other (violating the strict one-way client-server
message flow) would result in deadlocks, but a few key design choices in
userboot and systemd avoid this.

Technically, there is nothing wrong with two tasks waiting on each other, as
long as two THREADS within those tasks don't end up (directly or indirectly)
waiting on each other.

Therefore, to ensure that this principle is not violated:
 1) systemd performs all process-launching activities and request-handling
    activities on separate threads that never wait on each other. When a
    request is received to launch a new process, systemd's request-handler
    thread dispatches the request (and the responsibility to respond to the
    client) to a separate loader thread. This allows systemd to continue
    servicing other requests (including filesystem requests from its own
    loader threads).
 2) userboot performs all system startup activities (including sending
    commands to systemd) and filesystem request-handling activities on
    separate threads that never wait on each other.

Because of this, despite the circular communications between userboot and
systemd, messages between the two tasks still technically only travel in a
single direction when you consider their individual threads:

   userboot[init] -------------------> systemd[req-handler]
          |                                     :
   ═════NO═COMMUNICATION═════                   :  (async task dispatch)
          |                                     v
   userboot[fs-handler] <------------- systemd[launcher]

   key:
      task-name[thread-name]
      --->  Request/reply exchange (the arrow points toward the request
            recipient)
      ...>  Non-blocking action (e.g. scheduling another thread to run)

Technically, systemd[req-handler] schedules systemd[launcher] to run and
doesn't wait on it. Therefore, if userboot[init] sends a request to
systemd[req-handler] to launch a server, it will receive a reply from
systemd[launcher].

Because of the fixed order in which userboot and systemd are started, and
the deterministic assignment of task IDs mentioned in the USERLAND BOOTSTRAP
section, the channels that the two tasks use to communicate with each other
have well-defined locations:

 * userboot always has TID 0, and always hosts the boot filesystem on its
   first channel, giving a tuple of (nd:0, tid:0, chid:0).
 * systemd always has TID 1, and always hosts its system management
   interface on its first channel, giving a tuple of (nd:0, tid:1, chid:0).


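Expressed as code, the two well-known addresses could look like the
following (the struct layout, field widths, and constant names are
assumptions for illustration):

```c
#include <stdint.h>

/* A channel location as a (node, task, channel) tuple. */
struct chan_addr {
   uint32_t nd;   /* node ID */
   uint32_t tid;  /* task ID */
   uint32_t chid; /* channel ID */
};

/* userboot's boot filesystem service */
static const struct chan_addr BOOTFS_ADDR  = { 0, 0, 0 };

/* systemd's system management interface */
static const struct chan_addr SYSMGMT_ADDR = { 0, 1, 0 };
```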
5 From Userboot to the Root Filesystem
--------------------------------------

Now that we are familiar with the inner workings of these two critical tasks,
let's go through the steps taken to bring up the full userland environment:

 1) When userboot starts, it is given (by the kernel) a handle to a pagebuf
    object containing the initrd. userboot maps this pagebuf into its
    address space and mounts the initrd[1].
 2) userboot creates a new task to run the system management service.
    userboot contains just enough ELF-related code to do one of the
    following:
     * if the system management executable is statically-linked, simply copy
       the relevant ELF segments into the new task's address space and
       create a thread that will start running at the executable's entry
       point.
     * if the system management executable is dynamically-linked (the more
       likely scenario), load the dynamic linker[2] into the new task's
       address space and create a new thread that will start running at the
       dynamic linker's entry point.
 3) systemd initialises the system namespace and mounts the boot filesystem
    provided by userboot at '/', temporarily making it the root filesystem.
 4) systemd starts the device manager service, emdevd, and instructs it
    to scan the system devices. This blocks systemd until the scan is
    complete.
 5) In response to a scan command, emdevd uses whatever drivers are
    available in the current root filesystem to find and initialise as many
    devices as possible. Because the boot filesystem only contains the
    drivers needed to mount the root filesystem, this scan will be
    far from complete, but it will be repeated once the real root
    filesystem is available.
 6) Eventually the scan will complete, and emdevd will return control
    back to systemd. At this point, the storage device containing the
    root filesystem has been found and brought online.
 7) emdevd provides a devfs-like interface to all the devices on the
    system. systemd mounts this pseudo-filesystem at '/dev' in the
    system namespace.
 8) systemd starts an instance of the filesystem server, fsd, and provides
    it with three parameters:
     * the path to the device node containing the root filesystem (e.g.
       '/dev/disk0s1')
     * the name of the filesystem format to be mounted (e.g. 'ext2')
     * the mount flags (the root filesystem is always mounted read-only
       during boot; once /etc/fstab is accessible, the root filesystem
       is re-mounted with the flags it specifies)
 9) fsd will load the necessary filesystem driver (e.g. for ext2
    filesystems, fsd will load fs-ext2.so) and mount the filesystem
    on the provided device.
10) systemd mounts the filesystem provided by fsd to the root of
    the system namespace. At this point, the root filesystem is now
    available (albeit read-only for now).

Notes:
 [1] In this case, mounting doesn't involve the system namespace (until
     systemd starts up, there *is* no system namespace), but rather
     userboot creating any data structures it needs to be able to privately
     locate and read files within the boot image.
 [2] Despite being a .so file, the dynamic linker is designed to be a
     self-contained position-independent executable with no external
     dependencies, in order to avoid a chicken-and-egg situation where the
     dynamic linker itself requires a dynamic linker to load. The only
     functionality required to load it (beyond copying its code and data
     into memory) is finding and iterating through the DYNAMIC segment,
     processing any relocation entries contained within.


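The self-relocation described in note [2] can be sketched as follows. A
dependency-free linker only needs RELATIVE-style relocations on itself
(value = load base + addend); the struct below mirrors the ELF64 Elf64_Rela
layout, and the relocation type value is architecture-specific (8 happens to
be R_X86_64_RELATIVE):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-in for an ELF64 Elf64_Rela relocation entry. */
struct rela {
   uint64_t r_offset; /* where to apply, relative to the load base */
   uint64_t r_info;   /* low 32 bits: relocation type */
   int64_t  r_addend;
};

#define R_RELATIVE 8 /* e.g. R_X86_64_RELATIVE; arch-specific */

/* Apply base+addend ("relative") relocations to a loaded image. This is
 * essentially all the processing a self-contained dynamic linker needs
 * to perform on itself before its own C code can run. */
static void apply_relative_relocs(uint8_t *base,
                                  const struct rela *r, size_t count)
{
   for (size_t i = 0; i < count; i++) {
      if ((r[i].r_info & 0xffffffff) == R_RELATIVE) {
         uint64_t *where = (uint64_t *)(base + r[i].r_offset);
         *where = (uint64_t)(uintptr_t)base + (uint64_t)r[i].r_addend;
      }
   }
}
```

In the real linker, the table's location and length would come from the
DT_RELA and DT_RELASZ entries found by walking the DYNAMIC segment.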
6 Runlevels
-----------

The state of the system, and what functionality the system has, depends on
which services are running. For example:
 * without emdevd or fsd, no filesystems are available.
 * without lockdownd, user authentication and authorisation is not
   available.
 * without airportd, network connectivity is not available.
 * without seatd, multiplexing of peripherals between multiple user
   sessions is not available.
 * without sessiond, user sessions are not available.
... and so on.

Different sets of services can be brought online to tailor the available
functionality. Under systemd, these sets of services are called runlevels.
Runlevels are hierarchical, with higher runlevels building upon the
functionality provided by lower runlevels. As the runlevel increases, the
number of system services running on the machine increases.


6.1 Pre-defined Runlevels
~~~~~~~~~~~~~~~~~~~~~~~~~

Rosetta has a range of pre-defined runlevels:
 * Off:
    - Instructing systemd to move to this runlevel will shut the system down.
 * Minimal:
    - Only the root filesystem is available, and it is read-only.
    - All device drivers are loaded, and all devices are visible.
    - All network interfaces are down, and no socket I/O is possible.
    - The security service is offline, so no authentication or authorisation
      checks can be performed, and the interactive user is effectively root.
    - Neither the session nor seat managers are online, so only one session
      is supported.
    - A basic console and shell are started to allow the user to interact
      with the system.
 * Single-User: Same as Minimal, except:
    - All filesystem mounts prescribed by /etc/fstab are performed.
 * Multi-User: Same as Single-User, except:
    - The security service is running, allowing user authentication.
    - System security and permissions are now enforced.
    - The seat and session manager services are running, allowing multiple
      user sessions to be running simultaneously.
    - Instead of dropping straight into a shell, the interactive user is
      presented with a text-based login prompt before their shell is
      launched.
 * Networking: Same as Multi-User, except:
    - The networking service is running, and all network interfaces are
      brought up and configured according to system configuration.
 * Full Mode: Same as Networking, except:
    - The system's display manager is running, allowing for logging in
      and interacting with the system via a graphical user interface.

In most circumstances, the system will be running in one of the runlevels
based on Multi-User. Not only does this enable most of the "usual" system
functionality, but it also enforces user authentication and authorisation.
The lower runlevels are mostly used for system administration and
troubleshooting when there is a problem preventing the system from reaching
a higher runlevel.


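Because the runlevels are hierarchical, they can be modelled as an ordered
enumeration, with the service set at a given level being the accumulation of
every level below it. The encoding and table below are a sketch (the enum
values and table layout are assumptions; only services named in this
document appear):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical ordered encoding of the pre-defined runlevels. */
enum runlevel { RL_OFF, RL_MINIMAL, RL_SINGLE_USER, RL_MULTI_USER,
                RL_NETWORKING, RL_FULL, RL_COUNT };

/* Services each runlevel adds on top of the one below it (illustrative,
 * not exhaustive -- e.g. Single-User adds mounts, not services). */
static const char *const added[RL_COUNT][4] = {
   [RL_MINIMAL]    = { "emdevd", "fsd" },
   [RL_MULTI_USER] = { "lockdownd", "seatd", "sessiond" },
   [RL_NETWORKING] = { "airportd" },
};

/* A service runs at the target runlevel if any level at or below the
 * target lists it. */
static bool service_runs_at(enum runlevel target, const char *name)
{
   for (int lvl = 0; lvl <= (int)target; lvl++)
      for (int i = 0; i < 4 && added[lvl][i]; i++)
         if (strcmp(added[lvl][i], name) == 0)
            return true;
   return false;
}
```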
6.2 How Runlevels Affect Security Enforcement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

User authentication and authorisation depend on the system security service
(lockdownd). Without it, no users can log on to the system, and no permission
checks can be performed. So, how does a system behave when lockdownd isn't
running?

There are a few circumstances where lockdownd may be offline, some
intentional and some unintentional. The system may be booted in Minimal or
Single-User mode; these runlevels don't start lockdownd, as the interactive
user is root by default. Alternatively, lockdownd may crash while running on
a multi-user system.

So if you are an application or service running on a Rosetta system, and your
attempt to connect to the security service fails because the service has
stopped working, or was never running in the first place, what do you do?

The system management service keeps track of what runlevel the system is
currently running at, and anyone can contact the service to query this
information. So, you can take action depending on the system runlevel:
 * If the runlevel is Single-User or below, you know that system security
   is not being enforced, so there is no need to contact the security
   service.
 * If the runlevel is Multi-User or higher, you know that system security
   is (or should be) enforced. If the security service cannot be reached
   in this case, you should wait for the system management service to
   (re)start it. In the worst case scenario, where the security service
   cannot be started, all authentication and authorisation actions should
   be presumed to fail, so that there is never a lapse in security.


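The decision rule above boils down to a fail-closed check against the
current runlevel. A minimal sketch, assuming the same ordered runlevel
encoding (the enum values and function name are illustrative):

```c
#include <stdbool.h>

/* Hypothetical ordered encoding of the pre-defined runlevels. */
enum runlevel { RL_OFF, RL_MINIMAL, RL_SINGLE_USER, RL_MULTI_USER,
                RL_NETWORKING, RL_FULL };

/* What to do when the security service is unreachable: below Multi-User,
 * security is deliberately not enforced, so proceed; at Multi-User and
 * above, fail closed -- deny everything rather than allow a lapse in
 * security while lockdownd is being (re)started. */
static bool permit_without_lockdownd(enum runlevel current)
{
   return current <= RL_SINGLE_USER;
}
```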
7 From the Root Filesystem to User Interaction
----------------------------------------------

Now that the root filesystem is available, we can start bringing other
system components online. This process culminates in an interactive user
session.

 1) systemd instructs emdevd to perform another scan of the system devices.
    With a wider range of drivers now available, (hopefully) all devices
    will now be detected and initialised.
 2) systemd will now start working towards reaching a target runlevel.
    Right now, the system is running at the Minimal runlevel. For the
    purposes of this document, let's assume that the target runlevel is
    Networking, and the system will move through the Single-User and Multi-
    User runlevels to get there.
 3) In order to reach the Single-User runlevel, the filesystem mounts
    specified in /etc/fstab must be performed. The Single-User runlevel
    defines a script for systemd to execute, which performs the necessary
    mount operations.
 4) The Multi-User runlevel is more complex and will require starting a
    range of services.
 5) First, the security service, lockdownd, is brought online. This is the
    pivotal service that converts the system from single-user to multi-user.


vim: shiftwidth=3 expandtab