From adabec8b8f2a0195ca9472a9eddc85e31020a58c Mon Sep 17 00:00:00 2001
From: Max Wash
Date: Sat, 2 Nov 2024 15:09:42 +0000
Subject: [PATCH] add existing documentation

---
 doc/format.txt | 329 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 329 insertions(+)
 create mode 100755 doc/format.txt

diff --git a/doc/format.txt b/doc/format.txt
new file mode 100755
index 0000000..9fa6eca
--- /dev/null
+++ b/doc/format.txt
@@ -0,0 +1,329 @@
+╔══════════════════════════════════════════════════════════════════════════════╗
+║               Elastic, Compressed, Content-Addressed Container               ║
+║               ╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍               ║
+║                          File Format Specification                           ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+
+version: 1.0
+
+1 Introduction
+══════════════
+
+   This section provides a brief introduction to the goals that EC3 is intended
+   to fulfill.
+
+
+   1.1 File Format Purpose and Design Goals
+   ────────────────────────────────────────
+
+   The primary goals of the EC3 image format can be found in its name:
+
+      * Elastic: The format should be adaptable and useful in a wide range of
+        use-cases.
+
+      * Compressed: The format should support compression to reduce file size
+        and increase efficiency, without compromising random access to file
+        data.
+
+      * Content-Addressed: The format should support data de-duplication to
+        further increase storage efficiency.
+
+      * Container: The format should support storing multiple independent
+        filesystems.
+
+   At a low level, EC3 is designed to be a format for storing multiple
+   independent streams of data in a single image, with support for optional
+   features such as encryption and compression.
+
+   On top of this base, EC3 provides facilities for storing multiple whole
+   filesystems within an image file.
+   With support for extended attributes, a directory (or whole filesystem) can
+   be accurately captured within an EC3 image, while compression and chunk-based
+   data de-duplication greatly reduce the amount of disk space required.
+
+
+   1.2 Document Scope
+   ──────────────────
+
+   This document describes the general layout of an EC3 image, and all of the
+   data structures contained within. It provides all of the information required
+   to read and write fully featured container images.
+
+   This document does not describe how to implement any software that can
+   read or write containers, with the exception of describing any algorithms
+   that are used.
+
+
+2 Overview
+══════════
+
+   This section provides a general overview of what an EC3 image is and how it
+   works, along with a preview of some of the internal data structures.
+
+
+   2.1 What Is An EC3 Image?
+   ─────────────────────────
+
+   An EC3 image is a data file that can contain, among other things, a set of
+   zero or more logical filesystems, called volumes. Each volume has its own
+   distinct tree of directories and files, while the actual file data is shared
+   across all volumes within the container.
+
+   An EC3 image is analogous to a traditional disk image containing a logical
+   volume management (LVM) partition. Under an LVM partition scheme, a disk
+   can have multiple "logical" partitions contained within a single "physical"
+   partition. The logical partitions are separate, just like traditional
+   partitions, but they all make use of the same contiguous range of sectors on
+   the disk. Because of this, resizing partitions within an LVM group is as
+   simple as changing the quota of blocks that a particular logical partition
+   is allowed to allocate, and doesn't require physically moving any sectors
+   around.
+
+   EC3 builds upon this concept by employing cross-volume data de-duplication.
+   Every file that is stored within an EC3 image is split into a set of
+   fixed-size, content-addressed chunks.
+   The size of these chunks is constant within a container. A typical chunk size
+   is 32 KB. So, if two files within a container have the same contents, even if
+   those files are in different volumes, the files will reference the same range
+   of chunks. Only one copy of the file data is stored within the container.
+   Even if the two files vary to some degree, as long as at least one chunk's
+   worth of data is identical, some data can still be shared between the files.
+
+   Chunks can also be compressed to further reduce file size. The chunking
+   system provides some additional benefits when compression is in use. Seeking
+   through a file is faster, because you don't have to decompress the entire
+   file to reach the target offset. You can simply skip to the chunk that
+   corresponds to the offset you're looking for. Editing files within a volume
+   is also easier as, again, you only have to decompress and re-write the chunk
+   that has changed.
+
+   Alongside volumes, EC3 images can contain a range of other data, including:
+      * Manifests.
+      * Arbitrary binary blobs.
+      * Executable files.
+      * Digital signatures.
+      * Certificates for digital signature verification.
+
+   In contrast to volumes, these other data types are much simpler. An
+   application can wrap its own binary data within an EC3 image and
+   immediately make use of features like compression, encryption, and digital
+   signature verification.
+
+
+   2.2 Tags: The Core Unit Of Data
+   ───────────────────────────────
+
+   At its most basic level, an EC3 image is just a set of one or more tags.
+   A tag is a contiguous segment of binary data with an associated type and
+   identifier. The contents of a tag can optionally be encrypted and signed.
+   With the exception of the image header and tag table, all data contained
+   within an EC3 image can be found in a tag. The tag table contains
+   information about all of the tags in the image.
+
+3 Types & Units
+═══════════════
+
+   This section describes the fundamental data types used within EC3 data
+   structures, as well as some of the units used throughout this document.
+
+   3.1 Integral Types
+   ──────────────────
+
+   All integer values are stored in big-endian format. All signed integer values
+   are stored in two's-complement format. The following integer types are used:
+
+      Name        Size                    Sign
+      ───────────────────────────────────────────────
+      uint8       8 bits (1 byte)         Unsigned
+      uint16      16 bits (2 bytes)       Unsigned
+      uint32      32 bits (4 bytes)       Unsigned
+      uint64      64 bits (8 bytes)       Unsigned
+      int8        8 bits (1 byte)         Signed
+      int16       16 bits (2 bytes)       Signed
+      int32       32 bits (4 bytes)       Signed
+      int64       64 bits (8 bytes)       Signed
+
+
+   3.2 String Types
+   ────────────────
+
+   All strings are stored as UTF-8 encoded Unicode text with a trailing null
+   terminator byte.
+
+
+   3.3 Storage Size Units
+   ──────────────────────
+
+   Throughout this document, any reference to kilobytes, megabytes, etc. refers
+   to the base-2 units, rather than the base-10 units. For example, 1 kilobyte
+   (or 1 KB) is equal to 1024 bytes (rather than 1000 bytes).
+
+
+4 Algorithms
+════════════
+
+   EC3 uses a range of algorithms. A selection of hashing algorithms is used
+   for fast data lookup and for ensuring data integrity.
+
+
+   4.1 Fast Hash
+   ─────────────
+
+   The Fast Hash algorithm is optimised for hashing string data. It is intended
+   for use in string-based hashmaps. The algorithm used for this purpose is
+   the Fowler-Noll-Vo FNV-1 hashing algorithm, with a 64-bit digest size.
+
+   The implementation of this algorithm can be found elsewhere, but the integer
+   constants that EC3 uses to calculate hashes are provided here:
+
+      * Offset Basis: 0xCBF29CE484222325
+      * Prime: 0x100000001B3
+
+
+   4.2 Slow Hash
+   ─────────────
+
+   The Slow Hash function is optimised for minimal chance of hash collisions.
+   It is intended to generate the content hashes used to uniquely identify data
+   chunks.
+   The algorithm used for this purpose is SHA-3 with a 256-bit digest size.
+
+
+   4.3 Checksum
+   ────────────
+
+   The Checksum algorithm is used to validate the contents of an EC3 image
+   and detect any corruption. The algorithm used for this purpose is the CRC32
+   algorithm with a 32-bit digest size.
+
+   Note that it is not intended to defend against intentional modification of an
+   image, as this can be easily hidden by re-calculating the checksum. EC3
+   provides other features to defend against malicious modifications.
+
+
+3 Image Header
+══════════════
+
+   The Image Header can be found at the beginning of every EC3 image file.
+   It provides critical information about the rest of the file, including the
+   version of the file format that the file uses, and the location and size of
+   the tag table. The header also includes two magic numbers:
+
+      * A signature to validate that the file is in fact an EC3 image. This
+        must have the value 0x45433358 ('EC3X' in ASCII).
+      * An application magic number that is reserved for use by the creator of
+        the image.
+
+
+   3.1 Image Header Layout
+   ───────────────────────
+
+      Offset    Description                Type
+      ─────────────────────────────────────────────
+      0x00      Signature                  uint32
+      0x04      Format Version             uint16
+      0x06      Chunk Size                 uint16
+      0x08      Tag Table Offset           uint64
+      0x10      Tag Count                  uint64
+      0x18      Application Magic          uint64
+
+   3.1.1 Signature
+      The Signature is found at the very beginning of the image file. Like
+      all integer values, it is stored in big-endian format. It always has the
+      value 0x45433358 (or 'EC3X' in ASCII).
+
+   3.1.2 Format Version
+      This specifies which version of the EC3 image file format
+      the rest of the file conforms to. Only the Signature and Format Version
+      header items are guaranteed to be the same across all format versions.
+      The format version is encoded as a 16-bit integer, with the following
+      format:
+         0               1
+         0               6
+         XXXXXXXXYYYYYYYY
+
+      Where X encodes the major number of the format version, and Y encodes
+      the minor number of the format version. For example, version 3.2 would
+      be encoded as 0x0302.
+
+   3.1.3 Chunk Size
+      This specifies the size of all data chunks stored within the image, before
+      any transformation operations such as compression or encryption are
+      applied.
+
+      The following chunk size values are defined:
+
+         Header Value      Chunk Size (bytes)      Chunk Size (kilobytes)
+         ────────────────────────────────────────────────────────────────
+         0x00              16,384                  16
+         0x01              32,768                  32
+         0x02              65,536                  64
+         0x03              131,072                 128
+         0x04              262,144                 256
+         0x05              524,288                 512
+         0x06              1,048,576               1,024
+
+   3.1.4 Tag Table Offset
+      This specifies the offset in bytes from the beginning of the image file
+      to the beginning of the tag table.
+
+   3.1.5 Tag Count
+      This specifies the number of entries in the tag table.
+
+   3.1.6 Application Magic
+      This is an application-defined value. The creator of an EC3 image can
+      set this to any arbitrary value. Any generic EC3 manipulation tool should
+      preserve the value of this field and, if it supports creating EC3
+      images, allow the user to specify the value to store in this field.
+
+
+4 Tags
+══════
+
+   4.1 The Tag Table
+   ─────────────────
+
+   4.2 Tag Types
+   ─────────────
+
+
+5 Manifest
+══════════
+
+6 Volumes
+═════════
+
+   6.1 Filesystem Tree
+   ───────────────────
+
+
+   6.2 Clusters
+   ────────────
+
+
+   6.3 String Table
+   ────────────────
+
+
+   6.4 Extended Attributes
+   ───────────────────────
+
+
+7 Binary Blobs
+══════════════
+
+
+8 Embedded Executables
+══════════════════════
+
+
+9 Signature Verification
+════════════════════════
+
+
+10 Encryption
+═════════════
+
+
+vim: shiftwidth=3 expandtab