add existing documentation

2024-11-02 15:09:42 +00:00
parent f68ac974d8
commit adabec8b8f

doc/format.txt Executable file

@@ -0,0 +1,329 @@
╔══════════════════════════════════════════════════════════════════════════════╗
║ Elastic, Compressed, Content-Addressed Container ║
║ ╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍ ║
║ File Format Specification ║
╚══════════════════════════════════════════════════════════════════════════════╝
version: 1.0
1 Introduction
══════════════
This section provides a brief introduction to the goals that EC3 is intended
to fulfill.
1.1 File Format Purpose and Design Goals
────────────────────────────────────────
The primary goals of the EC3 image format can be found in its name:
* Elastic: The format should be adaptable and useful in a wide range of
use cases.
* Compressed: The format should support compression to reduce file size
and increase efficiency, without compromising random access to file
data.
* Content-Addressed: The format should support data de-duplication to
further increase storage efficiency.
* Container: The format should support storing multiple independent
filesystems.
At a low-level, EC3 is designed to be a format for storing multiple
independent streams of data in a single image, with support for optional
features such as encryption and compression.
On top of this base, EC3 provides facilities for storing multiple whole
filesystems within an image file. With support for extended attributes,
a directory (or whole filesystem) can be accurately captured within an
EC3 image, while compression and chunk-based data de-duplication greatly
reduce the amount of disk space required.
1.2 Document Scope
──────────────────
This document describes the general layout of an EC3 image, and all of the
data structures contained within. It provides all of the information required
to read and write fully-featured container images.
This document does not describe how to implement any software that can
read or write containers, with the exception of describing any algorithms
that are used.
2 Overview
══════════
This section provides a general overview of what an EC3 image is, how it
works, and a preview of some of the internal data structures.
2.1 What Is An EC3 Image?
─────────────────────────
An EC3 image is a data file that can contain, among other things, a set of
zero or more logical filesystems, called volumes. Each volume has its own
distinct tree of directories and files, while the actual file data is shared
across all volumes within the container.
An EC3 image is analogous to a traditional disk image containing a logical
volume management (LVM) partition. Under an LVM partition scheme, a disk
can have multiple "logical" partitions contained within a single "physical"
partition. The logical partitions are separate, just like traditional
partitions, but they all make use of the same contiguous range of sectors on
the disk. Because of this, resizing partitions within an LVM group is as
simple as changing the quota of blocks that a particular logical partition
is allowed to allocate, and doesn't require physically moving any sectors
around.
EC3 builds upon this concept by employing cross-volume data de-duplication.
Every file that is stored within an EC3 image is split into a set of fixed-
size, content-addressed chunks. The size of these chunks is constant within
a container. A typical chunk size would be 32 KB. So, if two files within
a container have the same contents, even if those files are in different
volumes, the files will reference the same range of chunks. Only one copy
of the file data is stored within the container. Even if the two files vary
to some degree, as long as at least one chunk's worth of data is identical,
some data can still be shared between the files.
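The de-duplication described above can be sketched as follows. This is an
illustration only, not part of the format: the in-memory `store` dictionary,
the function names, and the handling of a short final chunk (kept as-is,
without padding) are all assumptions. The content hash is SHA-3 with a
256-bit digest, as described in the Slow Hash section.

```python
import hashlib

CHUNK_SIZE = 32 * 1024  # assumed: the typical 32 KB chunk size mentioned above


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks; the last chunk may be shorter (assumption)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_file(store: dict[bytes, bytes], data: bytes) -> list[bytes]:
    """Add a file's chunks to a shared store and return the file's chunk hashes."""
    refs = []
    for chunk in split_into_chunks(data):
        digest = hashlib.sha3_256(chunk).digest()
        store.setdefault(digest, chunk)  # an identical chunk is stored only once
        refs.append(digest)
    return refs
```

Two files that share their first 32 KB would then reference the same first
chunk, and the store would hold three chunks rather than four.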
Chunks can also be compressed to further reduce file size. The chunking
system provides some additional benefits when compression is in use. Seeking
through a file is more performant, as you don't have to decompress the entire
file to reach the target offset. You can simply skip to the chunk that
corresponds to the offset you're looking for. Editing files within a volume
is also easier as, again, you only have to decompress and re-write the chunk
that has changed.
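As a small sketch of the seeking behaviour described above, the chunk that
holds a given file offset can be found with a single division. The function
name and the 32 KB default are assumptions made for illustration.

```python
def locate(offset: int, chunk_size: int = 32 * 1024) -> tuple[int, int]:
    """Return (chunk index, offset within that chunk) for a byte offset in a file."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, chunk_size)
```

Only the chunk at the returned index needs to be decompressed to read or
modify the byte at that offset.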
Alongside volumes, EC3 images can contain a range of other data, including:
* Manifests.
* Arbitrary binary blobs.
* Executable files.
* Digital signatures.
* Certificates for digital signature verification.
In contrast to volumes, these other data types are much simpler. An
application can wrap its own binary data within an EC3 image and
immediately make use of features like compression, encryption, and digital
signature verification.
2.2 Tags: The Core Unit Of Data
───────────────────────────────
At its most basic level, an EC3 image is just a set of one or more tags.
A tag is a contiguous segment of binary data with an associated type and
identifier. The contents of a tag can be optionally encrypted and signed.
With the exception of the image header and tag table, all data contained
within an EC3 image can be found in a tag. The tag table contains
information about all of the tags in the image.
3 Types & Units
═══════════════
This section describes the fundamental data types used within EC3 data
structures, as well as some of the units used throughout this document.
3.1 Integral Types
──────────────────
All integer values are stored in big-endian format. All signed integer values
are stored in two's-complement format. The following integer types are used:
Name      Size                 Sign
───────────────────────────────────────────────
uint8     8 bits (1 byte)      Unsigned
uint16    16 bits (2 bytes)    Unsigned
uint32    32 bits (4 bytes)    Unsigned
uint64    64 bits (8 bytes)    Unsigned
int8      8 bits (1 byte)      Signed
int16     16 bits (2 bytes)    Signed
int32     32 bits (4 bytes)    Signed
int64     64 bits (8 bytes)    Signed
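For illustration, Python's standard `struct` module can read and write these
types directly. The values packed below are arbitrary examples and do not
correspond to any EC3 structure.

```python
import struct

# ">" selects big-endian byte order with no padding.
# B/H/I/Q are uint8/uint16/uint32/uint64; b/h/i/q are the signed equivalents.
packed = struct.pack(">HIq", 0x0102, 0xDEADBEEF, -2)

# The same format string reads the values back.
u16, u32, i64 = struct.unpack(">HIq", packed)
```

The signed value -2 is written as the two's-complement byte sequence
FF FF FF FF FF FF FF FE.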
3.2 String Types
────────────────
All strings are stored in UTF-8 Unicode format with a trailing null
terminator byte.
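A minimal sketch of reading and writing strings in this encoding follows.
The function names are hypothetical, and rejecting embedded null characters
on write is an assumption, since a null byte inside the string would be
indistinguishable from the terminator.

```python
def encode_string(text: str) -> bytes:
    """Encode text as UTF-8 followed by a single null terminator byte."""
    if "\x00" in text:
        raise ValueError("string must not contain an embedded null character")
    return text.encode("utf-8") + b"\x00"


def decode_string(buf: bytes, offset: int = 0) -> tuple[str, int]:
    """Decode the string starting at offset; return it and the offset just past it."""
    end = buf.index(b"\x00", offset)  # raises ValueError if no terminator is found
    return buf[offset:end].decode("utf-8"), end + 1
```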
3.3 Storage Size Units
──────────────────────
Throughout this document, any reference to kilobytes, megabytes, etc. refers
to the base-2 units, rather than the base-10 units. For example, 1 kilobyte
(or 1 KB) is equal to 1024 bytes (rather than 1000 bytes).
4 Algorithms
════════════
EC3 uses a range of algorithms. A selection of hashing algorithms is used
for fast data lookup and for ensuring data integrity.
4.1 Fast Hash
─────────────
The Fast Hash algorithm is optimised for hashing string data. It is intended
for use in string-based hashmaps. The algorithm used for this purpose is
the Fowler-Noll-Vo FNV-1 hashing algorithm, with a 64-bit digest size.
The implementation of this algorithm can be found elsewhere, but the integer
constants used to calculate hashes used by EC3 are provided here:
* Offset Basis: 0xCBF29CE484222325
* Prime: 0x100000001B3
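For reference, a sketch of 64-bit FNV-1 using the constants above. The
function name is hypothetical, and hashing a string's UTF-8 bytes without the
null terminator is an assumption; this document does not state which bytes
are hashed.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fast_hash(data: bytes) -> int:
    """64-bit FNV-1: for each byte, multiply by the prime, then XOR the byte in."""
    digest = FNV_OFFSET_BASIS
    for byte in data:
        digest = (digest * FNV_PRIME) & MASK_64  # keep the product within 64 bits
        digest ^= byte
    return digest
```

Note the order of operations: FNV-1 multiplies before the XOR, whereas the
related FNV-1a variant performs the XOR first and produces different digests.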
4.2 Slow Hash
─────────────
The Slow Hash function is optimised for minimal chance of hash collisions.
It is intended to generate the content hashes used to uniquely identify data
chunks. The algorithm used for this purpose is the SHA-3 algorithm with a
256-bit digest size.
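In Python, for example, this digest is available from the standard `hashlib`
module. The wrapper name below is hypothetical.

```python
import hashlib


def slow_hash(chunk: bytes) -> bytes:
    """Return the 32-byte (256-bit) SHA-3 digest of a chunk's contents."""
    return hashlib.sha3_256(chunk).digest()
```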
4.3 Checksum
────────────
The Checksum algorithm is used to validate the contents of an EC3 image
and detect any corruption. The algorithm used for this purpose is the CRC32
algorithm with a 32-bit digest size.
Note that it is not intended to defend against intentional modification of an
image, as this can be easily hidden by re-calculating the checksum. EC3
provides other features to defend against malicious modifications.
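A sketch of computing such a checksum is shown below. It assumes that
"CRC32" means the common CRC-32 variant implemented by zlib (the one used by
gzip and PNG); this document does not name the polynomial, nor which bytes of
the image the checksum covers. The function name is hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    """Return the CRC-32 of data as an unsigned 32-bit integer (zlib variant assumed)."""
    return zlib.crc32(data) & 0xFFFFFFFF
```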
3 Image Header
══════════════
The Image Header can be found at the beginning of every EC3 image file.
It provides critical information about the rest of the file, including the
version of the file format that the file uses, and the location and size of
the tag table. The header also includes two magic numbers:
* A signature to validate that the file is in fact an EC3 image. This
must have the value 0x45433358 ('EC3X' in ASCII).
* An application magic number that is reserved for use by the creator of
the image.
3.1 Image Header Layout
───────────────────────
Offset    Description          Type
─────────────────────────────────────────────
0x00      Signature            uint32
0x04      Format Version       uint16
0x06      Chunk Size           uint16
0x08      Tag Table Offset     uint64
0x10      Tag Count            uint64
0x18      Application Magic    uint64
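The layout above occupies 32 bytes and could be read as in the following
sketch. The class, field, and function names are hypothetical; only the field
order, types, and signature value come from the table and the text above.

```python
import struct
from typing import NamedTuple

# Big-endian: uint32, uint16, uint16, then three uint64 fields.
HEADER_FORMAT = ">IHHQQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 32 bytes
SIGNATURE = 0x45433358  # 'EC3X' in ASCII


class ImageHeader(NamedTuple):
    signature: int
    format_version: int
    chunk_size: int
    tag_table_offset: int
    tag_count: int
    application_magic: int


def parse_header(buf: bytes) -> ImageHeader:
    """Unpack the image header from the start of buf and validate the signature."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer too short for an EC3 image header")
    header = ImageHeader(*struct.unpack_from(HEADER_FORMAT, buf))
    if header.signature != SIGNATURE:
        raise ValueError("not an EC3 image: bad signature")
    return header
```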
3.1.1 Signature
The Signature is found at the very beginning of the image file. Like
all integer values, it is stored in big-endian format. It always has the
value 0x45433358 (or 'EC3X' in ASCII).
3.1.2 Format Version
This specifies which version of the EC3 Image file format
the rest of the file conforms to. Only the Signature and Format Version
header items are guaranteed to be the same across all format versions.
The format version is encoded as a 16-bit integer, with the following
layout (most significant bit on the left):
0               1
0               6
XXXXXXXXYYYYYYYY
Where X encodes the major number of the format version, and Y encodes
the minor number of the format version. For example, version 3.2 would
be encoded as 0x0302.
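The encoding can be sketched as a pair of helper functions. The names are
hypothetical; the major number occupies the high byte and the minor number
the low byte, as in the 0x0302 example.

```python
def encode_version(major: int, minor: int) -> int:
    """Pack a major.minor version into the 16-bit Format Version value."""
    if not (0 <= major <= 0xFF and 0 <= minor <= 0xFF):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << 8) | minor


def decode_version(value: int) -> tuple[int, int]:
    """Split a 16-bit Format Version value into (major, minor)."""
    return (value >> 8) & 0xFF, value & 0xFF
```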
3.1.3 Chunk Size
This specifies the size of all data chunks stored within the image, before
any transformation operations such as compression or encryption are
applied.
The following chunk size values are defined:
Header Value    Chunk Size (bytes)    Chunk Size (kilobytes)
────────────────────────────────────────────────────────────────
0x00            16,384                16
0x01            32,768                32
0x02            65,536                64
0x03            131,072               128
0x04            262,144               256
0x05            524,288               512
0x06            1,048,576             1,024
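Each row in the table doubles the previous chunk size, so the mapping can be
computed rather than looked up. The function name is hypothetical, and
rejecting header values outside the table is an assumption about how a
reader might treat undefined values.

```python
def chunk_size_bytes(header_value: int) -> int:
    """Convert the Chunk Size header value into a chunk size in bytes."""
    if not 0x00 <= header_value <= 0x06:
        raise ValueError(f"undefined chunk size value: {header_value:#04x}")
    return (16 * 1024) << header_value  # 16 KB doubled once per step
```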
3.1.4 Tag Table Offset
This specifies the offset in bytes from the beginning of the image file
to the beginning of the tag table.
3.1.5 Tag Count
This specifies the number of entries in the tag table.
3.1.6 Application Magic
This is an application-defined value. The creator of an EC3 image can
set this to any arbitrary value. Any generic EC3 manipulation tools should
preserve the value of this field and, if the tool supports creating EC3
images, allow the user to specify the value to store in this field.
4 Tags
══════
4.1 The Tag Table
─────────────────
4.2 Tag Types
─────────────
5 Manifest
══════════
6 Volumes
═════════
6.1 Filesystem Tree
───────────────────
6.2 Clusters
────────────
6.3 String Table
────────────────
6.4 Extended Attributes
───────────────────────
7 Binary Blobs
══════════════
8 Embedded Executables
══════════════════════
9 Signature Verification
════════════════════════
10 Encryption
═════════════
vim: shiftwidth=3 expandtab