+------------------------------------------------------------------------------+
| Elastic, Compressed, Content-Addressed Container |
| ................................................ |
| File Format Specification |
+------------------------------------------------------------------------------+
version: 1.0
1 Introduction
==============
This section provides a brief introduction to the goals that EC3 is intended
to fulfill.
1.1 File Format Purpose and Design Goals
----------------------------------------
The primary goals of the EC3 image format can be found in its name:
* Elastic: The format should be adaptable and useful in a wide range of
use cases.
* Compressed: The format should support compression to reduce file size
and increase efficiency, without compromising random access to file
data.
* Content-Addressed: The format should support data de-duplication to
further increase storage efficiency.
* Container: The format should support storing multiple independent
filesystems.
At a low level, EC3 is designed to be a format for storing multiple
independent streams of data in a single image, with support for optional
features such as encryption and compression.
On top of this base, EC3 provides facilities for storing multiple whole
filesystems within an image file. With support for extended attributes,
a directory (or whole filesystem) can be accurately captured within an
EC3 image, while compression and cluster-based data de-duplication greatly
reduce the amount of disk space required.
1.2 Document Scope
------------------
This document describes the general layout of an EC3 image, and all of the
data structures contained within. It provides all of the information required
to read and write fully-featured container images.
This document does not describe how to implement any software that can
read or write containers, with the exception of describing any algorithms
that are used.
1.3 Terminology
---------------
Several terms have particular meaning in the context of EC3. Those terms
and their meaning are listed here.
1.3.1 Image
An Image is any EC3 file. An Image contains one or more Tags containing
binary data.
1.3.2 Tag
A Tag is a contiguous range of binary data, with an associated type and
identifier. The type of a Tag determines the format of the data and how
it should be interpreted, while the identifier can be used to distinguish
one Tag from another.
1.3.3 Container
A Container refers to an EC3 file that contains one or more Volumes. It
is analogous to a storage device that contains one or more formatted
partitions. Containers represent a subset of Images: while all Containers
are Images, not all Images are Containers.
1.3.4 Volume
A Volume is a structured collection of logical files and directories
stored within a Container. It is analogous to a partition of a storage
device. The data that makes up a Volume is stored across a set of Tags
within an Image.
1.3.5 Image Key
The Image Key is the symmetric cryptographic key used to encrypt and
decrypt data within an Image.
1.3.6 Image Certificate
The Image Certificate is a cryptographic public key and certificate that
is embedded within an Image, and is used for digital signature
verification.
1.3.7 Image Signature
The Image Signature is the cryptographic signature that is calculated
from the data stored in the Image, and stored in a dedicated Tag.
2 Overview
==========
This section provides a general overview of what an EC3 image is, how it
works, and a preview of some of the internal data structures.
2.1 What Is An EC3 Image?
-------------------------
An EC3 image is a data file that can contain, among other things, a set of
zero or more logical filesystems, called volumes. Each volume has its own
distinct tree of directories and files, while the actual file data is shared
across all volumes within the container.
An EC3 image is analogous to a traditional disk image containing a logical
volume management (LVM) partition. Under an LVM partition scheme, a disk
can have multiple "logical" partitions contained within a single "physical"
partition. The logical partitions are separate, just like traditional
partitions, but they all make use of the same contiguous range of sectors on
the disk. Because of this, resizing partitions within an LVM group is as
simple as changing the quota of blocks that a particular logical partition
is allowed to allocate, and doesn't require physically moving any sectors
around.
EC3 builds upon this concept by employing cross-volume data de-duplication.
Every file that is stored within an EC3 image is split into a set of fixed-
size, content-addressed clusters. The size of these clusters is constant
within a container. A typical cluster size would be 32 KB. So, if two files
within a container have the same contents, even if those files are in
different volumes, the files will reference the same range of clusters. Only
one copy of the file data is stored within the container. Even if the two
files vary to some degree, as long as at least one cluster's worth of data is
identical, some data can still be shared between the files.
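As a rough illustration of the de-duplication described above, the sketch
below splits file data into fixed-size clusters and stores each distinct
cluster only once, keyed by its content hash. The function name store_file,
the in-memory dict used as a cluster store, and the zero-padding of a short
final cluster are assumptions made for this example; they are not defined by
the format.

```python
import hashlib

CLUSTER_SIZE = 32 * 1024  # 32 KB, the typical size mentioned above


def store_file(data: bytes, store: dict) -> list:
    """Split data into clusters and return the list of cluster hashes."""
    hashes = []
    for start in range(0, len(data), CLUSTER_SIZE):
        cluster = data[start:start + CLUSTER_SIZE]
        # Assumption: a short final cluster is zero-padded to full size.
        cluster = cluster.ljust(CLUSTER_SIZE, b"\x00")
        digest = hashlib.sha3_256(cluster).digest()
        # Identical clusters map to the same key and are stored once.
        store.setdefault(digest, cluster)
        hashes.append(digest)
    return hashes


store = {}
file_a = b"A" * CLUSTER_SIZE + b"B" * CLUSTER_SIZE
file_b = b"A" * CLUSTER_SIZE + b"C" * CLUSTER_SIZE
refs_a = store_file(file_a, store)
refs_b = store_file(file_b, store)
```

Here the two files share their first cluster, so the store ends up holding
three clusters rather than four.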
Clusters can also be compressed to further reduce file size. The clustering
system provides some additional benefits when compression is in use. Seeking
through a file is more performant, as you don't have to decompress the entire
file to reach the target offset. You can simply skip to the cluster that
corresponds to the offset you're looking for. Editing files within a volume
is also easier as, again, you only have to decompress and re-write the
cluster that has changed.
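The seek described above comes down to one division. The sketch below maps a
byte offset within a file to the index of the cluster that holds it and the
position inside that cluster. The helper name locate is hypothetical, and the
32 KB cluster size is only the typical value mentioned earlier.

```python
CLUSTER_SIZE = 32 * 1024  # example value; the real size comes from the header


def locate(offset: int, cluster_size: int = CLUSTER_SIZE) -> tuple:
    """Return (cluster index, offset within that cluster) for a file offset."""
    if offset < 0:
        raise ValueError("offset must not be negative")
    return divmod(offset, cluster_size)


index, inner = locate(100_000)
```

Only the cluster at that index needs to be decompressed to read the byte at
the requested offset.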
Alongside volumes, EC3 images can contain a range of other data, including:
* Manifests.
* Arbitrary binary blobs.
* Executable files.
* Digital signatures.
* Certificates for digital signature verification.
In contrast to volumes, these other data types are much simpler. An
application can wrap its own binary data within an EC3 image and
immediately make use of features like compression, encryption, and digital
signature verification.
2.2 Tags: The Core Unit Of Data
-------------------------------
At its most basic level, an EC3 image is just a set of one or more tags.
A tag is a contiguous segment of binary data with an associated type and
identifier. The contents of a tag can be optionally encrypted and signed.
With the exception of the image header and tag table, all data contained
within an EC3 image can be found in a tag. The tag table contains
information about all of the tags in the image.
3 Types & Units
===============
This section describes the fundamental data types used within EC3 data
structures, as well as some of the units used throughout this document.
3.1 Integral Types
------------------
All integer values are stored in big-endian format. All signed integer values
are stored in two's-complement format. The following integer types are used:
Name      Size                   Sign
-----------------------------------------------
uint8     8 bits (1 byte)        Unsigned
uint16    16 bits (2 bytes)      Unsigned
uint32    32 bits (4 bytes)      Unsigned
uint64    64 bits (8 bytes)      Unsigned
int8      8 bits (1 byte)        Signed
int16     16 bits (2 bytes)      Signed
int32     32 bits (4 bytes)      Signed
int64     64 bits (8 bytes)      Signed
3.2 String Types
----------------
All strings are stored in UTF-8 Unicode format with a trailing null
terminator byte.
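A minimal sketch of reading one such string from a buffer follows. The
function name read_string and its return convention (the decoded text plus
the offset of the byte after the terminator) are assumptions for this
example.

```python
def read_string(buf: bytes, offset: int = 0) -> tuple:
    """Decode a null-terminated UTF-8 string starting at offset.

    Returns the decoded text and the offset just past the terminator.
    Raises ValueError if no terminator is found.
    """
    end = buf.index(b"\x00", offset)
    return buf[offset:end].decode("utf-8"), end + 1


data = b"caf\xc3\xa9\x00next\x00"
first, next_offset = read_string(data)
second, _ = read_string(data, next_offset)
```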
3.3 Storage Size Units
----------------------
Throughout this document, any reference to kilobytes, megabytes, etc. refers
to the base-2 units, rather than the base-10 units. For example, 1 kilobyte
(or 1 KB) is equal to 1024 bytes (rather than 1000 bytes).
4 Algorithms
============
EC3 uses a range of algorithms. A selection of hashing algorithms are used
for fast data lookup and for ensuring data integrity.
4.1 Fast Hash
-------------
The Fast Hash algorithm is optimised for hashing string data. It is intended
for use in string-based hashmaps. The algorithm used for this purpose is
the Fowler-Noll-Vo FNV-1 hashing algorithm, with a 64-bit digest size.
The implementation of this algorithm can be found elsewhere, but the integer
constants used to calculate hashes used by EC3 are provided here:
* Offset Basis: 0xCBF29CE484222325
* Prime: 0x100000001B3
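For reference, a compact sketch of 64-bit FNV-1 using the constants above is
shown here. The function name fast_hash is hypothetical, and hashing the
UTF-8 bytes of a string without its null terminator is an assumption; this
document does not say which bytes are fed to the hash.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fast_hash(data: bytes) -> int:
    """64-bit FNV-1: multiply by the prime, then XOR in each byte."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h = (h * FNV_PRIME) & MASK_64  # keep the result within 64 bits
        h ^= byte
    return h


# Assumption: strings are hashed as UTF-8 bytes without the terminator.
digest = fast_hash("a".encode("utf-8"))
```

Note the order of operations: FNV-1 multiplies before the XOR, whereas the
FNV-1a variant does the XOR first and produces different digests.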
4.2 Slow Hash
-------------
The Slow Hash function is optimised for minimal chance of hash collisions.
It is intended to generate the content hashes used to uniquely identify data
clusters. The algorithm used for this purpose is the SHA-3 algorithm with a
256-bit digest size.
4.3 Checksum
------------
The Checksum algorithm is used to validate the contents of an EC3 image
and detect any corruption. The algorithm used for this purpose is the CRC32
algorithm with a 32-bit digest size.
Note that it is not intended to defend against intentional modification of an
image, as this can be easily hidden by re-calculating the checksum. EC3
provides other features to defend against malicious modifications.
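A small sketch of a checksum calculation follows. It assumes that "CRC32"
means the common CRC-32 variant implemented by zlib; this document does not
name a polynomial, so treat that choice as an assumption. The helper names
checksum and verify are hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    """Return a 32-bit CRC of data (zlib's CRC-32 variant is assumed)."""
    return zlib.crc32(data) & 0xFFFFFFFF


def verify(data: bytes, expected: int) -> bool:
    """Check stored data against a previously recorded checksum."""
    return checksum(data) == expected


stored = b"123456789"
recorded = checksum(stored)
```

A single flipped byte in the stored data changes the result, so verify
returns False for the corrupted copy.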
5 Image Header
==============
The Image Header can be found at the beginning of every EC3 image file.
It provides critical information about the rest of the file, including the
version of the file format that the file uses, and the location and size of
the tag table. The header also includes two magic numbers:
* A signature to validate that the file is in fact an EC3 image. This
must have the value 0x45433358 ('EC3X' in ASCII).
* An application magic number that is reserved for use by the creator of
the image.
5.1 Image Header Layout
-----------------------
Offset    Description          Type
----------------------------------------
0x00      Signature            uint32
0x04      Format Version       uint16
0x06      Cluster Size         uint16
0x08      Tag Table Offset     uint64
0x10      Tag Count            uint64
0x18      Application Magic    uint64
5.1.1 Signature
The Signature is found at the very beginning of the image file. It, like
all integer types, is stored in big-endian. It always has the value
0x45433358 (or 'EC3X' in ASCII).
5.1.2 Format Version
This specifies which version of the EC3 Image file format
the rest of the file conforms to. Only the Signature and Format Version
header items are guaranteed to be the same across all format versions.
The format version is encoded as a 16-bit integer, with the following
format:
0              1
0              6
XXXXXXXXYYYYYYYY
Where X encodes the major number of the format version, and Y encodes
the minor number. For example, version 3.2 would
be encoded as 0x0302.
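The encoding above can be sketched as a pair of helpers. The names
encode_version and decode_version are hypothetical.

```python
def encode_version(major: int, minor: int) -> int:
    """Pack a major/minor pair into the 16-bit Format Version field."""
    if not (0 <= major <= 0xFF and 0 <= minor <= 0xFF):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << 8) | minor


def decode_version(value: int) -> tuple:
    """Split a 16-bit Format Version field into (major, minor)."""
    return (value >> 8) & 0xFF, value & 0xFF


encoded = encode_version(3, 2)
```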
5.1.3 Cluster Size
This specifies the size of all data clusters stored within the image,
before any transformation operations such as compression or encryption are
applied.
The following cluster size values are defined:
Header Value    Cluster Size (bytes)    Cluster Size (kilobytes)
--------------------------------------------------------------------
0x00            4,096                   4
0x01            8,192                   8
0x02            16,384                  16
0x03            32,768                  32
0x04            65,536                  64
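One way a reader might translate the header value into a byte count is a
plain lookup that rejects undefined values. The names CLUSTER_SIZES and
cluster_size_bytes are hypothetical.

```python
# Header value -> cluster size in bytes, copied from the table above.
CLUSTER_SIZES = {
    0x00: 4096,
    0x01: 8192,
    0x02: 16384,
    0x03: 32768,
    0x04: 65536,
}


def cluster_size_bytes(header_value: int) -> int:
    """Return the cluster size for a header value, or raise ValueError."""
    try:
        return CLUSTER_SIZES[header_value]
    except KeyError:
        raise ValueError(f"undefined cluster size value: {header_value:#04x}") from None


size = cluster_size_bytes(0x03)
```

Each defined value happens to equal 4096 shifted left by the header value,
but a lookup table makes it explicit that other values are not defined.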
5.1.4 Tag Table Offset
This specifies the offset in bytes from the beginning of the image file
to the beginning of the tag table.
5.1.5 Tag Count
This specifies the number of entries in the tag table.
5.1.6 Application Magic
This is an application-defined value. The creator of an EC3 image can
set this to any arbitrary value. Any generic EC3 manipulation tools should
preserve the value of this field and, if the tool supports creating EC3
images, allow the user to specify the value to store in this field.
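Putting the header fields together, a reader might unpack the first 32 bytes
of an image as sketched below. The function name parse_header and the
dictionary keys are assumptions for this example; the field order, sizes,
and big-endian byte order follow Section 5.1 and Section 3.1.

```python
import struct

# Big-endian: uint32, uint16, uint16, then three uint64 fields (32 bytes).
HEADER = struct.Struct(">IHHQQQ")
SIGNATURE = 0x45433358  # 'EC3X'


def parse_header(buf: bytes) -> dict:
    """Unpack and validate the Image Header at the start of buf."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer is too short to hold an image header")
    sig, version, cluster, tag_off, tag_count, magic = HEADER.unpack_from(buf, 0)
    if sig != SIGNATURE:
        raise ValueError("not an EC3 image: bad signature")
    return {
        "format_version": version,
        "cluster_size": cluster,       # raw header value, not a byte count
        "tag_table_offset": tag_off,
        "tag_count": tag_count,
        "application_magic": magic,
    }


# A made-up header for illustration: version 1.0, 32 KB clusters, no tags.
sample = HEADER.pack(SIGNATURE, 0x0100, 0x03, 32, 0, 0)
header = parse_header(sample)
```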
6 Tags
======
Tags are the fundamental units of data storage in an EC3 image. Every image
contains one or more tags. A tag is essentially a contiguous range of data
within an image, with an associated type, identifier, and flags. Various
data processing layers can be applied to the contents of a tag, such as
encryption or compression. Every tag within an image can be referenced either
by its index within the tag table or by an optional 64-bit identifier.
6.1 The Tag Table
-----------------
The Tag Table describes all of the tags in an image. Its location and size
can be found by parsing the Image Header. The Tag Table consists of a number
of entries, one for each tag in the image.
Each entry in the Tag Table has the following layout:
Offset    Description          Type
----------------------------------------
0x00      Tag Type             uint32
0x04      Flags                uint32
0x08      Checksum             uint32
0x1C      Reserved             uint32
0x20      Identifier           uint64
0x28      Offset               uint64
0x30      Size                 uint64
0x38      Reserved             uint64
6.1.1 Tag Type
A 32-bit integer indicating the type of the tag. EC3 defines a range
of different tag types, which can be found in Section 6.2.
6.1.2 Flags
Flags describing certain attributes of a tag, such as whether the tag
is compressed, encrypted, or signed. The full set of flags can be found
in Section 6.3.
6.1.3 Checksum
A checksum of the tag data, calculated on the raw data as it appears
on-disk, after any Data Filters (compression, encryption, etc)
have been applied. This checksum should be checked before the tag data is
processed any further. The checksum is calculated using the algorithm
described in Section 4.3.
6.1.4 Identifier
An arbitrary 64-bit integer that can be used to identify a tag. Every tag
within an image must have a unique identifier. The only exception is the
identifier value 0x00, which any number of tags can use as their
identifier and is used to indicate that a tag has no identifier.
6.1.5 Offset and Size
The offset from the beginning of the image file to the beginning of the
tag data, and the length of the tag data. Both values are measured in
bytes.
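A tentative sketch of reading one Tag Table entry follows. It reads each
field at the offset listed in the layout table rather than assuming the
fields are packed back to back. Two things here are assumptions: that an
entry is 0x40 bytes long (the last listed field starts at 0x38 and is 8
bytes wide), and that entries are stored contiguously in the table. The
names ENTRY_SIZE, FIELDS, and parse_tag_entry are hypothetical.

```python
import struct

ENTRY_SIZE = 0x40  # assumption: 0x38 + 8 bytes for the final field

# Field name -> (offset within the entry, big-endian struct format).
FIELDS = {
    "type": (0x00, ">I"),
    "flags": (0x04, ">I"),
    "checksum": (0x08, ">I"),
    "identifier": (0x20, ">Q"),
    "offset": (0x28, ">Q"),
    "size": (0x30, ">Q"),
}


def parse_tag_entry(table: bytes, index: int) -> dict:
    """Read the entry at the given index from the raw tag table bytes."""
    base = index * ENTRY_SIZE
    if index < 0 or base + ENTRY_SIZE > len(table):
        raise IndexError("tag index is outside the table")
    return {
        name: struct.unpack_from(fmt, table, base + off)[0]
        for name, (off, fmt) in FIELDS.items()
    }


# Build a made-up two-entry table and read back the second entry.
entry = bytearray(ENTRY_SIZE)
struct.pack_into(">I", entry, 0x04, 0x00000003)
struct.pack_into(">Q", entry, 0x28, 4096)
struct.pack_into(">Q", entry, 0x30, 512)
table = bytes(ENTRY_SIZE) + bytes(entry)
tag = parse_tag_entry(table, 1)
```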
6.2 Tag Types
-------------
The type of a tag determines the format of the data contained within it.
6.2.1 VOLU: Volume
Volume tags contain the filesystem tree and file/directory metadata for a
single volume within the container.
6.2.2 CTAB: Cluster Table
The Cluster Table contains the file data clusters for all volumes within
the container.
6.2.3 XATR: Extended Attributes Table
The Extended Attributes table contains any extended attributes referenced
by any file or directory stored in any of the volumes in the container.
6.2.4 STAB: String Table
The String Table contains all of the strings used as file/directory names
for all files and directories stored in the container.
6.2.5 MFST: Manifest
The manifest is a key-value data store that holds information describing
the container. Apart from a few required keys, any arbitrary keys and
values can be stored in the manifest.
6.2.6 BLOB: Binary Data
Binary blobs are contiguous buffers of arbitrary binary data. EC3 places
no requirements on the length or layout of this data, so these tags can
be used for any application-defined purpose.
6.2.7 EXEC: Executable
Executable tags are used to store embedded executable files. For certain
executable file formats, these tags can also include auxiliary information
about the executable file to allow readers to load and run the executable
without having to implement a parser for the executable file format.
6.2.8 CERT: Digital Certificate
If any part of the image is digitally signed, it will also contain one or
more Digital Certificate tags. These tags contain either:
a) the certificate used to sign the container; or
b) (optionally) any intermediate certificates needed to link the
signing certificate back to a trusted root certificate.
6.2.9 CSIG: Digital Signature
If any part of the image is digitally signed, this tag contains the actual
signature data.
6.3 Tag Flags
-------------
A Tag can have a number of different flags set. A full list of these flags,
including their values and meanings, is provided here.
6.3.1 0x00000001: Signed
The data in this Tag is included in the Image's digital
signature.
6.3.2 0x00000002: Compressed
The data in this Tag is compressed. Note that, in most cases, this flag
will not be enabled on the Cluster Table, as each Cluster is compressed
separately.
6.3.3 0x00000004: Encrypted
The data in this Tag is encrypted using the Image Key.
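The flag values above can be modelled as a bit field. In this sketch the
class name TagFlag is hypothetical; the values are those listed in this
section.

```python
import enum


class TagFlag(enum.IntFlag):
    SIGNED = 0x00000001
    COMPRESSED = 0x00000002
    ENCRYPTED = 0x00000004


# A Flags field of 0x00000005 means the tag is signed and encrypted.
flags = TagFlag(0x00000005)
is_compressed = TagFlag.COMPRESSED in flags
```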
6.4 Tag Identifiers
-------------------
Every Tag in an Image must have a unique Identifier. The Identifier is a
64-bit integer value, which can optionally be interpreted as a string of no
more than 8 ASCII characters.
If no Identifier is specified for a Tag, a sequential Identifier should be
assigned automatically.
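One possible way to treat an Identifier as a short ASCII string is sketched
below. This document does not say how the characters map onto the 64-bit
value, so the choices here (first character in the most significant byte,
unused trailing bytes set to zero) are assumptions, as are the helper names
ident_from_text and ident_to_text.

```python
def ident_from_text(text: str) -> int:
    """Pack up to 8 ASCII characters into a 64-bit Identifier."""
    raw = text.encode("ascii")
    if len(raw) > 8:
        raise ValueError("identifier text must be at most 8 characters")
    # Assumption: left-aligned, zero-padded, big-endian.
    return int.from_bytes(raw.ljust(8, b"\x00"), "big")


def ident_to_text(ident: int) -> str:
    """Recover the ASCII text from a 64-bit Identifier."""
    return ident.to_bytes(8, "big").rstrip(b"\x00").decode("ascii")


ident = ident_from_text("kernel")
```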
6.5 Data Filtering
------------------
The different types of processing that can be performed on a Tag's data, such
as encryption and compression, are called Filters. Filters are applied to a
Tag's data as it is being written, and are applied in reverse order when the
data is being read.
To facilitate multiple Filters being used together, the order in which
Filters are applied to a particular Tag's data is strictly defined. It is
critical that Filters are applied in the correct order to maximise
effectiveness. For example, Tag data must be compressed BEFORE it is
encrypted. Encrypting data greatly increases its entropy and "randomness",
making it essentially incompressible.
The types of Filters supported by EC3 are listed below, in the order they are
applied when writing data to a Tag. When reading Tag data, the filters are
applied in the reverse order.
6.5.1 Compression
Tag data is compressed before being written to the Image to reduce
file size. This is the only Filter that changes the amount of data that
is written to a file.
Note that this Filter will reduce I/O performance and require that data
is read sequentially from the Tag. Random access to compressed Tag data
is not supported.
6.5.2 Encryption
Tag data is encrypted using the specified encryption key before being
written to disk.
6.5.3 Digital Signature
Tag data is included in the set of data that makes up the Image's digital
signature. Unlike the other Filters, this one does not modify the Tag
data that is written to the Image, but rather specifies that the data is
included as part of the whole Image's digital signature hash.
More information about how the Image Signature is calculated and verified
can be found in Section 12.
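The ordering of the first two Filters can be illustrated with a toy write
and read pipeline. Everything concrete here is a stand-in: zlib is used as
an example compressor, and xor_cipher is a placeholder that is NOT real
encryption; this document does not name the compression or encryption
algorithms. Only the ordering (compress, then encrypt on write; decrypt,
then decompress on read) comes from the text above.

```python
import zlib


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder for an encryption Filter. NOT secure; illustration only."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def write_filters(data: bytes, key: bytes) -> bytes:
    """Apply Filters in write order: compression first, then encryption."""
    return xor_cipher(zlib.compress(data), key)


def read_filters(stored: bytes, key: bytes) -> bytes:
    """Apply Filters in reverse order: decryption first, then decompression."""
    return zlib.decompress(xor_cipher(stored, key))


key = b"example-key"
original = b"a" * 1000
stored = write_filters(original, key)
restored = read_filters(stored, key)
```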
7 String Table
==============
8 Manifest
==========
9 Volumes
=========
9.1 Filesystem Tree
-------------------
9.2 Clusters
------------
9.3 Extended Attributes
-----------------------
10 Binary Blobs
===============
11 Embedded Executables
=======================
12 Signature Verification
=========================
13 Encryption
=============
vim: shiftwidth=3 expandtab