Encoding: prefixed-compact

Sun, Dec 8, 2024 ❝A new, compact, prefix-based binary encoding for structuring bytes.❞

Contents

prefixed-compact is a prefix-based binary encoding with a limited, compact header for structured serialization of bytes.

Introduction

The starting point were TLV, LTV, and similar “on-the-wire” value-representations. TLVs and LTVs are straight-forward, there is little overhead and they are easy to read. However, they are also very limited. They encode a type (which an application may not know/understand/support) and a length and value. The length allows skipping over the “value”-body, such that unsupported/unknown types can be ignored and skipped over.

This encoding aims for a similar minimal level of features, setting its scope at structuring bytes.

Encodings like JSON, YAML, XML, and many more provide interesting features and sport amazing capabilities at value control, but at the cost of a lot of complexity. JSON is easy to understand, but is a text-based encoding thus inefficient for raw bytes of data. Due to its start- and end-tags for composite types, there is need for proper matching. However, JSON’s ability to express structure with its arrays and objects, is quite powerful. JSON offers a few data-types giving it an elementary ability to express semantics, but no restrictions, at least not without JSON-Schema. XML provides many capabilities, but has these quirks where, for example, XML attributes don’t really have a unique purpose, such that – syntactically speaking – attributes can always also be expressed as nested elements. It is only when a schema or application enforces one mechanism that a choice is made, instead of by necessity. Furthermore, XML inherently recognizes order of child elements, meaning that this notion of order is ever-present even if it is not a prime concern. (Yes, IIRC, this can be controlled or redefined in a schema.) JSON, on the other hand, doesn’t recognize this notion of order for object fields, thus giving its objects one specific purpose: grouping related fields, identified by key, and (sometimes) composing the implied composite structure.

Many more elaborate serialization formats and IPC protocols are very powerful and efficient, but require defining an interface, possibly a schema beforehand, possibly executing a compiler to produce machine-generated code to adopt. None of these things are bad, per se, but they do make things vastly more elaborate and more complicated.

Status

Status: the specification is essentially fixed.

Open considerations:

Need for a “magic” value to indicate an extension to this encoding or a migration to succeeding encoding, by using the small redundant area in the size-bits of the 2-byte header: [1,15]. This use-case is covered by strongly advised ("should") use the 1-byte header for sizes smaller than 16. (See section: Implementation.)

Design

I noticed that there wasn’t a convenient, straight-forward, go-to mechanism for structuring bytes to send “over-the-wire” for IPC and data-transfers that is least-intrusive and efficient for one-off use. Using the basic notion of TLV (type-length-value) as starting point and appreciating the ability to structure bytes, e.g. collections of data.

Target: a minimal, structured format for bytes

Binary encoding (strict, efficient, compact)
Be strict.
Prefix-based, with header
A small prefixed header defines what to expect of the value, including counts -the number of bytes or entries- and structure.
Structured
Some structure is necessary to avoid a need for conventions on structure. Sequences of dynamic sizes, any number of values. Maps to express key-value pairings. Nesting. Ability to indicate the element count.
Support for raw bytes, i.e. no space-loss due to a need for (nested) encoding
Bytes makes transferring data efficient. Most data can be converted to or encoded into bytes already. No need for (nested) encoding to store raw data.
Limited scope: no data-types, no null, no interpretation, no restrictions, no validation
We carry bytes in structured format. Interpretation of the values is left to the applications.
Support primarily small values
Structured data often comes in small values or small numbers of elements.
Ability to handle (extendable to) arbitrarily large values and numbers of elements.
Ability to store large byte-arrays when necessary, although with some overhead. Avoid unnecessary limitations.
No unnecessary restrictions on available mechanisms.
The few mechanisms offered by the encoding may be applied as needed, avoiding unnecessary restrictions. Orthogonality of bit-flags.

This article discusses the decisions and thoughts behind decisions. For the specifics, the documentation covers this quite nicely.

Formal specification

A (somewhat) formal specification for the encoding.

value = data | (key , value) | sequence | map ;
data = header , {byte} ;
key = header , {byte} ;
sequence = header , {value} ;
map = header , {key , value} ;
header = termination-bit , valuetype-bit , multiplicity-bit ,
    headersize-bit , 4-bit-size , [8-bit-size] ;

with

valuetype=0,multiplicity=0 for data (size for raw bytes)
valuetype=1,multiplicity=0 for (key, value) (size for key)
valuetype=0,multiplicity=1 for sequence (size for number of elements)
valuetype=1,multiplicity=1 for map (size for number of key-value-pairs)

additionally

header must include the second byte, as part of big-endian unsigned size (12 bits in total), if-and-only-if the headersize-bit is set. The 4-bit size ranges [0,15]. The 12-bit size ranges [1,4096], thus requiring an offset-correction by 1.
if termination-bit is unset, there must follow a value of same valuetype | multiplicity.
length of raw data ({byte}) in data and key must correspond to length indicated in header; actual number of entries in sequence and map must correspond to count indicated in header.
the minimum necessary header-size should be used, i.e. sizes/counts ranging [0,15] expressed with single-byte header.

The prefixed header defines how the next bytes are to be interpreted. It consists of 4 bit-flags and a big-endian unsigned integer value of 4 or 12 bits indicating a count, i.e. a number of bytes or elements.

Structure:

termination-bit || valuetype-bit || multiplicity-bit || headersize-bit || 4-bits-of-size || [8-bits-of-size]

termination-bit (0x80): the flag indicates whether this record is the conclusion of the value. If the bit is unset, a next record is expected of the same type which continues the value-type.
valuetype-bit (0x40): the flag indicates whether the value is a plain value or key-value-pair, with the next bytes representing the key and must be followed up with a (semantically-corresponding) value.
multiplicity-bit (0x20): the flag indicates whether to expect a single value or a collection of values; a sequence of values if valuetype-bit is unset, or a map/dictionary/object if valuetype-bit is set.
headersize-bit (0x10): the flag indicates whether to expect a single-byte header, i.e. next 4 bits represent the count, or a 2-byte header, i.e. next 12 bits represent the count.

Size (count) is stored as an 1/2-byte big-endian unsigned integer, naturally leaving room for flags in the most-significant bits.

4-bit size (1 byte incl. bit-flags): allows specifying counts (number of bytes or elements) in range [0, 15].
12-bit size (2 bytes incl. bit-flags): allows specifying counts in range [1, 4096], requiring an offset correction, with implementations strongly encouraged to always use the 1-byte header if sufficient.

The 4-bit/12-bit value represents a count rather than a number of bytes. The count of a plain value represents a size in number of bytes, and for a key-value-pair represents the number of bytes of only the key. When the multiplicity-bit is set, the count represents the number of entries (values or key-value-pairs) in the sequence or map, respectively.

Body

Once the header is read, a corresponding body must follow. The two bits, valuetype and multiplicity, together indicate one of four types of value:

In all cases, the termination-bit indicates whether this record concludes the value. If the termination-bit is not set on a value, the follow up, which must be of the same type, extends the first value. This means a map can contain many multiples of 4096 entries even if the encoding is limited to expressing a count of at most 4096. The termination-bit may be left unset for smaller sizes too. For example, leaving the termination-bit unset would allow indefinite streaming of additional values, only to be terminated at the end on an empty (0-sized) value.

Examples

Some examples to demonstrate the expressiveness of the encoding, and its results as bytes. The symbolic representation illustrates how flags are used in the encoding, making results easier to read, while the raw bytes representation shows actual results as raw bytes.

As a reminder, the following bit-flags are used:

TERMINATION  = 0b10000000 (0x80, 128)
VALUETYPE    = 0b01000000 (0x40,  64)
MULTIPLICITY = 0b00100000 (0x20,  32)
HEADERSIZE   = 0b00010000 (0x10,  16)

Note: 2-byte headers have value offset by 1.

In these examples, the symbolic representation uses ‘|’ to denote bit-wise OR. For example, “5|TERMINATION” = 0x85 or 133.

Bytes:
A basic byte-array, here using characters.
- JSON:"Hello"
- Prefixed-compact: [5|TERMINATION, 'H', 'e', 'l', 'l', 'o'] (symbolic)
- Prefixed-compact: [0x85, 0x48, 0x65, 0x6c, 0x6c, 0x6f] (raw)

explanation: a byte-array, first byte is the header with size and the termination-flag. Then follow the content-bytes.

Key-value:
A key-value-pair with key ‘version’ and value ‘1’.
- JSON: "version": 1
- Prefixed-compact: [7|TERMINATION|VALUETYPE, 'v', 'e', 'r', 's', 'i', 'o', 'n', 1|TERMINATION, 1] (symbolic)
- Prefixed-compact: [0xc7, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e, 0x81, 0x01] (raw)

explanation: a key-value-pair, with header indicating key properties, followed by the key. Then another value must follow, as specified, in this case a plain byte-array.

Sequence:
A sequence (list, array) of 3 elements.
- JSON: [1, 2, 3]
- Prefixed-compact: [3|TERMINATION|MULTIPLICITY, 1|TERMINATION, 1, 1|TERMINATION, 2, 1|TERMINATION, 3] (symbolic)
- Prefixed-compact: [0xa3, 0x81, 0x01. 0x81, 0x02. 0x81, 0x03] (raw)

explanation: a _sequence of values. In this case a sequence of length 3, followed by 3 1-byte plain values.

Map:
A map (object, dictionary) with keys ‘first’ and ’last’, and their values.
- JSON: {"first": "hello", "last": "world"}
- Prefixed-compact: [2|TERMINATION|VALUETYPE|MULTIPLICITY, 5|TERMINATION|VALUETYPE, 'f', 'i', 'r', 's', 't', 5|TERMINATION, 'h', 'e', 'l', 'l', 'o', 4|TERMINATION|VALUETYPE, 'l', 'a', 's', 't', 5|TERMINATION, 'w', 'o', 'r', 'l', 'd'] (symbolic)
- Prefixed-compact: [0xe2, 0xc5, 0x66, 0x69, 0x72, 0x73, 0x74, 0x85, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0xc4, 0x6c, 0x61, 0x73, 0x74, 0x85, 0x77, 0x6f, 0x72, 0x6c, 0x64] (raw)

explanation: the map contains 2 key-value-pairs, each key-value-pair by specification consists of first the key, then some arbitrary value-type. These count as 1 key-value entry in the map. Next follows the second key-value entry.

Extending bytes:
If you really like empty byte-arrays.
- JSON: "Hi!"
- Prefixed-compact: [0, 0, 0, 0, 0, 1, 'H', 0, 0, 0, 0, 0, 0, 1, 'i', 0, 0, 0, 0, 0, 0, 1|TERMINATION, '!'] (symbolic)
- Prefixed-compact: [0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x48, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x69, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x81, 0x21] (raw)

explanation: a series of 0-byte unterminated plain values, with occasional 1-byte value in between, only terminated at the last (20th) plain value.

Larger sequence:
The 2-byte header, in this case, enables larger numbers in sequences.
- JSON: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, "Hello"]
- Prefixed-compact: [0|TERMINATION|MULTIPLICITY|HEADERSIZE, 15, 1|TERMINATION, 1, 1|TERMINATION, 2, 1|TERMINATION, 3, 1|TERMINATION, 4, 1|TERMINATION, 5, 1|TERMINATION, 6, 1|TERMINATION, 7, 1|TERMINATION, 8, 1|TERMINATION, 9, 1|TERMINATION, 10, 1|TERMINATION, 11, 1|TERMINATION, 12, 1|TERMINATION, 13, 1|TERMINATION, 14, 1|TERMINATION, 15, 5|TERMINATION, 'H', 'e', 'l', 'l', 'o'] (symbolic)
- Prefixed-compact: [0xb0, 0x0f, 0x81, 0x01, 0x81, 0x02, 0x81, 0x03, 0x81, 0x04, 0x81, 0x05, 0x81, 0x06, 0x81, 0x07, 0x81, 0x08, 0x81, 0x09, 0x81, 0x0a, 0x81, 0x0b, 0x81, 0x0c, 0x81, 0x0d, 0x81, 0x0e, 0x81, 0x0f, 0x85, 0x48, 0x65, 0x6c, 0x6c, 0x6f] (raw)

explanation: a series of elements in a sequence. There are 16 elements, requiring a 2-byte header to express that number of elements. Note the offset by 1 to allow for a maximum of 4096 elements. Because of big-endianness, the first header-byte contains the highest size-bits, all zero in this case.

Characteristics and capabilities

Let’s discuss some key highlights of possibilities with the bit-flags provided. Consider that many of these ideas can be taken into extremes that are wildly impractical. The benefit is not in the extremes but in the control that they offer. Given no unnecessary restrictions, it is possible to work with large batches or just the opposite, depending on your use-case. Embedded use-cases may want to trade little bits of overhead with extra headers and smaller values, while powerful computers may want to benefit from larger sizes and repeated extensions. The encoding aims to define only those restrictions necessary to make concept work.

Prefixed header, provides information on the raw bytes immediately following. There are no special characters, no need to escape, no restriction on the range of byte-values.

Prefixed header with no closing tags/tokens make it possible to process data in streaming fashion. It also means there is no redundancy on structural information to correct or detect encoding errors.

Structure for bytes, this encoding is designed to carry bytes across. The available features, though limited, are selected to handle small and large values, plain values and collections of values, unknown and/of varying numbers of bytes and/or elements, extending existing values, and to vary/adapt the (parts of) values to suitable sizes given unknown/varying circumstances/capacities.

Avoiding convention, by introducing a key-value-pair and multiplicity.

This minimal expression of structure avoids the need to “agree on what a value means”. E.g., a number of elements that follow, i.e. representation of a list by convention. Instead, the multiplicity-bit of the encoding indicates that the size in the header indicates the number of elements to follow.
Similarly, the key-value-pair makes it possible to define some keys with predefined meaning, such that there is no need to rely on strict expectation of values at specific positions, e.g. “the first value represents the protocol version” and “the second value represents a particular format”. Instead, one can use the key-value to express “labeled” values such that these can be given specific/special meaning.

Avoiding interpretation of values means that we do not prescribe whether a value must be expressed as big-endian or little-endian, or in 8 bytes or 4 bytes, or whether floating points are an acceptable data-type. Values essentially “carry bytes across in structured format”. Everything else is left to the application to determine as best suited.

Implicit order of occurrence in the data-stream. Although not expressed within the value-bytes, the order of occurrence may matter if you choose so. Sequences may express priority or age based on the position in the list, if needed. If not, a sequence may represent a set of elements.

Freedom of interpretation: there is no enforced interpretation of value-bytes. For example, duplicates are not filtered/detected on-the-fly, so it is left to the applications to define semantics for duplicate occurrences of keys in a map-type. Either, reject the data, or replace the previous value, or concatenate to the previous value, or add as additional element to the same key. The termination-bit allows extending a value, without prescribing the semantics of the bytes being carried. This includes repeated elements in sequences, repeated keys in maps, etc.

Expression of small and large values is possible using the headersize-bit. With the bit unset, a 1-byte header is expected, consisting of 4 flag-bits and 4 bits left over to express sizes, i.e. [0, 15]. With the headersize-bit set, an additional byte of header is available, making it possible to express sizes up to 12 bits. Given that we can already express values up to 15, 12-bits are defined to be in range [1, 4096], requiring an offset-correction.

Arbitrarily large values and collections: although support is offered to immediately express any number from 0 to 4096, the proper use of the termination-bit will allow you to extend the value. 4096 Is far from “incomprehensibly large” but can be extended indefinitely. Even if 4 KiB is relatively small when plain bytes are concerned, 4096 elements are already a significant number for sequences and maps. Consider that both sequences and maps define their elements as value, meaning that nested structure is allowed. Note that plain bytes are expressed in chunks of 4 KiB, requiring no encoding, and by using the termination-bit can be extended up to infinity, resulting in 2 bytes of overhead per 4 KiB.

Arbitrarily partitioned value-data: the termination-bit allows dividing up the expression of large values and collection (entries) arbitrarily. The data, whether bytes or entries, may be partitioned in whichever way is suitable, with numbers from 0 to 4096. Until the termination-bit is set, a next partition of the same type will always be expected.

Adaptability to unknown number of elements: given that the termination-bit concludes a value, it is possible to encode/stream structured data without knowing exact sizes beforehand, without even knowing intermediate sizes or with varying intermediate sizes. Until the termination-bit is specified, any value can be extended arbitrarily.

Redundancy, error-correction or matchable indicators are out-of-scope, meaning that if there is risk of data-corruption, additional measures need to be taken outside of the domain of this encoding. Similarly, verification is only possible to the extent that some violations of encoding rules might be detected in the encoded data.

No unnecessary restrictions: the encoding does not enforce unnecessary restrictions. This means, for example, that keys can become arbitrarily large. There will probably be only very few cases where this is useful/necessary, but it isn’t forbidden. It will undoubtedly become a problem in the embedded ecosystem. It is recommended to define custom limits for this ecosystem, rather than have the encoding make things complicated by defining arbitrary restrictions on top of necessary control-bits. Embedded systems, in their turn, can simply reject encodings that violate their restrictions or that reach beyond their capacity, without this being a concern of the encoding itself.

Trade-offs

The following trade-offs were evaluated. Some rationale is provided.

Why “carry” only bytes? It seems to be more work to structure a composite payload for transfer than it is to acquire a byte-representation for complex objects. Furthermore, with the existence of many representations, e.g. flavors of floating point representations (float, double, Bfloat16, Posits) and integers (endianness and signedness). The complexity that comes with supporting any kind of data-type comes with trade-offs. Even if you decide to support few data-types, it will then leave a void for all the data-types without first-class support. Furthermore, many encodings cannot handle raw bytes, so by deferring the problem of representing a value in favor of simply “carrying bytes”, we simplify matters significantly and let’s us focus on structured expressions of composite data rather than data-types.
Aim for maximum potential with small 1-/2-byte header: 4 bit-flags, 4-/12-bit size.
Consider strong contrast between sizes representing numbers of elements and raw bytes, with raw bytes potentially getting very large. Support small numbers with [0, 15], bigger numbers with [1, 4096], and provide a secondary mechanism for the less likely case, to “extend” to arbitrarily large sizes. We cannot be compact/efficient/low-overhead and also provide first-class support for arbitrarily large sizes through literal exact expressions of size.
Offset sizes expressed with 2-byte header, such that a full 4 KiB can be expressed in a single binary blob. (With an overhead of 2 bytes per extension of another 4 KiB.) This encoding is not ideal when your primary concern is very large raw-bytes payloads.
There could be a maximum offset-correction of 16 for the 2-byte header, but this small extension to the range offers little room to actually matter. The offset of 1 is beneficial, because it allows expressing a size of 4096, i.e. 4 KiB.
NULL/null/nil is a common placeholder value to represent missing data. However, in the context of a programming language, this represent the placeholder value for memory that needs to be allocated regardless, because in some use-cases it will actually contain data. In the encoding, one can either leave out these values entirely, e.g. when using a map to represent a composite object, or one can use some form of empty (0-length bytes, empty sequence or empty map).

Implementation

Currently, only a Go implementation exists.

The specification should be clear and fixed, except possibly for an open item for consideration listed in section: status. If this post is not exact enough, please refer to the link above which refers to the package documentation that also contains an elaborate description of the encoding, as well as the constants defined for use in the Go-implementation.

There is strong emphasis on the should-requirement for “minimum necessary headersize”. This is intentional. Given this small overlap in values, it would be possible to use a “magic” value between [1,15] to signal a migration to a different encoding or an extension of this encoding. This isn’t a must-requirement, because there are no plans, and in rare cases it may be beneficial to default to 2 bytes of header, preferring consistency over space-efficiency. A migration could also be signaled by starting communication with a previously agreed “labeled” value to indicate that both parties can migrate to a different encoding, i.e. a solution is not dependent on a reserved “magic” values.

Changelog

This article will receive updates, if necessary.

2024-12-18 Use ‘must’ and emphasize ‘if-and-only-if’ in “additional rule (1.)”.
2024-12-08 Initial version.

Timelessness