Encoding: prefixed-compact

❝A new, compact, prefix-based binary encoding for structuring bytes.❞
Contents

prefixed-compact is a prefix-based binary encoding with a limited, compact header for structured serialization of bytes.

Introduction

The starting point were TLV, LTV, and similar “on-the-wire” value-representations. TLVs and LTVs are straight-forward, there is little overhead and they are easy to read. However, they are also very limited. They encode a type (which an application may not know/understand/support) and a length and value. The length allows skipping over the “value”-body, such that unsupported/unknown types can be ignored and skipped over.

This encoding aims for a similar minimal level of features, setting its scope at structuring bytes.

Encodings like JSON, YAML, XML, and many more provide interesting features and sport amazing capabilities at value control, but at the cost of a lot of complexity. JSON is easy to understand, but is a text-based encoding thus inefficient for raw bytes of data. Due to its start- and end-tags for composite types, there is need for proper matching. However, JSON’s ability to express structure with its arrays and objects, is quite powerful. JSON offers a few data-types giving it an elementary ability to express semantics, but no restrictions, at least not without JSON-Schema. XML provides many capabilities, but has these quirks where, for example, XML attributes don’t really have a unique purpose, such that – syntactically speaking – attributes can always also be expressed as nested elements. It is only when a schema or application enforces one mechanism that a choice is made, instead of by necessity. Furthermore, XML inherently recognizes order of child elements, meaning that this notion of order is ever-present even if it is not a prime concern. (Yes, IIRC, this can be controlled or redefined in a schema.) JSON, on the other hand, doesn’t recognize this notion of order for object fields, thus giving its objects one specific purpose: grouping related fields, identified by key, and (sometimes) composing the implied composite structure.

Many more elaborate serialization formats and IPC protocols are very powerful and efficient, but require defining an interface, possibly a schema beforehand, possibly executing a compiler to produce machine-generated code to adopt. None of these things are bad, per se, but they do make things vastly more elaborate and more complicated.

Status

Status: the specification is essentially fixed.

Open considerations:

Design

I noticed that there wasn’t a convenient, straight-forward, go-to mechanism for structuring bytes to send “over-the-wire” for IPC and data-transfers that is least-intrusive and efficient for one-off use. Using the basic notion of TLV (type-length-value) as starting point and appreciating the ability to structure bytes, e.g. collections of data.

Target: a minimal, structured format for bytes

  1. Binary encoding (strict, efficient, compact)
    Be strict.
  2. Prefix-based, with header
    A small prefixed header defines what to expect of the value, including counts -the number of bytes or entries- and structure.
  3. Structured
    Some structure is necessary to avoid a need for conventions on structure. Sequences of dynamic sizes, any number of values. Maps to express key-value pairings. Nesting. Ability to indicate the element count.
  4. Support for raw bytes, i.e. no space-loss due to a need for (nested) encoding
    Bytes makes transferring data efficient. Most data can be converted to or encoded into bytes already. No need for (nested) encoding to store raw data.
  5. Limited scope: no data-types, no null, no interpretation, no restrictions, no validation
    We carry bytes in structured format. Interpretation of the values is left to the applications.
  6. Support primarily small values
    Structured data often comes in small values or small numbers of elements.
  7. Ability to handle (extendable to) arbitrarily large values and numbers of elements.
    Ability to store large byte-arrays when necessary, although with some overhead. Avoid unnecessary limitations.
  8. No unnecessary restrictions on available mechanisms.
    The few mechanisms offered by the encoding may be applied as needed, avoiding unnecessary restrictions. Orthogonality of bit-flags.

This article discusses the decisions and thoughts behind decisions. For the specifics, the documentation covers this quite nicely.

Formal specification

A (somewhat) formal specification for the encoding.

value = data | (key , value) | sequence | map ;
data = header , {byte} ;
key = header , {byte} ;
sequence = header , {value} ;
map = header , {key , value} ;
header = termination-bit , valuetype-bit , multiplicity-bit ,
    headersize-bit , 4-bit-size , [8-bit-size] ;

with

additionally

  1. header must include the second byte, as part of big-endian unsigned size (12 bits in total), if-and-only-if the headersize-bit is set. The 4-bit size ranges [0,15]. The 12-bit size ranges [1,4096], thus requiring an offset-correction by 1.
  2. if termination-bit is unset, there must follow a value of same valuetype | multiplicity.
  3. length of raw data ({byte}) in data and key must correspond to length indicated in header; actual number of entries in sequence and map must correspond to count indicated in header.
  4. the minimum necessary header-size should be used, i.e. sizes/counts ranging [0,15] expressed with single-byte header.

The prefixed header defines how the next bytes are to be interpreted. It consists of 4 bit-flags and a big-endian unsigned integer value of 4 or 12 bits indicating a count, i.e. a number of bytes or elements.

Structure:

termination-bit || valuetype-bit || multiplicity-bit || headersize-bit || 4-bits-of-size || [8-bits-of-size]

Size (count) is stored as an 1/2-byte big-endian unsigned integer, naturally leaving room for flags in the most-significant bits.

The 4-bit/12-bit value represents a count rather than a number of bytes. The count of a plain value represents a size in number of bytes, and for a key-value-pair represents the number of bytes of only the key. When the multiplicity-bit is set, the count represents the number of entries (values or key-value-pairs) in the sequence or map, respectively.

Body

Once the header is read, a corresponding body must follow. The two bits, valuetype and multiplicity, together indicate one of four types of value:

In all cases, the termination-bit indicates whether this record concludes the value. If the termination-bit is not set on a value, the follow up, which must be of the same type, extends the first value. This means a map can contain many multiples of 4096 entries even if the encoding is limited to expressing a count of at most 4096. The termination-bit may be left unset for smaller sizes too. For example, leaving the termination-bit unset would allow indefinite streaming of additional values, only to be terminated at the end on an empty (0-sized) value.

Examples

Some examples to demonstrate the expressiveness of the encoding, and its results as bytes. The symbolic representation illustrates how flags are used in the encoding, making results easier to read, while the raw bytes representation shows actual results as raw bytes.

As a reminder, the following bit-flags are used:

TERMINATION  = 0b10000000 (0x80, 128)
VALUETYPE    = 0b01000000 (0x40,  64)
MULTIPLICITY = 0b00100000 (0x20,  32)
HEADERSIZE   = 0b00010000 (0x10,  16)

Note: 2-byte headers have value offset by 1.

In these examples, the symbolic representation uses ‘|’ to denote bit-wise OR. For example, “5|TERMINATION” = 0x85 or 133.

explanation: a byte-array, first byte is the header with size and the termination-flag. Then follow the content-bytes.

explanation: a key-value-pair, with header indicating key properties, followed by the key. Then another value must follow, as specified, in this case a plain byte-array.

explanation: a _sequence of values. In this case a sequence of length 3, followed by 3 1-byte plain values.

explanation: the map contains 2 key-value-pairs, each key-value-pair by specification consists of first the key, then some arbitrary value-type. These count as 1 key-value entry in the map. Next follows the second key-value entry.

explanation: a series of 0-byte unterminated plain values, with occasional 1-byte value in between, only terminated at the last (20th) plain value.

explanation: a series of elements in a sequence. There are 16 elements, requiring a 2-byte header to express that number of elements. Note the offset by 1 to allow for a maximum of 4096 elements. Because of big-endianness, the first header-byte contains the highest size-bits, all zero in this case.

Characteristics and capabilities

Let’s discuss some key highlights of possibilities with the bit-flags provided. Consider that many of these ideas can be taken into extremes that are wildly impractical. The benefit is not in the extremes but in the control that they offer. Given no unnecessary restrictions, it is possible to work with large batches or just the opposite, depending on your use-case. Embedded use-cases may want to trade little bits of overhead with extra headers and smaller values, while powerful computers may want to benefit from larger sizes and repeated extensions. The encoding aims to define only those restrictions necessary to make concept work.

Prefixed header, provides information on the raw bytes immediately following. There are no special characters, no need to escape, no restriction on the range of byte-values.

Prefixed header with no closing tags/tokens make it possible to process data in streaming fashion. It also means there is no redundancy on structural information to correct or detect encoding errors.

Structure for bytes, this encoding is designed to carry bytes across. The available features, though limited, are selected to handle small and large values, plain values and collections of values, unknown and/of varying numbers of bytes and/or elements, extending existing values, and to vary/adapt the (parts of) values to suitable sizes given unknown/varying circumstances/capacities.

Avoiding convention, by introducing a key-value-pair and multiplicity.

Avoiding interpretation of values means that we do not prescribe whether a value must be expressed as big-endian or little-endian, or in 8 bytes or 4 bytes, or whether floating points are an acceptable data-type. Values essentially “carry bytes across in structured format”. Everything else is left to the application to determine as best suited.

Implicit order of occurrence in the data-stream. Although not expressed within the value-bytes, the order of occurrence may matter if you choose so. Sequences may express priority or age based on the position in the list, if needed. If not, a sequence may represent a set of elements.

Freedom of interpretation: there is no enforced interpretation of value-bytes. For example, duplicates are not filtered/detected on-the-fly, so it is left to the applications to define semantics for duplicate occurrences of keys in a map-type. Either, reject the data, or replace the previous value, or concatenate to the previous value, or add as additional element to the same key. The termination-bit allows extending a value, without prescribing the semantics of the bytes being carried. This includes repeated elements in sequences, repeated keys in maps, etc.

Expression of small and large values is possible using the headersize-bit. With the bit unset, a 1-byte header is expected, consisting of 4 flag-bits and 4 bits left over to express sizes, i.e. [0, 15]. With the headersize-bit set, an additional byte of header is available, making it possible to express sizes up to 12 bits. Given that we can already express values up to 15, 12-bits are defined to be in range [1, 4096], requiring an offset-correction.

Arbitrarily large values and collections: although support is offered to immediately express any number from 0 to 4096, the proper use of the termination-bit will allow you to extend the value. 4096 Is far from “incomprehensibly large” but can be extended indefinitely. Even if 4 KiB is relatively small when plain bytes are concerned, 4096 elements are already a significant number for sequences and maps. Consider that both sequences and maps define their elements as value, meaning that nested structure is allowed. Note that plain bytes are expressed in chunks of 4 KiB, requiring no encoding, and by using the termination-bit can be extended up to infinity, resulting in 2 bytes of overhead per 4 KiB.

Arbitrarily partitioned value-data: the termination-bit allows dividing up the expression of large values and collection (entries) arbitrarily. The data, whether bytes or entries, may be partitioned in whichever way is suitable, with numbers from 0 to 4096. Until the termination-bit is set, a next partition of the same type will always be expected.

Adaptability to unknown number of elements: given that the termination-bit concludes a value, it is possible to encode/stream structured data without knowing exact sizes beforehand, without even knowing intermediate sizes or with varying intermediate sizes. Until the termination-bit is specified, any value can be extended arbitrarily.

Redundancy, error-correction or matchable indicators are out-of-scope, meaning that if there is risk of data-corruption, additional measures need to be taken outside of the domain of this encoding. Similarly, verification is only possible to the extent that some violations of encoding rules might be detected in the encoded data.

No unnecessary restrictions: the encoding does not enforce unnecessary restrictions. This means, for example, that keys can become arbitrarily large. There will probably be only very few cases where this is useful/necessary, but it isn’t forbidden. It will undoubtedly become a problem in the embedded ecosystem. It is recommended to define custom limits for this ecosystem, rather than have the encoding make things complicated by defining arbitrary restrictions on top of necessary control-bits. Embedded systems, in their turn, can simply reject encodings that violate their restrictions or that reach beyond their capacity, without this being a concern of the encoding itself.

Trade-offs

The following trade-offs were evaluated. Some rationale is provided.

Implementation

Currently, only a Go implementation exists.

The specification should be clear and fixed, except possibly for an open item for consideration listed in section: status. If this post is not exact enough, please refer to the link above which refers to the package documentation that also contains an elaborate description of the encoding, as well as the constants defined for use in the Go-implementation.

There is strong emphasis on the should-requirement for “minimum necessary headersize”. This is intentional. Given this small overlap in values, it would be possible to use a “magic” value between [1,15] to signal a migration to a different encoding or an extension of this encoding. This isn’t a must-requirement, because there are no plans, and in rare cases it may be beneficial to default to 2 bytes of header, preferring consistency over space-efficiency. A migration could also be signaled by starting communication with a previously agreed “labeled” value to indicate that both parties can migrate to a different encoding, i.e. a solution is not dependent on a reserved “magic” values.

Changelog

This article will receive updates, if necessary.