jjrscott

Serialisation formats

Wikipedia has this to say about serialization:

In computing, serialization (or serialisation) is the process of translating a data structure or object state into a format that can be stored […]

A couple of notes on this:

  1. by “stored”, we invariably mean a byte/octet stream
  2. data structures always have different types of object that need to be encoded, e.g. trees are made up of nodes which contain data and edges
  3. at some level “data” invariably means a byte/octet stream

This is demonstrated particularly well by Recursive Length Prefix (RLP) who’s data structure, represented in Swift looks like this:

enum Element {
    case string([UInt8])
    case list([Element])
}

Alphabet bloat

It’s probably been argued elsewhere but I think the act of encoding data structures boiled down to one single problem: the act of craming a string in a large alphabet (or base if we’re using just numbers) into a smaller alphabet. For example, in the minimal example of RLP we need at least 256+2 characters: all of UInt8 and a way of distinguishing between strings and lists. There’s other stuff but the essential point is made.

As far as I can tell there are two approaches to this problem:

  1. a carefully constructed binary format such as Protobuf, JSON, or ASN1,
  2. use a dictionary based compressor such as Lempel–Ziv–Welch with a dictionary limited to the next binary block size, e.g. 256+x → 65536 or, my personal favourite, 16+x → 256.

Example

Here’s a simple encoding containing the very basics of JSON (or RLP + null). In the above parlence the alphabet size is 16+4.

enum JSON0: Int {
    case pad, null, push, pop
    case _0, _1, _2, _3, _4, _5, _6, _7, _8, _9, _a, _b, _c, _d, _e, _f
}

To show that this works I encoded a range of file using JSON+LZW and the normal GZIP. The results don’t suck although I’d be facinated to see a more bespoke table management strategy.

See also