Serialisation formats
Wikipedia has this to say about serialization:
In computing, serialization (or serialisation) is the process of translating a data structure or object state into a format that can be stored […]
A couple of notes on this:
- by “stored”, we invariably mean a byte/octet stream
- data structures invariably contain more than one kind of object that needs encoding, e.g. a tree is made up of nodes, which hold data, and edges
- at some level “data” invariably means a byte/octet stream
This is demonstrated particularly well by Recursive Length Prefix (RLP), whose data structure, represented in Swift, looks like this:
enum Element {
    case string([UInt8])
    case list([Element])
}
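For concreteness, here is a sketch of how RLP maps that enum onto bytes. It implements only the short-form rules (payloads of at most 55 bytes); the long-form length-of-length cases are omitted to keep the sketch small.

```swift
enum Element {
    case string([UInt8])
    case list([Element])
}

// Short-form RLP only: payloads longer than 55 bytes are not handled.
func encode(_ element: Element) -> [UInt8] {
    switch element {
    case .string(let bytes):
        // A single byte below 0x80 encodes as itself.
        if bytes.count == 1 && bytes[0] < 0x80 { return bytes }
        precondition(bytes.count <= 55, "long-form strings not handled in this sketch")
        return [0x80 + UInt8(bytes.count)] + bytes
    case .list(let elements):
        // A list's prefix counts the bytes of its already-encoded contents.
        let payload = elements.flatMap(encode)
        precondition(payload.count <= 55, "long-form lists not handled in this sketch")
        return [0xc0 + UInt8(payload.count)] + payload
    }
}
```

So `encode(.string([0x64, 0x6f, 0x67]))` ("dog") yields `[0x83, 0x64, 0x6f, 0x67]`, and the empty list encodes as the single byte `0xc0`.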
Alphabet bloat
It’s probably been argued elsewhere, but I think encoding data structures boils down to a single problem: cramming a string over a large alphabet (or base, if we’re using just numbers) into a smaller alphabet. For example, the minimal example of RLP needs at least 256+2 characters: all of UInt8 plus a way of distinguishing strings from lists. There are other details, but that’s the essential point.
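To make the counting concrete: 258 symbols no longer fit in one octet, so the naive answer is fixed-width packing at ⌈log₂ 258⌉ = 9 bits per symbol. A minimal sketch (the function and its interface are mine, purely for illustration):

```swift
// Pack fixed-width symbols into octets, most significant bit first.
// With a 258-symbol alphabet you would call this with bitsPerSymbol: 9.
func pack(symbols: [UInt16], bitsPerSymbol: Int) -> [UInt8] {
    var out: [UInt8] = []
    var acc = 0       // pending bits, kept in the low `nbits` bits
    var nbits = 0
    for s in symbols {
        acc = (acc << bitsPerSymbol) | Int(s)
        nbits += bitsPerSymbol
        while nbits >= 8 {
            nbits -= 8
            out.append(UInt8((acc >> nbits) & 0xff))
        }
        acc &= (1 << nbits) - 1  // drop the bits just emitted
    }
    if nbits > 0 {
        // Zero-pad the final partial byte.
        out.append(UInt8((acc << (8 - nbits)) & 0xff))
    }
    return out
}
```

This wastes nearly half a bit per symbol, which is part of why the dictionary-compressor route discussed next is attractive.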
As far as I can tell there are two approaches to this problem:
- a carefully constructed serialisation format such as Protobuf, JSON, or ASN.1,
- a dictionary-based compressor such as Lempel–Ziv–Welch with the dictionary limited to the next binary block size, e.g. 256+x → 65536 or, my personal favourite, 16+x → 256.
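The second approach can be sketched in a few lines. This is textbook LZW with one twist: the dictionary is capped (here at 256 entries, the 16+x → 256 case) and simply frozen once full, so every output code fits in a single octet. Real variants might instead reset the table; the cap-and-freeze policy here is an assumption for illustration.

```swift
// LZW with a capped dictionary, so each emitted code fits in one octet.
// `alphabetSize` is the size of the small input alphabet, e.g. 20 for 16+4.
func lzwCompress(_ input: [UInt8], alphabetSize: Int, maxCodes: Int = 256) -> [UInt8] {
    // Seed the dictionary with the single-symbol strings.
    var dict: [[UInt8]: Int] = [:]
    for s in 0..<alphabetSize { dict[[UInt8(s)]] = s }

    var out: [UInt8] = []
    var w: [UInt8] = []
    for c in input {
        let wc = w + [c]
        if dict[wc] != nil {
            w = wc  // keep extending the current match
        } else {
            out.append(UInt8(dict[w]!))
            if dict.count < maxCodes { dict[wc] = dict.count }  // freeze when full
            w = [c]
        }
    }
    if !w.isEmpty { out.append(UInt8(dict[w]!)) }
    return out
}
```

For example, over a 4-symbol alphabet the input `[0, 1, 0, 1, 0, 1]` compresses to `[0, 1, 4, 4]`: code 4 is the dictionary entry learned for the pair `[0, 1]`.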
Example
Here’s a simple encoding covering the very basics of JSON (or RLP + null). In the above parlance the alphabet size is 16+4.
enum JSON0: Int {
    case pad, null, push, pop
    case _0, _1, _2, _3, _4, _5, _6, _7, _8, _9, _a, _b, _c, _d, _e, _f
}
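The exact use of these symbols isn’t spelled out above, so here is one plausible reading of the 16+4 alphabet, which is my own guess rather than the definitive scheme: each payload byte becomes two hex-digit symbols, lists are bracketed by push/pop, and pad terminates a digit run so that two adjacent strings inside a list stay distinguishable.

```swift
enum JSON0: Int {
    case pad, null, push, pop
    case _0, _1, _2, _3, _4, _5, _6, _7, _8, _9, _a, _b, _c, _d, _e, _f
}

// The RLP Element from earlier, plus null.
enum Value {
    case null
    case string([UInt8])
    case list([Value])
}

// One plausible symbolisation (an assumption, not the author's scheme):
// bytes -> pairs of hex digits, pad closes a string, push/pop bracket lists.
func symbols(_ value: Value) -> [JSON0] {
    switch value {
    case .null:
        return [.null]
    case .string(let bytes):
        var out: [JSON0] = []
        for b in bytes {
            out.append(JSON0(rawValue: 4 + Int(b >> 4))!)   // high nibble
            out.append(JSON0(rawValue: 4 + Int(b & 0xf))!)  // low nibble
        }
        out.append(.pad)
        return out
    case .list(let items):
        return [.push] + items.flatMap(symbols) + [.pop]
    }
}
```

Under this reading, the byte string `[0xca]` becomes the three symbols `._c, ._a, .pad`, and the resulting symbol stream is exactly what would be fed to the capped LZW compressor above.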
To show that this works I encoded a range of files using JSON0+LZW and plain GZIP. The results don’t suck, although I’d be fascinated to see a more bespoke table-management strategy.