jjrscott

Leaky wire formats

I’ve recently been working with Protocol Buffers and noticed what I think is an issue: the wire format is a leaky abstraction. Both data blobs and child messages are encoded with the same length-delimited wire type, so a decoder has no way to tell them apart.

Below is a summary of the JSON wire format with each case being a uniquely decodable unit:

enum JSONObject {
    case string(String)
    case array([JSONObject])
    case object([(String, JSONObject)])
    case true
    case false
    case null
    case number(Float | Integer)
}

There are seven cases here, and the only thing a parser can’t work out is whether a number was a floating-point number or an integer.
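To make "uniquely decodable" concrete, here is a minimal sketch showing that the first non-whitespace character of a JSON value is enough to pick the case. The names `JSONKind` and `jsonKind(firstCharacter:)` are hypothetical, invented for this illustration, and the function only classifies; it does not validate the rest of the value.

```swift
// Hypothetical names for illustration; not part of any JSON library.
enum JSONKind: Equatable {
    case string, array, object, trueLiteral, falseLiteral, null, number
}

/// Picks the JSON case from the first non-whitespace character alone.
/// Returns nil if the character cannot start a JSON value.
func jsonKind(firstCharacter: Character) -> JSONKind? {
    switch firstCharacter {
    case "\"": return .string
    case "[": return .array
    case "{": return .object
    case "t": return .trueLiteral
    case "f": return .falseLiteral
    case "n": return .null
    case "-", "0"..."9": return .number
    default: return nil
    }
}
```

No schema is consulted anywhere: the bytes themselves say which case follows.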

Below is a summary of the condensed reference card for the Protocol Buffers wire format:

enum PBElement {
    case varint(UInt32 | UInt64 | SInt32 | SInt64 | Bool)
    case fixed32bit(UInt32 | SInt32 | Float)
    case fixed64bit(UInt64 | SInt64 | Double)
    case delimited(Data | String | [(UInt, PBElement)] | ...)
    case group([(UInt, PBElement)])
}

For some almost certainly sensible reason, version 3 of the specification deprecates group in favour of delimited. However, this means that a Protocol Buffers definition (the schema) is required to decode the low-level wire format: without it, a delimited payload could equally be raw data, a string, or a nested message.
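The ambiguity can be shown with two bytes. The sketch below takes the payload of a hypothetical delimited field and reads it both as a UTF-8 string and as a nested message. The helper names are made up for this illustration, and the message parser only understands wire type 0 (varint) to keep it short.

```swift
/// Reads a base-128 varint starting at `index`; returns nil if the input is truncated.
func readVarint(_ bytes: [UInt8], _ index: inout Int) -> UInt64? {
    var result: UInt64 = 0
    var shift: UInt64 = 0
    while index < bytes.count, shift < 64 {
        let byte = bytes[index]
        index += 1
        result |= UInt64(byte & 0x7F) << shift
        if byte & 0x80 == 0 { return result }   // high bit clear: last byte
        shift += 7
    }
    return nil
}

/// Tries to read `bytes` as a message made only of varint fields.
/// A tag is (field number << 3) | wire type; wire type 0 means varint.
func parseAsVarintFields(_ bytes: [UInt8]) -> [(field: UInt64, value: UInt64)]? {
    var index = 0
    var fields: [(field: UInt64, value: UInt64)] = []
    while index < bytes.count {
        guard let tag = readVarint(bytes, &index),
              tag & 0x7 == 0,            // only wire type 0 is handled here
              tag >> 3 != 0,             // field number 0 is not valid
              let value = readVarint(bytes, &index) else { return nil }
        fields.append((field: tag >> 3, value: value))
    }
    return fields
}

let payload = Array("Hi".utf8)                           // [0x48, 0x69]
let asString = String(decoding: payload, as: UTF8.self)  // a perfectly good string
let asMessage = parseAsVarintFields(payload)             // also a well-formed message
```

Here 0x48 is a tag for field 9 with wire type 0, and 0x69 is the varint 105, so the same two bytes are both the string "Hi" and a message whose field 9 holds 105. Nothing in the bytes says which reading was intended.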

The bare minimum

To my mind the atomic units of any recursive/context-free encoding format are a single terminal together with a single non-terminal:

enum Element {
    case blob(Data)
    case message([Element])
}
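For illustration only, here is one way such a two-case format could be framed on the wire so that a decoder never needs a schema. The layout is an assumption invented for this sketch (one kind byte, 0 for blob and 1 for message, then a varint byte length, then the payload) and is not the implementation used by the author. `[UInt8]` stands in for `Data` so the sketch needs no imports.

```swift
enum Element: Equatable {
    case blob([UInt8])
    case message([Element])
}

func writeVarint(_ value: Int, into out: inout [UInt8]) {
    var v = UInt64(value)
    while v >= 0x80 {
        out.append(UInt8(v & 0x7F) | 0x80)
        v >>= 7
    }
    out.append(UInt8(v))
}

func readVarint(_ bytes: [UInt8], at index: inout Int) -> Int? {
    var result = 0
    var shift = 0
    while index < bytes.count, shift < 63 {
        let byte = bytes[index]
        index += 1
        result |= Int(byte & 0x7F) << shift
        if byte & 0x80 == 0 { return result }
        shift += 7
    }
    return nil
}

/// Assumed layout: kind byte (0 = blob, 1 = message), varint length, payload.
func encode(_ element: Element) -> [UInt8] {
    var out: [UInt8] = []
    let payload: [UInt8]
    switch element {
    case .blob(let bytes):
        out.append(0)
        payload = bytes
    case .message(let children):
        out.append(1)
        payload = children.flatMap { encode($0) }
    }
    writeVarint(payload.count, into: &out)
    return out + payload
}

/// Decodes one element starting at `index`, advancing `index` past it.
func decode(_ bytes: [UInt8], at index: inout Int) -> Element? {
    guard index < bytes.count else { return nil }
    let kind = bytes[index]
    index += 1
    guard let length = readVarint(bytes, at: &index),
          length <= bytes.count - index else { return nil }
    let end = index + length
    switch kind {
    case 0:
        let blob = Array(bytes[index..<end])
        index = end
        return .blob(blob)
    case 1:
        var children: [Element] = []
        while index < end {
            // A child must not run past the end of its parent.
            guard let child = decode(bytes, at: &index), index <= end else { return nil }
            children.append(child)
        }
        return .message(children)
    default:
        return nil
    }
}

let tree: Element = .message([.blob([1, 2, 3]), .message([]), .blob([])])
let wire = encode(tree)
var cursor = 0
let decoded = decode(wire, at: &cursor)
```

Because the kind byte distinguishes the terminal from the non-terminal, the round trip recovers the original tree from the bytes alone.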

The compromise

Now, for various reasons such as Swift’s Codable, I’ve found it much easier to use the following, not quite minimal, wire format:

enum Element {
    case blob(Data)
    case message([(UInt, Element)])
}

In this case the non-terminal case encodes as a series of values keyed by an unsigned integer. I’m not going to lay out my actual implementation here because I don’t want to take away from my main point: the importance of a non-leaking abstraction.