Leaky wire formats
I’ve recently been working with Protocol Buffers and noticed what I think is an issue: there is a leaky abstraction in the wire format. Both data blobs and child messages are encoded with the wire type delimited, so there is no way to differentiate between them when decoding.
Below is a summary of the JSON wire format, with each case being a uniquely decodable unit:
enum JSONObject {
    case string(String)
    case array([JSONObject])
    case object([(String, JSONObject)])
    case true
    case false
    case null
    case number(Float | Integer)
}
There are seven different cases here, and the only thing the parser can’t work out is whether a number was a floating-point number or an integer.
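To illustrate what "uniquely decodable" means here, the sketch below picks the case from nothing more than the first non-whitespace byte of a JSON value. The names `JSONKind` and `kind(ofJSON:)` are made up for this example and do not come from any JSON library; it only classifies, it does not validate or parse the rest of the value.

```swift
/// Hypothetical mirror of the seven cases above, without payloads.
enum JSONKind {
    case string, array, object, trueLiteral, falseLiteral, null, number
}

/// Classifies a JSON value by its first non-whitespace byte.
/// Returns nil for empty input or a byte that cannot start a JSON value.
func kind(ofJSON text: String) -> JSONKind? {
    // JSON whitespace: space, tab, line feed, carriage return.
    let whitespace: Set<UInt8> = [0x20, 0x09, 0x0A, 0x0D]
    guard let first = text.utf8.first(where: { !whitespace.contains($0) }) else {
        return nil
    }
    switch first {
    case UInt8(ascii: "\""): return .string
    case UInt8(ascii: "["): return .array
    case UInt8(ascii: "{"): return .object
    case UInt8(ascii: "t"): return .trueLiteral
    case UInt8(ascii: "f"): return .falseLiteral
    case UInt8(ascii: "n"): return .null
    case UInt8(ascii: "-"), UInt8(ascii: "0")...UInt8(ascii: "9"):
        // Integer or floating point: the one thing the first byte cannot tell us.
        return .number
    default:
        return nil
    }
}
```

No schema is consulted at any point: the bytes themselves say which case they belong to.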
Below is a summary of the Condensed Reference Card for the Protocol Buffers wire format:
enum PBElement {
    case varint(UInt32 | UInt64 | SInt32 | SInt64 | Bool)
    case fixed32bit(UInt32 | SInt32 | Float)
    case fixed64bit(UInt64 | SInt64 | Double)
    case delimited(Data | String | [(UInt, PBElement)] | ...)
    case group([(UInt, PBElement)])
}
For some almost certainly sensible reason, version 3 of the specification deprecates group in favour of delimited. However, a delimited payload may be raw data, a string, or a child message, so a Protocol Buffers definition is required to decode the low-level wire format.
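The ambiguity can be seen with a very small reader. The sketch below is not a full decoder and is not taken from any Protocol Buffers library: `RawValue`, `RawField`, `readVarint` and `readFields` are hypothetical names, and only wire types 0 (varint) and 2 (delimited) are handled.

```swift
/// A field value as seen on the wire, before any schema is applied.
enum RawValue: Equatable {
    case varint(UInt64)
    case delimited([UInt8])
}

struct RawField: Equatable {
    let number: UInt64
    let value: RawValue
}

/// Reads a base-128 varint at `index`, advancing `index` past it.
func readVarint(_ bytes: [UInt8], _ index: inout Int) -> UInt64? {
    var result: UInt64 = 0
    var shift: UInt64 = 0
    while index < bytes.count, shift < 64 {
        let byte = bytes[index]
        index += 1
        result |= UInt64(byte & 0x7F) << shift
        if byte & 0x80 == 0 { return result }
        shift += 7
    }
    return nil // truncated or over-long varint
}

/// Splits a buffer into raw fields. Returns nil on malformed input
/// or on a wire type this sketch does not cover.
func readFields(_ bytes: [UInt8]) -> [RawField]? {
    var fields: [RawField] = []
    var index = 0
    while index < bytes.count {
        guard let key = readVarint(bytes, &index) else { return nil }
        let number = key >> 3 // the low three bits hold the wire type
        switch key & 0x7 {
        case 0:
            guard let value = readVarint(bytes, &index) else { return nil }
            fields.append(RawField(number: number, value: .varint(value)))
        case 2:
            guard let length = readVarint(bytes, &index),
                  length <= UInt64(bytes.count - index) else { return nil }
            let end = index + Int(length)
            fields.append(RawField(number: number,
                                   value: .delimited(Array(bytes[index..<end]))))
            index = end
        default:
            return nil // fixed32, fixed64 and groups are omitted here
        }
    }
    return fields
}
```

Feeding it the four bytes `0A 02 08 01` yields field 1 as a delimited payload `08 01`. Running `readFields` again on that payload succeeds too, giving field 1 with the varint value 1, yet the same two bytes are just as valid as an opaque blob. Nothing on the wire says which reading was intended.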
The bare minimum
To my mind the atomic units of any recursive/context-free encoding format are a single terminal together with a single non-terminal:
enum Element {
    case blob(Data)
    case message([Element])
}
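As a rough illustration of how such a format can avoid the leak, here is one hypothetical byte layout: a tag byte (0 for a blob, 1 for a message), a varint length, then the body. This is a sketch under those assumptions rather than my actual implementation, and it uses `[UInt8]` in place of `Data` so that it needs no imports.

```swift
enum Element: Equatable {
    case blob([UInt8])
    case message([Element])
}

/// Appends `value` as a base-128 varint.
func writeVarint(_ value: Int, to out: inout [UInt8]) {
    var v = UInt64(value)
    while v >= 0x80 {
        out.append(UInt8(v & 0x7F) | 0x80)
        v >>= 7
    }
    out.append(UInt8(v))
}

/// Reads a base-128 varint at `index`, advancing `index` past it.
func readVarint(_ bytes: [UInt8], _ index: inout Int) -> UInt64? {
    var result: UInt64 = 0
    var shift: UInt64 = 0
    while index < bytes.count, shift < 64 {
        let byte = bytes[index]
        index += 1
        result |= UInt64(byte & 0x7F) << shift
        if byte & 0x80 == 0 { return result }
        shift += 7
    }
    return nil
}

/// Assumed layout: tag byte, varint body length, body.
func encode(_ element: Element) -> [UInt8] {
    let tag: UInt8
    let body: [UInt8]
    switch element {
    case .blob(let bytes):
        tag = 0
        body = bytes
    case .message(let children):
        tag = 1
        body = children.flatMap { encode($0) }
    }
    var out: [UInt8] = [tag]
    writeVarint(body.count, to: &out)
    out += body
    return out
}

/// Decodes one element at `index`; the tag alone picks the case.
func decode(_ bytes: [UInt8], from index: inout Int) -> Element? {
    guard index < bytes.count else { return nil }
    let tag = bytes[index]
    index += 1
    guard let length = readVarint(bytes, &index),
          length <= UInt64(bytes.count - index) else { return nil }
    let end = index + Int(length)
    switch tag {
    case 0:
        defer { index = end }
        return .blob(Array(bytes[index..<end]))
    case 1:
        var children: [Element] = []
        while index < end {
            guard let child = decode(bytes, from: &index) else { return nil }
            children.append(child)
        }
        // A child that overran the declared body length is malformed.
        return index == end ? .message(children) : nil
    default:
        return nil
    }
}

/// Decodes a buffer that must contain exactly one element.
func decode(_ bytes: [UInt8]) -> Element? {
    var index = 0
    guard let element = decode(bytes, from: &index), index == bytes.count else {
        return nil
    }
    return element
}
```

The point of the tag byte is that a blob whose contents happen to look like an encoded message still decodes as a blob: `decode(encode(.blob(encode(.message([])))))` gives back `.blob([1, 0])`, not an empty message.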
The compromise
Now, for various reasons such as Swift’s Codable, I’ve found it much easier to use the following, not quite minimal, wire format:
enum Element {
    case blob(Data)
    case message([(UInt, Element)])
}
In this case the non-terminal case encodes as a series of values keyed by an unsigned integer. I’m not going to lay out my actual implementation here because I don’t want to take away from my main point: the importance of a non-leaking abstraction.