jjrscott

Understanding Deflate

I’m trying to understand how Deflate works so decided to compress a simple string TOBEORNOTTOBEORTOBEORNOT using GZIP then decode the resulting file by hand.

Compressing the data

Pretty simple here, text in bytes out:

$ echo -n 'TOBEORNOTTOBEORTOBEORNOT' | gzip -n | xxd -ps -u
1F8B08000000000000030BF17772F50FF2F30F09013342605C00F14E3D2D
18000000

Reading the GZIP data

Even though I’m really interested in the compressed data I have to decode the GZIP “wrapper” in order to get at the juicy compressed data. Fortunately the Wikipedia page has the neccessary details:

1F8B       Magic number.
08         Compression method. Must be 8 (Deflate).
00         Flags 0 = no flags
00000000   Unix time when the file was last modified. 0 means no timestamp is available.
00         Extra flags. 0: None (default value)
03         Filesystem on which compression occurred. 3: Unix
0BF17772   The compressed data.
F50FF2F3   The compressed data.
0F090133   The compressed data.
42605C00   The compressed data.
F14E3D2D   CRC-32 (ISO 3309) of the uncompressed data.
18000000   Size (in bytes) of the uncompressed data = 24

Decoding the compressed data

Here we have to refer directly to the DEFLATE Compressed Data Format Specification version 1.3.

Skipping ahead, we can see that there is only one block (deflate allows multiple) which was encoded with “fixed Huffman codes” used to encode LZ77 tokens of the form:

enum LZ77Token {
    case character(literal: UInt8)
    case copyCommand(length: UInt, distance: UInt)
    case endOfBlock // In deflate there is also one extra code `256` which marks the end of the block.
}

Characters are encoded as a single symbol while the copy command is encoded as follows:

struct CopyCommand {
    let huffmanCode // Code between 257 and 285 (inclusive)
    let extraLengthBits // as defined by the huffman code, eg code 276 requires 3 extra bits
    let distanceCode // 5 bits which define a code between 0 and 29 (inclusive)
    let distanceBits // as defined by the distance code, eg code 20 requires 9 extra bits
}
1          BFINAL is set if and only if this is the last block of the data set.
10         BTYPE specifies how the data are compressed, 10 - compressed with fixed Huffman codes
10000100   Symbol code 84: "T"
01111111   Symbol code 79: "O"
01110010   Symbol code 66: "B"
01110101   Symbol code 69: "E"
01111111   Symbol code 79: "O"
10000010   Symbol code 82: "R"
01111110   Symbol code 78: "N"
01111111   Symbol code 79: "O"
10000100   Symbol code 84: "T"
10000100   Symbol code 84: "T"
0000011    Symbol code 259, extra bits 0, length 5 (from input this should be "OBEOR" with distance 9)
00110      Distance code 6, extra bits 2, distance 9-12
00         Distance 0 -> 9 -> "OBEOR"
10000100   Symbol code 84: "T"
0000110    Symbol code 262, extra bits 0, length 8 (from input this should be "OBEORNOT" with distance ?)
00111      Distance code 7, extra bits 2, distance 13-16
01         Distance 1 -> 15 -> "OBEORNOT" (feels like it should be 14 but I think the bits must be the opposite "endian" here, ie 2 instead of one)
0000000    Symbol code 256 (end of data)
00         Padding to the align to the byte boundary

Wrapping up

Further simplfying the output we can see there were various literal symbols and copy commands:

let tokens: [LZ77Token] = [
    .character(literal: 0x54), // "T"
    .character(literal: 0x4F), // "O"
    .character(literal: 0x42), // "B"
    .character(literal: 0x45), // "E"
    .character(literal: 0x4F), // "O"
    .character(literal: 0x52), // "R"
    .character(literal: 0x4E), // "N"
    .character(literal: 0x4F), // "O"
    .character(literal: 0x54), // "T"
    .character(literal: 0x54), // "T"
    .copyCommand(length: 5, distance: 9), // "OBEOR"
    .character(literal: 0x54), // "T"
    .copyCommand(length: 8, distance: 15), // "OBEORNOT"
    .endOfBlock,
]

In summary, encoding using bits instead of bytes can be quite effecrtive in reducing data length, even in the short case the data was reduced from 24 to 16 bytes (strictly 15.75 without padding). That said, even in this simple case, decoding by hand was a pain.