Understanding Deflate
I’m trying to understand how Deflate works so decided
to compress a simple string TOBEORNOTTOBEORTOBEORNOT
using GZIP
then decode the resulting file by hand.
Compressing the data
Pretty simple here, text in bytes out:
$ echo -n 'TOBEORNOTTOBEORTOBEORNOT' | gzip -n | xxd -ps -u
1F8B08000000000000030BF17772F50FF2F30F09013342605C00F14E3D2D
18000000
Reading the GZIP data
Even though I’m really interested in the compressed data I have to decode the GZIP “wrapper” in order to get at the juicy compressed data. Fortunately the Wikipedia page has the neccessary details:
1F8B Magic number.
08 Compression method. Must be 8 (Deflate).
00 Flags 0 = no flags
00000000 Unix time when the file was last modified. 0 means no timestamp is available.
00 Extra flags. 0: None (default value)
03 Filesystem on which compression occurred. 3: Unix
0BF17772 The compressed data.
F50FF2F3 The compressed data.
0F090133 The compressed data.
42605C00 The compressed data.
F14E3D2D CRC-32 (ISO 3309) of the uncompressed data.
18000000 Size (in bytes) of the uncompressed data = 24
Decoding the compressed data
Here we have to refer directly to the DEFLATE Compressed Data Format Specification version 1.3.
Skipping ahead, we can see that there is only one block (deflate allows multiple) which was encoded with “fixed Huffman codes” used to encode LZ77 tokens of the form:
enum LZ77Token {
case character(literal: UInt8)
case copyCommand(length: UInt, distance: UInt)
case endOfBlock // In deflate there is also one extra code `256` which marks the end of the block.
}
Characters are encoded as a single symbol while the copy command is encoded as follows:
struct CopyCommand {
let huffmanCode // Code between 257 and 285 (inclusive)
let extraLengthBits // as defined by the huffman code, eg code 276 requires 3 extra bits
let distanceCode // 5 bits which define a code between 0 and 29 (inclusive)
let distanceBits // as defined by the distance code, eg code 20 requires 9 extra bits
}
1 BFINAL is set if and only if this is the last block of the data set.
10 BTYPE specifies how the data are compressed, 10 - compressed with fixed Huffman codes
10000100 Symbol code 84: "T"
01111111 Symbol code 79: "O"
01110010 Symbol code 66: "B"
01110101 Symbol code 69: "E"
01111111 Symbol code 79: "O"
10000010 Symbol code 82: "R"
01111110 Symbol code 78: "N"
01111111 Symbol code 79: "O"
10000100 Symbol code 84: "T"
10000100 Symbol code 84: "T"
0000011 Symbol code 259, extra bits 0, length 5 (from input this should be "OBEOR" with distance 9)
00110 Distance code 6, extra bits 2, distance 9-12
00 Distance 0 -> 9 -> "OBEOR"
10000100 Symbol code 84: "T"
0000110 Symbol code 262, extra bits 0, length 8 (from input this should be "OBEORNOT" with distance ?)
00111 Distance code 7, extra bits 2, distance 13-16
01 Distance 1 -> 15 -> "OBEORNOT" (feels like it should be 14 but I think the bits must be the opposite "endian" here, ie 2 instead of one)
0000000 Symbol code 256 (end of data)
00 Padding to the align to the byte boundary
Wrapping up
Further simplfying the output we can see there were various literal symbols and copy commands:
let tokens: [LZ77Token] = [
.character(literal: 0x54), // "T"
.character(literal: 0x4F), // "O"
.character(literal: 0x42), // "B"
.character(literal: 0x45), // "E"
.character(literal: 0x4F), // "O"
.character(literal: 0x52), // "R"
.character(literal: 0x4E), // "N"
.character(literal: 0x4F), // "O"
.character(literal: 0x54), // "T"
.character(literal: 0x54), // "T"
.copyCommand(length: 5, distance: 9), // "OBEOR"
.character(literal: 0x54), // "T"
.copyCommand(length: 8, distance: 15), // "OBEORNOT"
.endOfBlock,
]
In summary, encoding using bits instead of bytes can be quite effecrtive in reducing data length, even in the short case the data was reduced from 24 to 16 bytes (strictly 15.75 without padding). That said, even in this simple case, decoding by hand was a pain.