Huffman coding
A simple way of reducing the size of a block of data is to replace the most common words with shorter ones that are not already used in the data. It is possible to reconstruct the original data using a code book that lists the short words and the longer words they replace. The problem is to figure out which word combination will give the best possible compression.
Generating the Huffman codes using a binary tree
David A. Huffman found an algorithm for generating codes that is optimal: among all codes that assign one bit pattern per symbol, it guarantees the shortest possible output. It works by creating a binary tree. Each node in the tree has at most two child nodes, distinguished as "left" and "right". Nodes with children are parent nodes, and child nodes contain links to their parents.
make a table with one node for each symbol
find the frequency of each symbol in the data
do while there are at least two active nodes
    find the two active nodes with the lowest frequencies; these are the children
    create a new parent node and link it to each of the two child nodes,
        letting it contain the sum of the two children's frequencies
    mark the two child nodes as inactive
loop
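The loop above can be sketched in Python using a min-heap to find the two lowest-frequency active nodes (the tuple node layout and the tie-breaking counter are illustration choices, not part of the algorithm):

```python
import heapq
from collections import Counter

def build_huffman_tree(data):
    """Build a Huffman tree bottom-up.
    A node is a tuple (frequency, tiebreaker, symbol_or_None, left, right);
    the unique tiebreaker keeps heap comparisons away from the payload."""
    heap = [(freq, i, sym, None, None)
            for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                      # at least two active nodes
        left = heapq.heappop(heap)            # the two lowest frequencies
        right = heapq.heappop(heap)
        parent = (left[0] + right[0], count, None, left, right)
        count += 1
        heapq.heappush(heap, parent)          # children become inactive
    return heap[0]                            # root; frequency == len(data)
```

For "Hello_World" the root frequency is 11, the length of the message, matching the checksum property described below.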
Example
"Hello_World"

Symbol   Frequency
------------------
H        1
e        1
W        1
r        1
d        1
_        1
o        2
l        3
Each symbol by frequency in ascending order from left to right:
 H   e   W   r   d   _   o   l     ← Symbols
 1   1   1   1   1   1   2   3     ← Leaf nodes
  \ /     \ /     \ /     |   |
   2       2       2      |   |
    \     /         \     |   |
     \   /           \   /    |
      \ /             \ /     |
       4               4      |
        \               \     |
         \               \   /     ← Branch
          \               7
           \             /
            \           /
             \         /
              \       /
               \     /
                 11                ← Root node
The root node will contain the sum of all leaf nodes, which is the same as the length of the message. This can be used as a checksum to make sure that all nodes have been visited. Now we can generate the code for each symbol by assigning 0 to the left branch and 1 to the right branch.
Symbol   Code
---------------
H        000
e        001
W        010
r        011
d        1000
_        1001
o        101
l        11
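The 0-left/1-right walk can be sketched in Python. The nested-tuple node layout (symbol, left, right) is hypothetical, chosen only so the example tree above can be written out literally:

```python
def assign_codes(node, prefix=''):
    """Walk a Huffman tree, appending 0 for each left branch and
    1 for each right branch. node is (symbol, left, right);
    leaves have left == right == None."""
    symbol, left, right = node
    if left is None and right is None:
        return {symbol: prefix or '0'}   # '0' covers a one-symbol tree
    codes = {}
    codes.update(assign_codes(left, prefix + '0'))
    codes.update(assign_codes(right, prefix + '1'))
    return codes

# The example tree, written out as nested tuples:
tree = (None,
        (None, (None, ('H', None, None), ('e', None, None)),
               (None, ('W', None, None), ('r', None, None))),
        (None, (None, (None, ('d', None, None), ('_', None, None)),
                      ('o', None, None)),
               ('l', None, None)))
```

Calling assign_codes(tree) reproduces the code table above.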
The codes have the prefix property: no code is a prefix of another code. This means that during decoding we do not need to know the length of the next code in advance; we just keep reading bits until they identify a leaf node in the tree.
If we encode the original text we get:
H   e   l  l  o   _    W   o   r   l  d
000 001 11 11 101 1001 010 101 011 11 1000
The original message was 11 * 8 = 88 bits long, the compressed message is 32 bits long, a compression ratio of 88 / 32 = 2.75 : 1. This seems very good at first, but we need to include the Huffman tree in the message otherwise we would not be able to recover the original.
Storing the Huffman tree efficiently
Storing the symbol, the code and the code length for every symbol takes quite a lot of room and reduces the compression ratio. It turns out that there are many different Huffman trees that give the same compression ratio. To see why, take any two codes of the same length and swap them; the compression ratio is unaffected. Likewise, if two symbols of the same frequency have different code lengths, swapping their codes does not change the compression ratio either. This redundancy is why storing the complete tree is inefficient.
It turns out that it is possible to reconstruct an optimal tree from just knowing the code lengths of each symbol.
sort all symbols by code length in ascending order
code = 0
for each symbol
    print symbol, code
    code = (code + 1) << ((next code length) - (current code length))
loop
By using this algorithm in both the compressor and the decompressor we are guaranteed to generate the same tree on both sides, and we only have to store the code length for each symbol. When coding a relatively small range of symbols there is no need to store the symbols themselves: the code length for symbol i is stored at index i. So if each symbol is a byte we just store 256 code lengths; if each length fits in 4 bits, that is 128 bytes in total.
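This reconstruction can be sketched in Python. It shifts first and increments after, which is equivalent to the pseudocode above; the tie-breaking order among symbols of equal code length is an assumption and only has to match between compressor and decompressor:

```python
def canonical_codes(lengths):
    """Rebuild codes from code lengths alone (canonical Huffman codes).
    lengths: dict mapping symbol -> code length in bits.
    Returns dict mapping symbol -> code as a string of '0'/'1'."""
    # Sort by (length, symbol) so both sides derive the same order.
    ordered = sorted(lengths, key=lambda s: (lengths[s], s))
    codes = {}
    code = 0
    prev_len = lengths[ordered[0]]
    for sym in ordered:
        cur_len = lengths[sym]
        code <<= (cur_len - prev_len)   # pad with zeros when length grows
        codes[sym] = format(code, '0{}b'.format(cur_len))
        code += 1
        prev_len = cur_len
    return codes
```

Applied to the lengths from the example, this yields different bit patterns than the table above, but every symbol keeps its length and the prefix property still holds, so the compressed size is identical.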
Fast decompression
Reading the compressed data bit by bit and using it to navigate the tree is very slow. Since no code is a prefix of another code, it is possible to build a lookup table that maps every possible bit pattern to the decoded symbol and the number of bits in its code. That way decompression is just a very fast table lookup.
Imagine the code 101 and a maximum code length of 12 bits. Fill the 512 table entries of the form 101xxxxxxxxx with the symbol for code 101. Then every lookup with an index that starts with the bits 101 is guaranteed to return the correct symbol. The code length stored in the table is used to shift the bit buffer so that the next code is ready to be looked up.
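A minimal sketch of building such a table in Python, using the code table from the example and a 4-bit maximum (the function name and list-of-tuples layout are assumptions):

```python
def build_decode_table(codes, max_bits):
    """Flat decode table indexed by the next max_bits bits of input.
    Every index whose leading bits match a code maps to
    (symbol, code_length)."""
    table = [None] * (1 << max_bits)
    for symbol, code in codes.items():
        pad = max_bits - len(code)
        base = int(code, 2) << pad       # e.g. 101 -> 101x with max_bits=4
        for i in range(1 << pad):        # fill all trailing-bit variants
            table[base + i] = (symbol, len(code))
    return table
```

With the example codes, looking up any 4-bit window that starts with 101 returns ('o', 3); the decoder then consumes 3 bits and repeats.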
Transforms to increase efficiency
Consider a string of the form 11111111222222223333333344444444. Huffman coding cannot do better than a flat code here because every symbol has the same frequency. The key is to find a simple transform that turns the regularity in the data into an uneven distribution of symbol frequencies, which Huffman coding can then exploit.
Move-to-front
Keeps a list of all symbols; for each input symbol it outputs the symbol's current index in the list and then moves that symbol to the front of the list. Common symbols stay close to the front and are therefore represented by small values.

Variations of the transform move the symbol only part of the way towards the front, so that a rare symbol does not push a very common symbol out of the top position.
create a table of all symbols
for each input symbol
    search the table for the symbol
    print the index of the found symbol
    move the range from index 0 to the found index down one cell
        (overwriting the found symbol)
    insert the found symbol at index 0
loop

Output for the example string: 10000000200000003000000040000000
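A minimal Python sketch of the transform (assuming the table is initialised with all byte values 0–255, which is what makes the example string produce the output shown above):

```python
def move_to_front(data, alphabet_size=256):
    """Move-to-front transform: output each symbol's current index
    in the table, then move that symbol to the front."""
    table = list(range(alphabet_size))
    out = []
    for sym in data:
        idx = table.index(sym)   # linear search, as in the pseudocode
        out.append(idx)
        table.pop(idx)           # close the gap left by the symbol
        table.insert(0, sym)     # and reinsert it at index 0
    return out
```

Feeding it the byte values 1×8, 2×8, 3×8, 4×8 yields 1 followed by seven zeros, then 2, 3 and 4 each followed by seven zeros, a distribution Huffman coding handles well.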
EOR
Storing the exclusive or of each byte and the byte before it.
previous = 0
for each symbol
    print symbol eor previous
    previous = symbol
loop

Output for the example string: 10000000300000001000000070000000
Delta
Storing the difference between each byte and the byte before it.
previous = 0
for each symbol
    print symbol - previous
    previous = symbol
loop

Output for the example string: 10000000100000001000000010000000
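Both byte-wise transforms above fit in a few lines of Python (plain lists of ints; the modulo-256 wrap in the delta is an assumption so the output stays within byte range):

```python
def eor_transform(data):
    """XOR each byte with the previous byte (previous starts at 0)."""
    prev, out = 0, []
    for b in data:
        out.append(b ^ prev)
        prev = b
    return out

def delta_transform(data):
    """Difference of each byte from the previous byte, modulo 256."""
    prev, out = 0, []
    for b in data:
        out.append((b - prev) % 256)
        prev = b
    return out
```

On the example string the delta transform is the stronger of the two: it collapses the input to a single repeated value, 1, while EOR still leaves four distinct values.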
Burrows-Wheeler
This is a remarkable transform that sorts all rotations of the input and stores the last column, i.e. for each sorted row the symbol that preceded that row's first symbol in the original string. When followed by move-to-front it is very efficient.
Forward transform
Create all rotations of the input string "huffman"
Sort the rotations
Return the last column and the new row position of the input string

huffman          anhuffm
uffmanh          ffmanhu
ffmanhu   sort   fmanhuf
fmanhuf   -->    huffman  ← New position is row 4
manhuff          manhuff
anhuffm          nhuffma
nhuffma          uffmanh

Last column: "mufnfah", position 4
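The forward transform translates directly into Python (a naive O(n² log n) sketch that really builds every rotation; the returned row index is 0-based, so row 4 in the example becomes index 3):

```python
def bwt_forward(s):
    """Burrows-Wheeler forward transform: sort all rotations of s,
    return the last column and the row index of the original string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = ''.join(row[-1] for row in rotations)
    return last_column, rotations.index(s)
```

Production implementations avoid materialising the rotation matrix (e.g. via suffix arrays), but the result is the same.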
Reverse transform
string = "mufnfah"
position = 4

Create an empty table with len(string) rows
Repeat len(string) times
    Insert string as a new first column of the table
    Sort the rows
Return the row at position

1st iteration   insert    sort
                m......   a......
                u......   f......
                f......   f......
                n......   h......
                f......   m......
                a......   n......
                h......   u......

2nd iteration   insert    sort
                ma.....   an.....
                uf.....   ff.....
                ff.....   fm.....
                nh.....   hu.....
                fm.....   ma.....
                an.....   nh.....
                hu.....   uf.....

...

6th iteration   insert    sort
                manhuf.   anhuff.
                uffman.   ffmanh.
                ffmanh.   fmanhu.
                nhuffm.   huffma.
                fmanhu.   manhuf.
                anhuff.   nhuffm.
                huffma.   uffman.

7th iteration   insert    sort
                manhuff   anhuffm
                uffmanh   ffmanhu
                ffmanhu   fmanhuf
                nhuffma   huffman  ← Answer at row 4
                fmanhuf   manhuff
                anhuffm   nhuffma
                huffman   uffmanh
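The iteration above can be sketched almost line for line in Python (again a naive O(n² log n) version; the position is 0-based here, so row 4 in the example becomes index 3):

```python
def bwt_inverse(last_column, position):
    """Burrows-Wheeler inverse transform: repeatedly prepend the
    stored last column to the sorted rows, exactly as in the
    worked example, then return the row at the stored position."""
    rows = [''] * len(last_column)
    for _ in range(len(last_column)):
        rows = sorted(last_column[i] + rows[i]
                      for i in range(len(last_column)))
    return rows[position]
```

Running bwt_inverse("mufnfah", 3) rebuilds the original string; faster linear-time reconstructions exist, but this version mirrors the table walkthrough above.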