Huffman coding

A simple way of reducing the size of a block of data is to replace the most common words with shorter ones that are not already used in the data. It is possible to reconstruct the original data using a code book that lists the short words and the longer words they replace. The problem is to figure out which set of replacement words gives the best possible compression.

Generating the Huffman codes using a binary tree

David A. Huffman found a way of generating codes that are optimal in the sense that no other symbol-by-symbol code gives a shorter output. It is done by building a binary tree. Each node in the tree has at most two child nodes, distinguished as "left" and "right". Nodes with children are parent nodes, and each child node holds a link to its parent.

Make a table with one node for each symbol
Find the frequency of each symbol in the data

while there are at least two active nodes
  Find the two active nodes with the lowest frequencies, these are the children
  Create a new active parent node linked to the two child nodes, let it contain the sum of their frequencies
  Mark the two child nodes as inactive
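
A minimal sketch of this algorithm in Python; the function and variable names are our own:

import heapq
from collections import Counter
from itertools import count

def huffman_codes(data):
    # Count symbol frequencies, then repeatedly merge the two
    # lowest-frequency active nodes until one root remains.
    freq = Counter(data)
    tick = count()  # tie-breaker so the heap never compares trees
    heap = [(f, next(tick), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) >= 2:                  # at least two active nodes
        f1, _, left = heapq.heappop(heap)  # the two lowest frequencies
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(tree, code):
        if isinstance(tree, tuple):        # parent node: 0 = left, 1 = right
            walk(tree[0], code + "0")
            walk(tree[1], code + "1")
        else:
            codes[tree] = code or "0"      # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman_codes("Hello_World"))

Because ties between equal frequencies can be broken in different ways, the exact codes may differ from the table below, but the code lengths, and therefore the compressed size, are the same.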

Example

"Hello_World"

Symbol   Frequency
---------------
     H - 1
     e - 1
     W - 1
     r - 1
     d - 1
     _ - 1
     o - 2
     l - 3


Arrange the symbols by frequency in ascending order from left to right:

H e W r d _ o l  ← Symbols
1 1 1 1 1 1 2 3  ← Leaf nodes
\ / \ / \ / | |
 2   2   2  | |
  \ /     \ / |
   4       4  |
    \       \ /  ← Branch
     \       7
      \     /
       \   /
        11       ← Root node

The root node contains the sum of all the leaf nodes, which is the same as the length of the message. This can be used as a checksum to make sure that all nodes have been visited. Now we can generate the code for each symbol by assigning 0 to every left branch and 1 to every right branch.

Symbol   Code
---------------
     H - 000
     e - 001
     W - 010
     r - 011
     d - 1000
     _ - 1001
     o - 101
     l - 11

The codes have the property that no code is a prefix of another code. This means that during decoding we don't need to know the length of the next code; we just keep reading bits until we have enough to identify a leaf node in the tree.

If we encode the original text we get:

H   e   l  l  o   _    W   o   r   l  d  
000 001 11 11 101 1001 010 101 011 11 1000
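
A sketch of encoding and prefix decoding in Python, using the code table above (the names are our own):

CODES = {'H': '000', 'e': '001', 'W': '010', 'r': '011',
         'd': '1000', '_': '1001', 'o': '101', 'l': '11'}

def encode(text):
    return "".join(CODES[c] for c in text)

def decode(bits):
    inverse = {code: sym for sym, code in CODES.items()}
    out, current = [], ""
    for b in bits:              # read bits until they form a complete code
        current += b
        if current in inverse:  # prefix property: the first match is correct
            out.append(inverse[current])
            current = ""
    return "".join(out)

bits = encode("Hello_World")
print(len(bits), decode(bits))  # 32 Hello_World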

The original message was 11 * 8 = 88 bits long and the compressed message is 32 bits long, a compression ratio of 88 / 32 = 2.75 : 1. This seems very good at first, but we also need to include the Huffman tree in the message, otherwise we would not be able to recover the original.

Storing the Huffman tree efficiently

Storing the symbol, the code, and the code length for every entry takes quite a lot of room and reduces the compression ratio. It turns out that there are many different Huffman trees that give the same compression ratio. To see why, take any two codes of the same length and swap them; the compressed size does not change. Likewise, if two symbols of the same frequency have different code lengths, swapping their codes does not change the compressed size either. This redundancy is why storing the complete tree is inefficient.

It turns out that it is possible to reconstruct an optimal tree from just knowing the code lengths of each symbol.

Sort all symbols by code length in ascending order, breaking ties in a fixed order (for example by symbol value) so that the compressor and decompressor agree

code = 0
for each symbol
    print symbol, code, length
    code = (code + 1) << ((next symbol code length) - (current code length))

By using this algorithm in both the compressor and decompressor, we are guaranteed to generate the same tree in both, and we only have to store the code length of each symbol. When coding a relatively small range of symbols, there is no need to store the symbols themselves: the code length for symbol s is stored at index s. So if each symbol is a byte, we just store 256 code lengths; each fits in 4 bits, so 128 bytes in total.
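
A sketch of this reconstruction in Python, following the pseudocode above (the names are our own):

def canonical_codes(lengths):
    # Rebuild codes from a {symbol: code length} map.
    # Ties are broken by symbol value so both sides build the same codes.
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code = {}, 0
    for i, sym in enumerate(symbols):
        codes[sym] = format(code, "0%db" % lengths[sym])
        if i + 1 < len(symbols):
            next_len = lengths[symbols[i + 1]]
            code = (code + 1) << (next_len - lengths[sym])
    return codes

lengths = {'l': 2, 'H': 3, 'e': 3, 'W': 3, 'r': 3, 'o': 3, 'd': 4, '_': 4}
print(canonical_codes(lengths))

The codes produced this way differ from the tree-derived table above, but every symbol keeps its code length, so the compressed size is unchanged.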

Fast decompression

Reading the compressed data bit by bit and using it to navigate the tree is very slow. Since no code is a prefix of any other code, we can build a lookup table indexed by the next few bits of the input, storing for each index the decoded symbol and the length of its code. Decompression then becomes a very fast table lookup.

Imagine a code 101 and a maximum code length of 12 bits. Fill the 512 table entries of the form 101xxxxxxxxx with the symbol for the code 101. Then every lookup with an index that starts with the bits 101 is guaranteed to return the correct symbol. The code length stored in the table is used to shift the bit buffer so that the next code is ready to be looked up.
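
A sketch of the table-driven decoder in Python, assuming a 12-bit maximum code length and the code table from the example (the names are our own):

MAX_LEN = 12  # assumed maximum code length in bits

def build_table(codes):
    # One entry per possible MAX_LEN-bit window: (symbol, code length).
    table = [None] * (1 << MAX_LEN)
    for sym, code in codes.items():
        pad = MAX_LEN - len(code)
        base = int(code, 2) << pad   # e.g. 101xxxxxxxxx for code 101
        for i in range(1 << pad):    # fill every entry sharing the prefix
            table[base + i] = (sym, len(code))
    return table

def fast_decode(bits, codes):
    table = build_table(codes)
    out, pos = [], 0
    while pos < len(bits):
        window = bits[pos:pos + MAX_LEN].ljust(MAX_LEN, "0")
        sym, length = table[int(window, 2)]
        out.append(sym)
        pos += length                # consume only the bits of this code
    return "".join(out)

CODES = {'H': '000', 'e': '001', 'W': '010', 'r': '011',
         'd': '1000', '_': '1001', 'o': '101', 'l': '11'}
bits = "".join(CODES[c] for c in "Hello_World")
print(fast_decode(bits, CODES))  # Hello_World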

Transforms to increase efficiency

Consider a string of the form 11111111222222223333333344444444. Huffman coding cannot exploit its obvious structure, because every symbol has the same frequency and frequencies are all that Huffman coding sees. The key is to find a simple transform that turns the regularity in the data into an uneven symbol frequency distribution.

Move-to-front

Keep a list of all symbols. For each symbol transformed, output its index in the list, then move that symbol to the front of the list. Common symbols will stay close to the front and be represented by small values.

Variations of the transform move the symbol only part of the way towards the front, so that rare symbols will not push a very common symbol down from the top position.

create table of symbols

for each symbol in the data
  search table for symbol
  print index of found symbol
  move table entries 0 to index - 1 down one cell (overwriting the found symbol)
  insert found symbol at index 0

10000000200000003000000040000000
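
A sketch of the transform in Python, applied to the example string from the start of this section (the names are our own):

def move_to_front(data, alphabet):
    table = list(alphabet)
    out = []
    for sym in data:
        i = table.index(sym)   # search table for symbol
        out.append(i)          # output the index of the found symbol
        table.pop(i)           # move entries 0..i-1 down one cell
        table.insert(0, sym)   # insert the found symbol at index 0
    return out

indices = move_to_front("11111111222222223333333344444444", "0123456789")
print("".join(str(i) for i in indices))  # 10000000200000003000000040000000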

EOR

Storing the exclusive or of each byte and the byte before it.

previous = 0
for each symbol
  print symbol eor previous
  previous = symbol

10000000300000001000000070000000

Delta

Storing the difference between each byte and the byte before it.

previous = 0
for each symbol
  print symbol - previous
  previous = symbol

10000000100000001000000010000000
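
Both transforms are direct translations of the pseudocode; a sketch in Python operating on the digit values of the example string (the names are our own):

def eor_transform(data):
    prev, out = 0, []
    for b in data:
        out.append(b ^ prev)  # exclusive or with the previous byte
        prev = b
    return out

def delta_transform(data):
    prev, out = 0, []
    for b in data:
        out.append(b - prev)  # difference from the previous byte
        prev = b
    return out

digits = [int(c) for c in "11111111222222223333333344444444"]
print("".join(map(str, eor_transform(digits))))    # 10000000300000001000000070000000
print("".join(map(str, delta_transform(digits))))  # 10000000100000001000000010000000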

Burrows-Wheeler

This is a remarkable transform that sorts all rotations of the input and stores the last column of the sorted rows, which for each row is the symbol that comes before it in the original string. Because sorting groups similar contexts together, the output tends to contain long runs of the same symbol, and it is very efficient when followed by move-to-front.

Forward transform

Create all rotations of the input string "huffman"
Sort rows
Return the last column and the new row position of the input string 

huffman        anhuff m
uffmanh        ffmanh u
ffmanhu  sort  fmanhu f
fmanhuf   -->  huffma n  <-- New position is row 4
manhuff        manhuf f
anhuffm        nhuffm a
nhuffma        uffman h
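
A sketch of the forward transform in Python (the names are our own; note that the row index here counts from 0, while the diagram counts from 1):

def bwt_forward(s):
    # Sort all rotations, return the last column and the row of the input.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(row[-1] for row in rotations)
    return last_column, rotations.index(s)

print(bwt_forward("huffman"))  # ('mufnfah', 3) -- row 4 counting from 1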

Reverse transform

string = "mufnfah"
position = 4

Create an empty 2D array of dimension len(string) * len(string)
repeat len(string) times
  insert string as a new first column, shifting the existing columns right
  sort rows
Return the row at position

1st iteration
 insert     sort
m......    a......
u......    f......
f......    f......
n......    h......
f......    m......
a......    n......
h......    u......

2nd iteration
 insert     sort
ma.....    an.....
uf.....    ff.....
ff.....    fm.....
nh.....    hu.....
fm.....    ma.....
an.....    nh.....
hu.....    uf.....

...

6th iteration
 insert     sort
manhuf.    anhuff.
uffman.    ffmanh.
ffmanh.    fmanhu.
nhuffm.    huffma.
fmanhu.    manhuf.
anhuff.    nhuffm.
huffma.    uffman.

7th iteration
 insert     sort
manhuff    anhuffm
uffmanh    ffmanhu
ffmanhu    fmanhuf
nhuffma    huffman  <-- Result at row 4
fmanhuf    manhuff
anhuffm    nhuffma
huffman    uffmanh
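
A sketch of the reverse transform in Python, following the insert-and-sort procedure above (the names are our own; the position counts from 0 here, matching row 4 in the diagram):

def bwt_reverse(last_column, position):
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Insert the transformed string as a new first column, then sort.
        rows = sorted(last_column[i] + rows[i] for i in range(len(last_column)))
    return rows[position]

print(bwt_reverse("mufnfah", 3))  # huffman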