Huffman coding
Revision as of 08:51, 25 January 2011

A simple way of reducing the size of a block of data is to replace the most common words with shorter ones that are not already used in the data. The original data can be reconstructed using a dictionary that lists the short words and the longer words they replace.

The problem is deciding which codes to use. David A. Huffman found an optimal way of generating the codes, one that guarantees the shortest possible output.

Count the number of times each symbol occurs in the data; this gives a histogram. Each symbol and its count becomes an active leaf node, then:

do
  Find the two active nodes with the lowest count, these are the children
  Create a new parent node and link it to each of the two child nodes, let it contain the sum of the two children
  Mark the two child nodes as inactive
loop until only one active node remains
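The loop above can be sketched in Python using a priority queue. This is a minimal sketch, not the article's own code; the name huffman_codes and the tuple representation of the tree are our assumptions.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(data):
    """Build a Huffman tree from symbol counts and return {symbol: code}."""
    freq = Counter(data)               # the histogram from the step above
    tick = count()                     # tie-breaker so trees are never compared
    # Active nodes: (count, tie, tree); a tree is a symbol or a (left, right) pair.
    heap = [(n, next(tick), sym) for sym, n in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    while len(heap) > 1:
        # Find the two active nodes with the lowest count ...
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # ... and link them under a new parent containing the sum of the two.
        heapq.heappush(heap, (n1 + n2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # parent node
            walk(node[0], prefix + "0")    # left branch -> 0
            walk(node[1], prefix + "1")    # right branch -> 1
        else:                              # leaf node holds a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

Tie-breaking between equal counts can produce different trees than the hand-drawn example below, but every Huffman tree yields the same total compressed length.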

We now have a data structure called a binary tree. Each node has at most two child nodes, distinguished as "left" and "right". Nodes with children are parent nodes, and each child node holds a link to its parent.

Example

"Hello_World"

Each letter by frequency in ascending order from left to right:

H e W r d _ o l
1 1 1 1 1 1 2 3 ← Leaf nodes
\ / \ / \ / | |
 2   2   2  | |
  \ /     \ / |
   4       4  |
    \       \ /
      \       7
      \     / 
       \   /
         11  ← Root node

The root node contains the sum of all leaf nodes, which is the same as the length of the message. This can be used as a checksum to make sure that all nodes have been visited. Now we can find the code for each letter: assign 0 for a left branch and 1 for a right branch.

H - 000
e - 001
W - 010
r - 011
d - 1000
_ - 1001
o - 101
l - 11

The codes have the property that no code is a prefix of another code. This means that during decoding we do not need to know the length of the next code; we just keep reading bits until the pattern identifies a leaf node in the tree.

If we encode the original text we get:

H   e   l  l  o        W   o   r   l  d  
000 001 11 11 101 1001 010 101 011 11 1000
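Using the code table above, encoding and the prefix-based decoding can be sketched as follows (the function names are ours, not from the article):

```python
# The code table derived from the "Hello_World" tree above
CODES = {'H': '000', 'e': '001', 'W': '010', 'r': '011',
         'd': '1000', '_': '1001', 'o': '101', 'l': '11'}

def encode(text):
    """Concatenate the code for each letter into one bit string."""
    return "".join(CODES[c] for c in text)

def decode(bits):
    """Read bits until they match exactly one code, emit that symbol, repeat.
    This works because no code is a prefix of another code."""
    inverse = {code: sym for sym, code in CODES.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```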

The original message was 11 * 8 = 88 bits long and the compressed message is 32 bits long, a compression ratio of 88 / 32 = 2.75 : 1. This seems very good at first, but we also need to include the Huffman tree in the message, otherwise we would not be able to recover the original.

Storing the Huffman tree efficiently

Transforms to increase efficiency

Consider a string of the form 11111111222222223333333344444444. It would not compress with Huffman coding because each symbol has the same frequency in the string. The key is to find a simple transform that turns the regularity in the data into making some symbols much more frequent.

Move-to-front

Keep a list of all symbols and output the index of each symbol; when a symbol is used, move it to the front of the list. Common symbols will stay near the front of the list and be represented by small values.

Variations of the transform move the symbol only part of the way towards the front, so that rare symbols will not push a very common symbol away from the front position.
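A minimal sketch of the basic transform in Python (the function name is ours); applied to the string above, almost every output value becomes 0, which Huffman coding then compresses well:

```python
def move_to_front(data, alphabet=None):
    """Output the index of each symbol, then move that symbol to the front."""
    symbols = list(alphabet if alphabet is not None else sorted(set(data)))
    out = []
    for s in data:
        i = symbols.index(s)
        out.append(i)
        symbols.insert(0, symbols.pop(i))  # move the used symbol to the front
    return out
```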

EOR

Store the exclusive or (EOR) of each byte and the byte before it. Applied to the string above, this gives 10000000300000001000000070000000.
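A sketch of the transform (our function name; the first byte is XORed with 0, so it keeps its value):

```python
def eor_transform(data):
    """XOR each byte with the byte before it."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append(b ^ prev)  # identical neighbours become 0
        prev = b
    return bytes(out)
```

The transform is its own size; reversing it is a running XOR over the output.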

Delta

Store the difference between each byte and the byte before it. Applied to the string above, this gives 10000000100000001000000010000000.
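A sketch of the delta transform (our function name; the subtraction wraps modulo 256 so every difference still fits in one byte):

```python
def delta_transform(data):
    """Store the difference between each byte and the one before it, mod 256."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) & 0xFF)  # runs of equal bytes become runs of 0
        prev = b
    return bytes(out)
```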

Burrows-Wheeler

This is a remarkable transform that sorts all rotations of the data and then stores, for each sorted rotation, the symbol that was in front of it in the original data. If followed by move-to-front it is very efficient.
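The forward transform can be sketched as follows. This is a naive O(n² log n) sketch, not a production implementation; the function name and the end-of-string marker are our assumptions (a unique marker makes the transform invertible):

```python
def bwt(text, eof="\0"):
    """Burrows-Wheeler transform: sort all rotations of the text and keep
    the last column, i.e. the symbol in front of each sorted rotation."""
    s = text + eof  # unique end marker, assumed absent from the input
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

Repeated contexts sort next to each other, so the output tends to contain long runs of the same symbol, which is exactly what move-to-front turns into small values.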