Huffman coding

From ScienceZero
Revision as of 07:00, 25 January 2011

A simple way of reducing the size of a block of data is to replace the most common words with shorter ones that are not already used in the data. It is possible to reconstruct the original data using a dictionary that lists the short words and the longer words they replace.

The problem is deciding which codes to use. David A. Huffman found an optimal way of generating codes that guarantees the shortest possible output for a given set of symbol frequencies.

Count the number of times each word occurs in the data; this gives a histogram. Then:

do
  Find the two active nodes with the lowest counts; these become children
  Create a new parent node linked to both children, containing the sum of their counts
  Mark the two child nodes as inactive
loop until only one active node remains

We will now have a binary tree structure.
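The loop above can be sketched in Python using the standard-library heapq and collections modules (an illustrative sketch, not part of the original article; the name build_tree is just a label):

```python
import heapq
from collections import Counter

def build_tree(message):
    # Histogram of symbol frequencies: these are the leaf nodes.
    counts = Counter(message)
    # Heap entries are (count, tiebreaker, node); a node is either a
    # symbol (a leaf) or a (left, right) pair (an internal node).
    # The tiebreaker keeps the heap from ever comparing two nodes.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Take the two active nodes with the lowest counts...
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # ...and replace them with a parent holding the sum of their counts.
        heapq.heappush(heap, (n1 + n2, tiebreak, (left, right)))
        tiebreak += 1
    return heap[0]  # (total count, tiebreaker, root node)

total, _, root = build_tree("Hello_World")
# The root's count equals the message length, usable as a checksum.
print(total)  # → 11
```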

Example

"Hello_World"

Each letter by frequency in ascending order from left to right:

H e W r d _ o l
1 1 1 1 1 1 2 3 ← Leaf nodes
\ / \ / \ / | |
 2   2   2  | |
  \ /     \ / |
   4       4  |
    \       \ /
     \       7
      \     /
       \   /
         11 ← Root node

The final node contains the sum of all the leaf nodes, which is the same as the length of the message; this can be used as a checksum to make sure that every node has been visited. Now we can find the code for each letter: assign 0 for each left branch and 1 for each right branch, reading from the root down to the leaf.

H - 000
e - 001
W - 010
r - 011
d - 1000
_ - 1001
o - 101
l - 11
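Extracting the codes can be sketched as a recursive walk over the tree, appending 0 on the left branch and 1 on the right. Ties between equal counts mean a sketch like this may assign different (but equally short) codes than the table above; the total encoded length comes out the same either way:

```python
import heapq
from collections import Counter

def huffman_codes(message):
    # Build the tree exactly as in the loop described earlier.
    counts = Counter(message)
    heap = [(n, i, sym) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, tiebreak, (a, b)))
        tiebreak += 1
    root = heap[0][2]
    # Walk from the root: 0 for the left branch, 1 for the right.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a symbol
            codes[node] = prefix
    walk(root, "")
    return codes

codes = huffman_codes("Hello_World")
# Encoded length = sum over the message of each symbol's code length.
print(sum(len(codes[c]) for c in "Hello_World"))  # → 32 bits
```

32 bits matches the table above: for example "l" occurs 3 times at 2 bits each, and "d" once at 4 bits.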

The codes have the prefix property: no code is a prefix of another code. That means that during decoding we don't need to know the length of the next code; we just keep reading bits until they identify a leaf node in the tree.
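A minimal decoding sketch using the code table from the example above: because the codes are prefix-free, a symbol can be emitted as soon as the accumulated bits match a code.

```python
# Code table from the example; maps each symbol to its bit string.
CODES = {"H": "000", "e": "001", "W": "010", "r": "011",
         "d": "1000", "_": "1001", "o": "101", "l": "11"}

def encode(message):
    # Concatenate the codes; no separators are needed.
    return "".join(CODES[c] for c in message)

def decode(bits):
    by_code = {code: sym for sym, code in CODES.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in by_code:   # the bits read so far reached a leaf
            out.append(by_code[buf])
            buf = ""
    return "".join(out)

print(decode(encode("Hello_World")))  # → Hello_World
```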