Huffman coding

A simple way of reducing the size of a block of data is to replace the most common words with shorter ones that are not already used in the data. It is possible to reconstruct the original data using a code book that lists the short words and the longer words they replace. The problem is to figure out which word combination will give the best possible compression.

Generating the Huffman codes using a binary tree

David A. Huffman found the optimal way of generating such codes, guaranteeing the shortest possible output when each symbol is assigned a fixed bit pattern. It is done by creating a binary tree. Each node in the tree has at most two child nodes, distinguished as "left" and "right". Nodes with children are parent nodes, and child nodes contain links to their parents.

Make a table with one node for each symbol
Find the frequency of each symbol in the data

do while there are at least two active nodes
  Find the two active nodes with the lowest frequency, these are the children
  Create a new parent node and link it to each of the two child nodes, let it contain the sum of the two children
  Mark the two child nodes as inactive
loop
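
A minimal sketch of this in Python (the heap representation and the tuple-shaped nodes are choices of this sketch, not part of the original description):

import heapq
from collections import Counter

def build_huffman_tree(data):
    # One active node per symbol, weighted by its frequency in the data.
    # A heap entry is (frequency, tiebreaker, node); a node is either a
    # symbol (leaf) or a (left, right) pair (parent).
    heap = [(f, i, s) for i, (s, f) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # The two active nodes with the lowest frequency become children.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # The new parent contains the sum of the two children; popping the
        # children off the heap is what marks them inactive.
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    return heap[0][2]  # the root node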

Example

"Hello_World"

Symbol   Frequency
---------------
     H - 1
     e - 1
     W - 1
     r - 1
     d - 1
     _ - 1
     o - 2
     l - 3


Each symbol by frequency in ascending order from left to right:

H e W r d _ o l  ← Symbols
1 1 1 1 1 1 2 3  ← Leaf nodes
\ / \ / \ / | |
 2   2   2  | |
  \ /     \ / |
   4       4  |
    \       \ /  ← Branch
     \       7
      \     /
       \   /
         11      ← Root node

The root node will contain the sum of all leaf nodes, which is the same as the length of the message. This can be used as a checksum to make sure that all nodes have been visited. Now we can generate the code for each symbol by assigning 0 to the left branch and 1 to the right branch.

Symbol   Code
---------------
     H - 000
     e - 001
     W - 010
     r - 011
     d - 1000
     _ - 1001
     o - 101
     l - 11
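
The codes above can be generated by walking the tree, assigning 0 for the left branch and 1 for the right. A minimal sketch, reusing the tuple-shaped nodes assumed in build_huffman_tree above (the exact codes depend on how frequency ties are broken, so they may differ from the table while the total length stays the same):

def assign_codes(node, prefix="", table=None):
    # Depth-first walk: append 0 going left, 1 going right.
    if table is None:
        table = {}
    if isinstance(node, tuple):              # parent node: (left, right)
        assign_codes(node[0], prefix + "0", table)
        assign_codes(node[1], prefix + "1", table)
    else:                                    # leaf node: a symbol
        table[node] = prefix
    return table

codes = assign_codes(build_huffman_tree("Hello_World"))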

The codes have the property that no code is a prefix of another code. This means that during decoding we don't need to know the length of the next code; we just keep reading bits until we have enough to identify a leaf node in the tree.

If we encode the original text we get:

H   e   l  l  o   _    W   o   r   l  d  
000 001 11 11 101 1001 010 101 011 11 1000

The original message was 11 * 8 = 88 bits long and the compressed message is 32 bits long, a compression ratio of 88 / 32 = 2.75 : 1. This seems very good at first, but we need to include the Huffman tree in the message, otherwise we would not be able to recover the original.

Storing the Huffman tree efficiently

Storing the symbol, the code and the code length takes quite a lot of room and reduces the compression ratio. It turns out that there are many different Huffman trees that give the same compression ratio. To see why, take any two codes of the same length and swap them; this does not affect the compression ratio. If two symbols of the same frequency have different code lengths, swapping their codes does not change the compression ratio either. This redundancy is why storing the complete tree is inefficient.

It turns out that it is possible to reconstruct an optimal tree from just knowing the code lengths of each symbol.

Sort all symbols by code length in ascending order

code = 0
for each symbol
    print symbol, code
    code = (code + 1) << ((next code length) - (current code length))
loop

By using this algorithm in both the compressor and the decompressor we are guaranteed to generate the same tree in both, and we only have to store the code length for each symbol. When coding a relatively small range of symbols there is no need to store the symbols themselves: the code length for a symbol is stored at the index that equals the symbol. So if each symbol is a byte we just store 256 code lengths; each fits in 4 bits, so 128 bytes in total.
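
A minimal sketch of this reconstruction in Python (the tie-breaking rule for equal code lengths is an assumption of the sketch; any rule works as long as compressor and decompressor use the same one):

def canonical_codes(lengths):
    # lengths maps symbol -> code length, e.g. {"l": 2, "o": 3, ...}
    # Sort by code length, then by symbol, so both sides agree on ties.
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes = {}
    code = 0
    for i, sym in enumerate(symbols):
        # The code, left-padded with zeros to its code length.
        codes[sym] = format(code, f"0{lengths[sym]}b")
        if i + 1 < len(symbols):
            # Shifting appends zeros when moving to a longer code length.
            code = (code + 1) << (lengths[symbols[i + 1]] - lengths[sym])
    return codes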

Fast decompression

Reading the compressed data bit by bit and using it to navigate the tree is very slow. Since no code is a prefix of another code, it is possible to make a table, indexed by the next few bits of input, that contains the symbol and the length of its code. That way decompression is just a very fast table lookup.

Imagine the code 101 and a maximum code length of 12 bits. Fill the 512 table entries of the form 101xxxxxxxxx with the symbol for code 101. Then any lookup with an index that starts with the bits 101 is guaranteed to return the correct symbol. The code length stored in the table is used to shift the variable that holds the incoming bits so that the next code is ready to be looked up.
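
A minimal sketch of building such a lookup table, assuming the codes are bit strings as produced by canonical_codes above and a 12-bit maximum code length:

MAX_BITS = 12

def build_decode_table(codes):
    # codes maps symbol -> bit string, e.g. {"o": "101", ...}
    table = [None] * (1 << MAX_BITS)
    for sym, code in codes.items():
        # Every MAX_BITS-wide index that starts with this code maps to
        # the same (symbol, code length) pair.
        base = int(code, 2) << (MAX_BITS - len(code))
        for fill in range(1 << (MAX_BITS - len(code))):
            table[base + fill] = (sym, len(code))
    return table

To decode, look up the next 12 bits of input, emit the symbol, and advance the bit stream by the stored code length.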

Transforms to increase efficiency

Consider a string of the form 11111111222222223333333344444444. Huffman coding cannot exploit its obvious regularity, because it only looks at symbol frequencies and every symbol here is equally frequent. The key is to find a simple transform that turns the regularity in the data into irregularity in the symbol frequencies.

Move-to-front

Keep a list of all symbols; for each symbol transformed, output its index in the list, then move that symbol to the front of the list. Common symbols will stay close to the front of the list and be represented by small values.

Variations of the transform move the symbol only part of the way towards the front, so that rare symbols will not push a very common symbol out of the top position.

create table of symbols

for each symbol
  search table for symbol
  print index of found symbol
  move range 0 to found symbol down one cell (overwrite found symbol)
  insert found symbol at index 0  
loop

10000000200000003000000040000000
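
A minimal sketch in Python, assuming the table initially holds the digits in order; it reproduces the example output above:

def move_to_front(data, alphabet="0123456789"):
    table = list(alphabet)              # create table of symbols
    out = []
    for sym in data:
        i = table.index(sym)            # search table for symbol
        out.append(i)                   # print index of found symbol
        table.insert(0, table.pop(i))   # move found symbol to index 0
    return out

# move_to_front("11111111222222223333333344444444")
# -> 1,0,0,0,0,0,0,0,2,0,... i.e. 10000000200000003000000040000000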

EOR

Storing the exclusive or of each byte and the byte before it.

previous = 0
for each symbol
  print symbol eor previous
  previous = symbol
loop

10000000300000001000000070000000

Delta

Storing the difference between each byte.

previous = 0
for each symbol
  print symbol - previous
  previous = symbol
loop

10000000100000001000000010000000
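
A minimal sketch of both filters in Python, assuming the data is a sequence of byte values 0-255 (the delta is reduced modulo 256 so the output stays a byte, a detail the pseudocode leaves open):

def eor_filter(data):
    # Exclusive or of each byte and the byte before it (0 before the first).
    prev, out = 0, []
    for b in data:
        out.append(b ^ prev)
        prev = b
    return out

def delta_filter(data):
    # Difference between each byte and the byte before it, modulo 256.
    prev, out = 0, []
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return out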

Burrows-Wheeler

This is a remarkable transform that sorts the symbols and then stores, for each sorted symbol, the symbol that was in front of it in the original data. If followed by move-to-front it is very efficient.

Forward transform

Create all rotations of the input string "huffman"
Sort the rows alphabetically
Return the last column and the new position of the input string 

huffman        anhuffm
uffmanh        ffmanhu
ffmanhu  sort  fmanhuf
fmanhuf   -->  huffman  <-- New position (4)
manhuff        manhuff
anhuffm        nhuffma
nhuffma        uffmanh
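
A minimal sketch of the forward transform in Python, returning the 1-based row position used in the example:

def bwt_forward(s):
    # Create and sort all rotations of the input.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(row[-1] for row in rotations)
    position = rotations.index(s) + 1   # 1-based, as in the example
    return last_column, position

# bwt_forward("huffman") -> ("mufnfah", 4)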

Reverse transform

string = "mufnfah"
position = 4
Create an empty 2D array of dimension len(string) * len(string)
for len(string)
  insert string at first column
  sort
next
Return row at position 

1st iteration
 insert     sort
m......    a......
u......    f......
f......    f......
n......    h......
f......    m......
a......    n......
h......    u......

2nd iteration
 insert     sort
ma.....    an.....
uf.....    ff.....
ff.....    fm.....
nh.....    hu.....
fm.....    ma.....
an.....    nh.....
hu.....    uf.....

...

6th iteration
 insert     sort
manhuf.    anhuff.
uffman.    ffmanh.
ffmanh.    fmanhu.
nhuffm.    huffma.
fmanhu.    manhuf.
anhuff.    nhuffm.
huffma.    uffman.

7th iteration
 insert     sort
manhuff    anhuffm
uffmanh    ffmanhu
ffmanhu    fmanhuf
nhuffma    huffman  <-- Answer at row 4
fmanhuf    manhuff
anhuffm    nhuffma
huffman    uffmanh
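
The data we have is the last column; if we sort it we get the first column. We also know that in every row the last-column symbol is immediately followed by the first-column symbol in the original string, and that the rows are in sorted order. That is enough information to reconstruct the original table.

A minimal sketch of the reverse transform in Python, following the insert-and-sort procedure above (position is 1-based, as in the example):

def bwt_reverse(last_column, position):
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):   # one round per column
        # Insert the transformed string as a new first column, then sort.
        rows = sorted(last_column[i] + rows[i] for i in range(len(last_column)))
    return rows[position - 1]

# bwt_reverse("mufnfah", 4) -> "huffman"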