Huffman coding

From ScienceZero
Latest revision as of 16:10, 13 December 2015

A simple way of reducing the size of a block of data is to replace the most common words with shorter ones that are not already used in the data. It is possible to reconstruct the original data using a code book that lists the short words and the longer words they replace. The problem is to figure out which word combination will give the best possible compression.

Generating the Huffman codes using a binary tree

David A. Huffman found a way of generating optimal codes that guarantee the shortest possible output. It is done by creating a binary tree. Each node in the tree has at most two child nodes, distinguished as "left" and "right". Nodes with children are parent nodes; child nodes contain links to their parents.

Make a table with one node for each symbol
Find the frequency of each symbol in the data

while there are at least two active nodes
  Find the two active nodes with the lowest frequency, these are the children
  Create a new parent node and link it to each of the two child nodes, let it contain the sum of the two children
  Mark the two child nodes as inactive
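As a sketch, the loop above can be implemented with a min-heap, which replaces the scan for the two lowest-frequency active nodes (function name is illustrative):

```python
import heapq
from collections import Counter

def build_code_lengths(data):
    """Build a Huffman tree and return each symbol's code length.

    A min-heap replaces the 'find the two active nodes with the
    lowest frequency' step; popped nodes are implicitly inactive."""
    freq = Counter(data)
    # Each heap entry: (frequency, tie breaker, list of (symbol, depth))
    heap = [(f, i, [(sym, 0)]) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two nodes adds one bit to every code below the new parent
        merged = [(sym, d + 1) for sym, d in left + right]
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return dict(heap[0][2])
```

The depth of each leaf in the finished tree is that symbol's code length.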

Example

"Hello_World"

Symbol   Frequency
---------------
     H - 1
     e - 1
     W - 1
     r - 1
     d - 1
     _ - 1
     o - 2
     l - 3


Each symbol by frequency in ascending order from left to right:

H e W r d _ o l ← Symbols
1 1 1 1 1 1 2 3 ← Leaf nodes
\ / \ / \ / | |
 2   2   2  | |
  \ /     \ / |
   4       4  | 
    \       \ / ← Branch
     \       7
      \     / 
       \   /
         11  ← Root node

The root node will contain the sum of all leaf nodes, which is the same as the length of the message. This can be used as a checksum to make sure that all nodes have been visited. Now we can generate the code for each symbol: we assign 0 for the left branch and 1 for the right branch.

Symbol   Code
---------------
     H - 000
     e - 001
     W - 010
     r - 011
     d - 1000
     _ - 1001
     o - 101
     l - 11

The codes have the property that no code is a prefix of another code (a prefix-free code). This means that during decoding we don't need to know the length of the next code; we just keep reading bits until they identify a leaf node in the tree.

If we encode the original text we get:

H   e   l  l  o   _    W   o   r   l  d  
000 001 11 11 101 1001 010 101 011 11 1000

The original message was 11 * 8 = 88 bits long and the compressed message is 32 bits long, a compression ratio of 88 / 32 = 2.75 : 1. This seems very good at first, but we need to include the Huffman tree in the message, otherwise we would not be able to recover the original.

Storing the Huffman tree efficiently

Storing the symbol, the code and the code length takes quite a lot of room and reduces the compression ratio. It turns out that there are many different Huffman trees that give the same compression ratio. To see why, take any two codes of the same length and swap them; the compression ratio is unaffected. If two symbols of the same frequency have different code lengths, swapping their codes does not change the compression ratio either. This redundancy is why storing the complete tree is inefficient.

It turns out that it is possible to reconstruct an optimal tree from just knowing the code lengths of each symbol.

Sort all symbols by code length in ascending order

code = 0
for each symbol
    print symbol, code, length
    code = (code + 1) << ((next symbol code length) - (current code length))
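A Python sketch of this reconstruction, using the shift rule above (ties among equal code lengths are broken by symbol, which compressor and decompressor must agree on):

```python
def canonical_codes(lengths):
    """Rebuild codes from a {symbol: code length} mapping.

    Symbols are sorted by code length (ties broken by symbol), then
    codes are assigned with code = (code + 1) << (next len - cur len)."""
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes = {}
    code = 0
    for i, sym in enumerate(symbols):
        length = lengths[sym]
        codes[sym] = format(code, "0{}b".format(length))
        if i + 1 < len(symbols):
            code = (code + 1) << (lengths[symbols[i + 1]] - length)
    return codes
```

The codes differ from the ones read off the example tree, but the lengths (and hence the compression ratio) are identical.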

By using this algorithm in both the compressor and decompressor we are guaranteed to have the same tree generated in both and we only have to store the code lengths for each symbol. In the case of coding a relatively small range of symbols there is no need to store the symbols themselves. The code length is stored in the index that equals the symbol. So if each symbol is a byte we just store 256 code lengths, each can be stored in 4 bits, so 128 bytes in total.
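The 4-bit packing could be sketched like this (illustrative helper names; it assumes no code is longer than 15 bits):

```python
def pack_lengths(lengths):
    """Pack 256 code lengths (each 0..15) into 128 bytes, two per byte."""
    assert len(lengths) == 256 and all(0 <= n < 16 for n in lengths)
    return bytes((lengths[i] << 4) | lengths[i + 1] for i in range(0, 256, 2))

def unpack_lengths(packed):
    """Recover the 256 code lengths from the 128 packed bytes."""
    out = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out
```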

Fast decompression

Reading the compressed data bit by bit and using it to navigate the tree is very slow. Since no code is a prefix of any other code, it is possible to make a table indexed by every possible bit pattern of the maximum code length, containing the symbol that pattern starts with and the number of bits in that symbol's code. That way decompression is just a very fast table lookup.

Imagine a code 101 and maximum code length 12 bit. Fill the 512 addresses of the form 101xxxxxxxxx with the symbol for code 101. Then each lookup with an index that starts with the bits 101 is guaranteed to return the correct symbol. The code length stored in the table is used to shift the variable that contains the codes so that the next code is ready to be looked up.
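A sketch of the table construction and lookup, using bit strings for clarity (a real decoder would operate on a bit buffer; function names are illustrative):

```python
def build_decode_table(codes, max_len):
    """codes maps symbol -> code as a bit string. Every table index whose
    leading bits match a code gets that (symbol, code length) entry."""
    table = [None] * (1 << max_len)
    for sym, code in codes.items():
        base = int(code, 2) << (max_len - len(code))
        for i in range(1 << (max_len - len(code))):
            table[base + i] = (sym, len(code))
    return table

def decode(bits, table, max_len):
    """Decode a bit string by repeated table lookups, shifting past
    each decoded code by its stored length."""
    out, pos = [], 0
    while pos < len(bits):
        window = bits[pos:pos + max_len].ljust(max_len, "0")  # pad final window
        sym, length = table[int(window, 2)]
        out.append(sym)
        pos += length
    return "".join(out)
```

With a maximum code length of 12 bits the table has 4096 entries, and the code 101 fills the 512 entries of the form 101xxxxxxxxx, exactly as described above.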

Transforms to increase efficiency

Consider a string of the form 11111111222222223333333344444444: Huffman coding cannot exploit its regularity because each symbol has the same frequency in the string. The key is to find a simple transform that turns the regularity in the data into an uneven symbol frequency.

Move-to-front

Keeps a list of all symbols and outputs the index of each symbol as it is encountered; that symbol is then moved to the front of the list. Common symbols will stay close to the front of the list and be represented by small values.

Variations of the transform can move the symbol only part of the way towards the front, so that rare symbols will not push a very common symbol out of the front position.

create table of symbols

for each symbol
  search table for symbol
  print index of found symbol
  move range 0 to found symbol down one cell (overwrite found symbol)
  insert found symbol at index 0  

10000000200000003000000040000000
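A minimal sketch of the transform, which reproduces the example output when the table is initialised with the digit alphabet:

```python
def move_to_front(data, alphabet):
    """Move-to-front transform; alphabet is the initial symbol table."""
    table = list(alphabet)
    out = []
    for sym in data:
        i = table.index(sym)       # search table for symbol
        out.append(i)              # output the index
        table.pop(i)               # remove it ...
        table.insert(0, sym)       # ... and reinsert at the front
    return out
```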

EOR

Storing the exclusive or of each byte and the byte before it.

previous = 0
for each symbol
  print symbol eor previous
  previous = symbol

10000000300000001000000070000000
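A sketch of the transform and its inverse on byte values:

```python
def eor_transform(data):
    """XOR each byte with the byte before it (previous starts at 0)."""
    prev, out = 0, []
    for b in data:
        out.append(b ^ prev)
        prev = b
    return out

def eor_inverse(data):
    """Undo the transform by XOR-ing the running previous value back in."""
    prev, out = 0, []
    for d in data:
        prev ^= d
        out.append(prev)
    return out
```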

Delta

Storing the difference between each byte and the byte before it.

previous = 0
for each symbol
  print symbol - previous
  previous = symbol

10000000100000001000000010000000
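A sketch of the delta transform and its inverse, wrapping the subtraction modulo 256 so the output still fits in a byte:

```python
def delta_transform(data):
    """Store the difference between each byte and the previous one (mod 256)."""
    prev, out = 0, []
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return out

def delta_inverse(data):
    """Undo the transform by keeping a running sum (mod 256)."""
    prev, out = 0, []
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return out
```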

Burrows-Wheeler

This is a remarkable transform that sorts all rotations of the data and stores, for each sorted rotation, the symbol that came before its first symbol in the original unsorted data. It is very efficient when followed by move-to-front.

Forward transform

Create all rotations of the input string "huffman"
Sort rows
Return the last column and the new row position of the input string 

huffman        anhuff m
uffmanh        ffmanh u
ffmanhu  sort  fmanhu f
fmanhuf   -->  huffma n  <-- New position is row 4
manhuff        manhuf f
anhuffm        nhuffm a
nhuffma        uffman h
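The forward transform can be sketched naively like this (quadratic in memory; real implementations use suffix sorting instead of materialising every rotation):

```python
def bwt_forward(s):
    """Naive Burrows-Wheeler transform: sort all rotations of s, return
    the last column and the 1-based row of the original string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(r[-1] for r in rotations)
    position = rotations.index(s) + 1  # 1-based, matching the example
    return last_column, position
```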

Reverse transform

string = "mufnfah"
position = 4

Create an empty 2D array of dimension len(string) * len(string)
for len(string)
  insert string at first column
  sort rows
Return row at position 

1st iteration
 insert     sort
m......    a......
u......    f......
f......    f......
n......    h......
f......    m......
a......    n......
h......    u......

2nd iteration
 insert     sort
ma.....    an.....
uf.....    ff.....
ff.....    fm.....
nh.....    hu.....
fm.....    ma.....
an.....    nh.....
hu.....    uf.....

...

6th iteration
 insert     sort
manhuf.    anhuff.
uffman.    ffmanh.
ffmanh.    fmanhu.
nhuffm.    huffma.
fmanhu.    manhuf.
anhuff.    nhuffm.
huffma.    uffman.

7th iteration
 insert     sort
manhuff    anhuffm
uffmanh    ffmanhu
ffmanhu    fmanhuf
nhuffma    huffman  <-- Result at row 4
fmanhuf    manhuff
anhuffm    nhuffma
huffman    uffmanh
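The reverse transform can be sketched the same naive way, repeatedly prepending the transformed column and re-sorting, exactly as in the iterations above (production implementations reconstruct the string from symbol counts instead):

```python
def bwt_reverse(last_column, position):
    """Naive inverse Burrows-Wheeler transform.

    After len(last_column) rounds of insert-and-sort, the rows are the
    sorted rotations; the answer sits at the stored 1-based position."""
    n = len(last_column)
    rows = [""] * n
    for _ in range(n):
        rows = sorted(last_column[i] + rows[i] for i in range(n))
    return rows[position - 1]
```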