Floating-point

From ScienceZero

Almost all modern computers approximate real numbers using floating-point arithmetic as defined in the IEEE 754 standard.
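
The word "approximate" matters in practice: most decimal fractions have no exact binary representation. The short sketch below illustrates this, assuming a platform where Python's float is an IEEE 754 double (true of common implementations). The helper name to_single is made up for this example and is not part of any library.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float to single precision and convert it back.

    Hypothetical helper for illustration; it packs the value as a
    32-bit float and unpacks it again, which discards the extra bits.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]

total = 0.1 + 0.2

print(total == 0.3)            # False: neither 0.1, 0.2 nor 0.3 is stored exactly
print(to_single(0.1) == 0.1)   # False: single precision keeps fewer fraction bits
print(to_single(0.5) == 0.5)   # True: 0.5 is a power of two and is stored exactly
```

Values that are sums of a few powers of two survive the round trip unchanged; everything else is rounded to the nearest representable number.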


Single Precision

The IEEE 754 single precision number requires 32 bits of storage.

 0 1      8 9                     31
 S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
  • S - Sign bit
  • E - Exponent
  • F - Fraction


The value of the 32 bit word:

  • If E = 255 and F is nonzero, then V = NaN ("Not a number")
  • If E = 255 and F is zero and S is 1, then V = -Infinity
  • If E = 255 and F is zero and S is 0, then V = Infinity
  • If 0<E<255 then V = (-1)**S * 2 ** (E-127) * (1.F) where "1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E = 0 and F is nonzero, then V = (-1)**S * 2 ** (-126) * (0.F) where "0.F" is F prefixed with a leading 0 and a binary point. These are "denormalized" (also called subnormal) values.
  • If E = 0 and F is zero and S is 1, then V = -0
  • If E = 0 and F is zero and S is 0, then V = 0
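
The rules above can be sketched as a small decoder. This is a minimal illustration and not a reference implementation: the function name decode_single is hypothetical, the input is assumed to be the 32 bit word given as an unsigned integer, and the sign bit (bit 0 in the diagram) is assumed to be the most significant bit of that integer.

```python
import math

def decode_single(word: int) -> float:
    """Decode a 32 bit IEEE 754 single precision word given as an unsigned int."""
    s = (word >> 31) & 0x1        # S: 1 sign bit (most significant bit)
    e = (word >> 23) & 0xFF       # E: 8 exponent bits
    f = word & 0x7FFFFF           # F: 23 fraction bits
    sign = -1.0 if s else 1.0

    if e == 255:
        # All exponent bits set: NaN if F is nonzero, otherwise +/- Infinity
        return math.nan if f else sign * math.inf
    if e == 0:
        # Zero or denormalized value: (0.F) * 2**(-126); F = 0 gives +0 or -0
        return sign * (f / 2**23) * 2.0**-126
    # Normalized value: (1.F) * 2**(E - 127)
    return sign * (1 + f / 2**23) * 2.0**(e - 127)

print(decode_single(0x3F800000))  # 1.0
print(decode_single(0xC0000000))  # -2.0
```

The result is returned as a Python float, which is normally a double, so every single precision value is represented exactly.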

Double Precision

The IEEE 754 double precision number requires 64 bits of storage.

 0 1         11 12                                                  63
 S EEEEEEEEEEE   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • S - Sign bit
  • E - Exponent
  • F - Fraction


The value of the 64 bit word:

  • If E = 2047 and F is nonzero, then V = NaN ("Not a number")
  • If E = 2047 and F is zero and S is 1, then V = -Infinity
  • If E = 2047 and F is zero and S is 0, then V = Infinity
  • If 0<E<2047 then V = (-1)**S * 2 ** (E-1023) * (1.F) where "1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E = 0 and F is nonzero, then V = (-1)**S * 2 ** (-1022) * (0.F) where "0.F" is F prefixed with a leading 0 and a binary point. These are "denormalized" (also called subnormal) values.
  • If E = 0 and F is zero and S is 1, then V = -0
  • If E = 0 and F is zero and S is 0, then V = 0
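
The same rules can be sketched for the 64 bit word. As before, this is only an illustration under stated assumptions: the function name decode_double is hypothetical, the word is passed as an unsigned integer, and the sign bit (bit 0 in the diagram) is taken to be the most significant bit. The standard library function math.ldexp(x, i), which computes x * 2**i, is used for the scaling step.

```python
import math

def decode_double(word: int) -> float:
    """Decode a 64 bit IEEE 754 double precision word given as an unsigned int."""
    s = (word >> 63) & 0x1            # S: 1 sign bit (most significant bit)
    e = (word >> 52) & 0x7FF          # E: 11 exponent bits
    f = word & ((1 << 52) - 1)        # F: 52 fraction bits
    sign = -1.0 if s else 1.0

    if e == 2047:
        # All exponent bits set: NaN if F is nonzero, otherwise +/- Infinity
        return math.nan if f else sign * math.inf
    if e == 0:
        # Zero or denormalized value: (0.F) * 2**(-1022); F = 0 gives +0 or -0
        return sign * math.ldexp(f / 2**52, -1022)
    # Normalized value: (1.F) * 2**(E - 1023)
    return sign * math.ldexp(1 + f / 2**52, e - 1023)

print(decode_double(0x3FF0000000000000))  # 1.0
print(decode_double(0xC000000000000000))  # -2.0
```

Apart from the field widths and the exponent bias (1023 instead of 127), the structure is identical to the single precision case.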