IEEE 754

Objectives

We would like to:

Develop a basic understanding of the IEEE 754 format.
Discuss how floating-point mathematics might be more difficult than integer mathematics.

Notes

This is from Carpinelli, page 32 of the PDF (page 1-12 in the document).
IEEE stands for Institute of Electrical and Electronics Engineers.
Early History Distilled
- There were many different formats.
- Precision and gap (the spacing between adjacent representable numbers) differed across all of these.
- Moving from one machine to another caused all kinds of problems with the results.
IEEE 754 standardized this in 1985.
Numbers are mostly represented as in scientific notation.
- Given a number +3.24x10^-15
  - The leading + is the sign, +/-
  - The 3.24 is the mantissa
  - The -15 is the exponent
- In scientific notation, we normally require the mantissa to be at least 1 and less than 10.
- This is called normalized.
When we switch to base 2 with a fixed number of bits, things change.
- We call the mantissa the significand, a normalized mantissa.
- The significand almost always starts with a 1: in base 2 the only nonzero digit is 1, so a normalized significand has the form 1.xxx.
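As a rough illustration of base-2 normalization, the sketch below uses std::frexp from <cmath> to split a value into a significand of the form 1.xxx and a power of two. The names Normalized and normalize_base2 are made up for this example, and the sign is dropped for simplicity.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type for illustration only.
struct Normalized {
    double significand;  // always in the range [1, 2)
    int exponent;        // power of two, so |x| = significand * 2^exponent
};

// Normalize the magnitude of x in base 2. The sign is ignored here.
Normalized normalize_base2(double x) {
    assert(x != 0.0 && std::isfinite(x));  // zero, infinity and NaN have no 1.xxx form

    int e = 0;
    // std::frexp returns a fraction in [0.5, 1) with |x| = frac * 2^e.
    double frac = std::frexp(std::fabs(x), &e);

    // Shift one binary place to get the 1.xxx form used by IEEE 754.
    return { frac * 2.0, e - 1 };
}
```

For example, normalize_base2(6.5) gives a significand of 1.625 and an exponent of 2, since 6.5 = 1.625 x 2^2.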
Representation
- In 32 bits
  - The first bit is a sign bit.
    - 0 for +, 1 for -
    - The actual computation is (-1)^bit
    - (-1)⁰ = 1
    - (-1)¹ = -1
  - The exponent is represented in 8 bits.
    - But it is represented in excess 127 format
    - Subtract 127 to find the actual exponent
    - The bit pattern 01111111 is 0, since 127-127 = 0
    - The bit pattern 10000000 is 1, since 128-127 = 1
    - The bit pattern 01111110 is -1, since 126-127 = -1
    - The bit pattern 00000000 is a special case, see below.
    - The bit pattern 11111111 is a special case, see below.
  - The next 23 bits represent the significand, but
    - Since the number is 1.xxx, the leading 1 is implied and is not stored.
  - Please note, his diagrams in the book are wrong
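- When the exponent field is all zeros (the pattern 00000000) the number is special
  - This is called a denormal or subnormal
  - It represents a number very close to 0
  - The representation is 0.bitpattern x 2^-126
  - If the significand bits are also all 0, the value is zero.
  - These are apparently slow on some hardware and can be turned off (flushed to zero).
- When the exponent field is all ones (the pattern 11111111) the number is also special
  - If the significand bits are all 0, the number is infinity
  - Otherwise it is NaN (not a number)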
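The field layout can be checked on a real machine. The following is a minimal sketch, assuming float is a 32-bit IEEE 754 single (checked with static_assert) and using the standard single-precision bias of 127 for the stored exponent. FloatFields, split_float, and value_of are hypothetical names for this example and are not taken from float.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 floats");

// Hypothetical helper type for illustration only.
struct FloatFields {
    std::uint32_t sign;      // 1 bit: 0 for +, 1 for -
    std::uint32_t exponent;  // 8-bit stored (biased) exponent
    std::uint32_t fraction;  // 23 stored significand bits
};

// Copy the raw bits out of the float and slice them into the three fields.
FloatFields split_float(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// Rebuild the numeric value from the fields (finite values only).
double value_of(const FloatFields& p) {
    assert(p.exponent != 255);  // all ones means infinity or NaN, not handled here

    double sign = p.sign ? -1.0 : 1.0;
    double frac = p.fraction / 8388608.0;  // divide by 2^23 to get 0.xxx

    if (p.exponent == 0) {
        // Subnormal (or zero): no implied 1, fixed exponent of -126.
        return sign * std::ldexp(frac, -126);
    }
    // Normal: implied leading 1, exponent is the stored value minus 127.
    return sign * std::ldexp(1.0 + frac, static_cast<int>(p.exponent) - 127);
}
```

For example, split_float(1.0f) yields sign 0, exponent 127, fraction 0, and split_float(-2.0f) yields sign 1, exponent 128, fraction 0. Passing either result to value_of returns the original number.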
See float.cpp.