# Floating Point Encoding

Encoding non-integer values requires the use of scientific or floating-point notation. Many floating-point systems exist, but one of the most common is the IEEE 754 standard.

A value encoded in floating-point format is composed of two major components: a mantissa and an exponent. The actual structure of these components is examined below for the IEEE 754 standard.

Floating-point values may be handled completely through software, by means of an extra "math co-processor" chip, or by complex instructions built into the main processor. Most modern desktop and laptop computers have floating point built into the CPU chip (e.g. AMD Athlon, Intel Pentium). Floating point emulated in software is much, much slower than hardware floating point, but the software can be written to retain greater precision or range than the native hardware.

## Fractional Values

Fractions can be handled in 3 different ways:
• The Ratio of 2 Integer Values - This is the first form most people consider when they think of the term "fraction" e.g. 2/7. Except for some specially developed software packages or very specialized mathematics co-processors, this form is not supported by computers.
• Fixed-Point Values - We often refer to this form as "decimal fractions", although the base-10 (decimal) system really has nothing to do with it (the same positional notation could be, and sometimes is, used with binary and hexadecimal represented values). A (decimal) point is used to separate the digit position with a weight of 1 from digit positions to its right, which represent values whose weights are 1/base (e.g. 1/10 for decimal) smaller than the weight to their immediate left. A specific fixed-point value is always coded with the same number of digits to the right of the (decimal) point. Computers store this kind of "fraction" internally in the same way as they store integer values and divide by an appropriate power of the base when combining this value with other numbers or when performing I/O. For example, "dollar values" may often be stored as fixed point values with 2 digits to the right of the decimal point; in fact, the value is stored as an integer number of cents and only divided by 100 for output display.
• Floating-Point Values - Often when using numbers to represent "real world" measurements, we know that our values are not perfectly accurate. We say that the distance between two towns is a certain number of kilometers, where even if we are correct in our statement to the nearest kilometer, we most certainly are not correct to the nearest millimeter. Furthermore, we don't care! The value we have given is "good enough" for our purposes. Whether some distance is 27 kilometers or 270 kilometers is a much more important question than whether some other distance is 300,000 kilometers or 301,000 kilometers. Differences in "scale" are more important than differences in "precision" at the same "scale". Floating-point values are a method for representing values in a manner that recognizes this difference in importance. "Scale" is treated as a value separate from some (approximated) "precision" value. "Scientific notation" is a special form of this, often used (apart from computer systems) by engineers and scientists. As an example, 301,000 would be represented in "scientific notation" as 3.01E5; the precision, in this case the 301 portion, is normally stated as a value with exactly one non-zero digit written to the left of the decimal point; the scale, in this case 5, indicates how many positions the decimal point needs to be moved to the right in order to match the intended "real world" measurement (negative scale values mean that the decimal point needs to move to the left). As the "scale" value changes, the decimal point moves around or "floats" within the "precision" value. In mathematical language, the "precision" value is called the "mantissa" and the "scale" value is called the "exponent".
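The fixed-point "integer number of cents" idea described above can be sketched in a few lines of Java (the class and variable names here are illustrative, not from any particular library):

```java
public class FixedPointCents {
    public static void main(String[] args) {
        // store $12.34 and $0.99 as integer counts of cents:
        // all arithmetic is exact integer arithmetic, no fractional error
        long priceCents = 1234;
        long taxCents = 99;
        long totalCents = priceCents + taxCents;

        // divide by the fixed scale factor (100) only at output time
        System.out.printf("Total: $%d.%02d%n", totalCents / 100, totalCents % 100);
        // prints: Total: $13.33
    }
}
```

This is why currency code conventionally avoids floating point entirely: the scale factor is fixed and known, so integer storage loses nothing.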

## IEEE 754 Standard

• Normalized Mantissa (Base 2) - in "normalized" form, a floating point value in binary is represented in two pieces - a mantissa 1.xxxx multiplied by 2 raised to the power of some exponent, where xxxx represents some binary sequence of zeroes and ones in the mantissa. A normalized base 2 floating point value will always have a single 1 bit to the left of the (binary) point; that's what "normalized" means. Since every number will have a one there, encoding it in the mantissa is redundant; the 1 bit can be assumed without taking up any actual code bit space.
• Excess-127 Exponent (Base 2) - the IEEE 754 standard specifies that the "exponent" will be encoded as an 8-bit value, using the unsigned binary code of a value which is 127 more than (in "excess" of) the actual base 2 exponent required to represent the desired value.
• 32-bit Format - this format allows for approximately 7 decimal digits of precision and a scale of approximately 38 decimal digits (largest magnitudes around 3.4 times 10**38).
```
1-bit sign | 8-bit exponent | 23-bit mantissa
(sign bit = 1 for negative values)

For example:
43 4D 40 00 (hex)
re-written in binary:
0100 0011  0100 1101  0100 0000 0000 0000
re-grouped:
sign:   exponent(+127):   mantissa:
0       10000110         (1.)100110101000...
(positive)    134(dec)
exponent:
134-127 = 7
moving the decimal point 7 positions to the right:
1   1  0  0  1  1  0  1 .  0   1
128 64 32 16  8  4  2  1   1/2 1/4 (weights)
= +205.25
```
• 64-bit Format - the 64-bit ("double precision") format has the same structure as the 32-bit format except that it uses an 11-bit exponent encoded in excess-1023 notation and a 52-bit mantissa. This provides for approximately 15 decimal digits of precision and a scale of approximately 308 decimal digits.
• Special Values - note that some patterns are reserved for special values. An exponent field of all 0's or all 1's marks a non-standard (special) value: all 1's is used for infinities and NaN ("not a number") results, while all 0's is used for zero and for tiny "denormalized" values. Specifically, for example, the value 0 is represented by a floating point encoding with all 0's in both the exponent and mantissa fields.
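The worked 32-bit decoding above can be cross-checked in Java, whose standard library can reinterpret a raw 32-bit pattern as an IEEE 754 single-precision value (and back again):

```java
public class DecodeIeee754 {
    public static void main(String[] args) {
        int bits = 0x434D4000;                      // the example pattern 43 4D 40 00
        float value = Float.intBitsToFloat(bits);   // reinterpret as IEEE 754 single
        System.out.println(value);                  // prints 205.25

        // the reverse direction recovers the original bit pattern
        System.out.println(Integer.toHexString(Float.floatToIntBits(205.25f)));
        // prints 434d4000
    }
}
```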

## Other Floating Point Forms

Many mainframe computers were designed prior to the establishment of the IEEE 754 standard and employ their own formats for floating point encoding.
• IBM Mainframe - the IBM mainframe has three different floating point forms: a 32-bit, a 64-bit, and a 128-bit form. Unlike the IEEE forms, the exponent field is the same length for all three: 7 bits (in excess-64 notation); the longer floating point forms have increased precision but not increased scale. The encoded base is 16 (instead of 2), so the exponent is actually an indicator of how many times the point should be shifted 4 bits to the right (or to the left if the exponent is negative); this results in an effective decimal scale of roughly 75 digits (magnitudes from about 10**-79 to 10**+75). Note also that because the base is 16, the leading digit of a normalized value may be any hex digit from 1 to 15; that leading digit must therefore be encoded explicitly in the floating point form, with no "assumed 1" trick.

## Implementation Methods

How floating point arithmetic is actually performed varies among different computer systems.

• Software - in small or older microcomputer systems, floating point manipulation is/was done using software subroutines; most microcomputer languages still include these subroutines when you compile and link a program which uses floating point values.
• Math Co-Processor - some older microcomputers included a second "math" processor (either as a second chip or built-in to the main processor); this processor has instructions, not found in the main processor, for performing floating point (and often BCD) arithmetic directly with hardware circuits.
• Built-in Floating Point Instructions - mainframe computers and most modern desktop/laptop CPUs include floating point manipulation instructions within the main processor's instruction set, e.g. AMD Athlon, Intel Pentium.

## Floating-Point Errors

Floating point number systems in computers are a huge compromise. Mathematically, there are an infinite number of real values between any two numbers, but a finite computer can only represent a finite number of them. (A 32-bit floating-point number could, at best, only represent 2**32 different numbers out of the infinitely many possible.) The computer will also have limits on the largest and smallest floating-point numbers that can be represented. What are the chances that the computer can accurately represent the floating-point values you need in your calculations? (Hint: It's not very likely, unless you're extremely careful.)

Even staying well within the range of floating-point values possible, floating-point sacrifices some precision (bits) in exchange for being able to use those bits for an exponent to increase range. A 32-bit integer has more precision than a 32-bit floating-point value, since some of the floating-point bits are used for the exponent and sign. The lack of precision and the separate exponent field means that there are large "gaps" between floating-point values. For example, consider the binary and decimal forms of two adjacent IEEE 754 single-precision 32-bit floating-point values with an exponent multiplier of 2**50 (note: IEEE mantissa size is 23+1 bits):

```
1.00000000000000000000000(2) times 2**50 = 1125899906842624.0 decimal
1.00000000000000000000001(2) times 2**50 = 1125900041060352.0 decimal
FLOATING-POINT ADJACENT NUMBER GAP:               134217728.0 decimal !

The difference between the above two numbers is just one bit in the
mantissa.  We can get the same GAP number by looking at that one bit:

0.00000000000000000000001(2) times 2**50
= 2**(-23) times 2**50
= 2**27
= 134,217,728 decimal, the same as above GAP number
```

The above two numbers differ by only one least-significant bit in the 23-bit mantissa - they are as close together as possible - yet they represent very different values, with a difference between them of 2**(50-23)=2**27, or 134,217,728 decimal. There is a huge number "gap" between these floating point numbers, caused by the small difference in the mantissa being amplified hugely when multiplied by the 2**50 exponent.

Even though we made the smallest possible change in the mantissa value - just one least-significant bit - the change was multiplied by the exponent and resulted in a change of 134,217,728 (2**27) in the actual value of the number. Since this is the smallest change we could make to the mantissa, we see that there is no way for the computer to represent any of the floating-point numbers in the huge 2**27 number gap between 1125899906842624.0 and 1125900041060352.0. If your calculation answer falls in this gap, the computer will either truncate down to 1125899906842624.0 or round up to 1125900041060352.0 - the nearest actually representable floating-point number - a potential error of up to half the 134,217,728 gap width, i.e. 67,108,864!

All IEEE 754 single-precision floating-point numbers that have 2**50 as their exponent have gaps of 2**(50-23)=2**27 between them. Bigger exponents cause bigger gaps between bigger numbers; smaller exponents cause smaller gaps between smaller numbers. For example, all floating-point numbers with exponents of 2**30 have gaps of 2**(30-23)=2**7 between them. Floating-point numbers with exponents of 2**23 (from 8,388,608 to 16,777,215) have gaps of 2**(23-23)=2**0 (1) between them - they behave like integers, because you cannot express any non-integer fractional values in this range. Numbers with exponents of 2**0 have gaps of 2**(0-23)=2**(-23) between them. The gaps get smaller as the magnitude of the number shrinks, but the gap never goes to zero. (It can't - a finite set of floating-point values can never cover the infinite number of real numbers between any pair of values.)
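These gap widths can be confirmed directly in Java: the standard Math.ulp method returns the distance from a floating-point value to its nearest representable neighbour of larger magnitude ("unit in the last place"):

```java
public class FloatGaps {
    public static void main(String[] args) {
        // gap (ulp) width at various single-precision magnitudes,
        // written with Java hex-float literals (0x1p50f means 2**50)
        System.out.println(Math.ulp(0x1p50f));  // 2**(50-23) = 1.34217728E8
        System.out.println(Math.ulp(0x1p30f));  // 2**(30-23) = 128.0
        System.out.println(Math.ulp(0x1p23f));  // 2**(23-23) = 1.0
        System.out.println(Math.ulp(1.0f));     // 2**(0-23), about 1.19E-7
    }
}
```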

The larger the exponent, the bigger the gaps are in adjacent floating-point numbers. At the upper limit of IEEE 754 single-precision floating point, with an exponent value of approximately 2**128, one-bit changes in the mantissa result in actual numeric value changes of approximately 2**(128-23)=2**105 or 40564819207303340847894502572032 decimal! Adjacent IEEE floating point numbers with large exponents are very far apart!

### Failure to add floating-point numbers

Because of the gaps between adjacent floating-point numbers, math that tries to add numbers smaller than the gap width fails. If A and B are adjacent floating-point numbers (e.g. one least-significant mantissa bit different), then the computer has no way of representing any numbers in the gap between A and B. If you do math A+X, and X is smaller than half the gap between A and B, then X won't be big enough to change A into B and nothing will happen to the value of A; the answer is wrong (too small). You can't go part-way into the gap; you can only go from A to B. Adding values of X that are more than half the gap width between A and B will trigger a jump from A all the way up to B; this answer is also wrong (too big).

Refer to our example above, where our adjacent floating-point numbers are A=1125899906842624.0 and B=1125900041060352.0 with a gap of 134,217,728 (2**27) between them. Attempts to add any value X that is half of the gap (2**26 = 67,108,864) or less will have no effect on numbers this large - the value won't be large enough to bridge the gap between the numbers and change the floating-point number A into its adjacent neighbour B.

Adding any numbers to A that are larger than half the gap size will push the sum across the gap up to the next available floating-point number, B. Because of the rounding up, the sum will be wrong - too big - but it will be less wrong than staying with the original number A. For example:

```
1125899906842624.0 + 134217728.0/2 = 1125899906842624.0 [NO CHANGE!]
1125899906842624.0 + 200000000.0/2 = 1125900041060352.0 [JUMPS THE GAP!]
```

The last answer above should be 1125900006842624.0, but that number lies in the gap between A=1125899906842624.0 and B=1125900041060352.0 and is therefore not representable. Since the desired answer 1125900006842624.0 is closer to B=1125900041060352.0 than to A=1125899906842624.0, the computer rounds up to B and jumps the gap to pick the larger value (though it's wrong).
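The two additions above can be reproduced in Java using single-precision float variables:

```java
public class GapJump {
    public static void main(String[] args) {
        float a = 1125899906842624.0f;   // A = 2**50
        float b = 1125900041060352.0f;   // B = A + 2**27, A's adjacent neighbour

        // adding exactly half the gap: rounds back to A, the addend vanishes
        System.out.println(a + 67108864.0f == a);   // prints true

        // adding more than half the gap: rounds up, jumping all the way to B
        System.out.println(a + 100000000.0f == b);  // prints true
    }
}
```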

### Changing order of operations to minimize error

Mathematically, A+(B+C) always equals (A+B)+C. This is not true for floating-point arithmetic. Below are some examples, using numbers from above. Recall that the gap between floating point numbers with exponent 2**50 is 2**(50-23)=2**27 and any number half the gap size or smaller will disappear (have no effect) when used in an addition:

```
A = 1125899906842624.0    [2**50]
B = 67108864.0            [2**26 - half the GAP size]
C = 67108864.0            [2**26 - half the GAP size]

Mathematically, A+B+C should sum to 2**50 + 2*(2**26) or 2**50+2**27
and, mathematically, we should get that answer using (A+B)+C or A+(B+C).
Watch how the order of operations changes the answer given by the computer:

1. (A+B)+C = (2**50+2**26)+2**26 = (2**50)+2**26 = 2**50 [no change!]
[wrong answer, because 2**26 is half the gap and A+B=A and A+C=A]
2. A+(B+C) = 2**50+(2**26+2**26) = 2**50+2**27 = correct answer!
[correct answer, because B+C=2**27 is large enough to bridge the gap]
```

For most accurate results, floating-point math should sum positive numbers of equivalent magnitude, i.e. similar exponents. Add up all your small numbers before you add the sum to the big numbers.
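The order-of-operations example above can be run directly in Java with single-precision float arithmetic:

```java
public class FloatAssociativity {
    public static void main(String[] args) {
        float a = 1125899906842624.0f;   // 2**50
        float b = 67108864.0f;           // 2**26 - half the gap at this magnitude
        float c = 67108864.0f;           // 2**26 - half the gap at this magnitude

        // left-to-right: b and c each vanish against a, one at a time
        float wrong = (a + b) + c;

        // small numbers first: b+c = 2**27 is big enough to bridge the gap
        float right = a + (b + c);

        System.out.println(wrong == a);                         // prints true
        System.out.println(right == 1125900041060352.0f);       // prints true
        System.out.println(wrong == right);                     // prints false
    }
}
```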

The opposite is true when subtracting floating-point numbers: Avoid subtracting floating-point numbers that are nearly equal - the cancellation of the significant digits leaves only the unreliable less-significant digits in the answer.
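As a sketch of that cancellation hazard (a standard numerical-analysis illustration, not taken from the notes above): computing sqrt(x+1) - sqrt(x) for large x subtracts two nearly equal values, and in single precision every significant digit cancels; the algebraically equivalent form 1/(sqrt(x+1) + sqrt(x)) avoids the subtraction entirely:

```java
public class Cancellation {
    public static void main(String[] args) {
        float x = 1.0e8f;   // at this magnitude the float gap is 8, so x+1.0f == x

        // naive form: the two square roots are (here, exactly) equal in float,
        // so the subtraction cancels to 0 - every meaningful digit is lost
        float naive = (float) Math.sqrt(x + 1.0f) - (float) Math.sqrt(x);

        // rewritten form: same algebra, but no subtraction of near-equals
        float stable = 1.0f / ((float) Math.sqrt(x + 1.0f) + (float) Math.sqrt(x));

        System.out.println(naive);    // prints 0.0 - completely wrong
        System.out.println(stable);   // about 5.0E-5 - close to the true answer
    }
}
```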

### Failure to compare floating-point numbers

Since floating-point numbers are only finite approximations of real mathematical values, it makes no sense to compare floating-point numbers for exact equality. Two floating-point approximations may never be exactly equal, yet they may be close enough to each other to be considered "approximately equal".

```
if ( Math.abs(a - b) < 1.0e-5 ) {
    System.out.println("Close enough to be approximately equal.");
}
```

Note that since the spacing between floating-point numbers gets larger as the numbers get larger, you may have to adjust your idea of "approximately equal" to suit the magnitude of the numbers you are comparing. For numbers such as those above, i.e. with an exponent of 2**50 and an inter-number gap of 2**27, the "approximately equal" tolerance should be at least the width of the gap (2**27) and more likely several times larger than the gap width, such as 2**29 or 2**30:

```
// hexadecimal floating-point constant with a power-of-two exponent
// (in Java, 2**30 is written as the hex-float literal 0x1.0p30):
if ( Math.abs(a - b) < 0x1.0p30 ) {
    System.out.println("Close enough to be approximately equal.");
}
```
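Rather than hand-picking a tolerance for each magnitude, the tolerance can be derived from the gap width itself using Math.ulp. This is a common technique; the method name and the factor of 4 below are illustrative choices, not a standard API:

```java
public class ApproxEqual {
    // true if a and b are within a few representable steps of each other;
    // the tolerance scales automatically with the magnitude of the inputs
    static boolean approxEqual(float a, float b) {
        float tolerance = 4.0f * Math.ulp(Math.max(Math.abs(a), Math.abs(b)));
        return Math.abs(a - b) <= tolerance;
    }

    public static void main(String[] args) {
        // adjacent floats near 2**50: one gap (2**27) apart, still "equal"
        System.out.println(approxEqual(1125899906842624.0f, 1125900041060352.0f));
        // prints true

        System.out.println(approxEqual(1.0f, 2.0f));   // prints false
    }
}
```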

### Summary of Floating Point Errors

• Floating-point numbers are approximations. Computers contain a finite sample of the infinite number of floating-point values. Hope that your answer is one of the values the computer can represent (or is very close to it).
• Never compare floating-point approximations for equality. Test the absolute value of the difference against an acceptable "approximately equal" value suited to the range of the numbers being compared.
• Associativity does not hold for floating-point numbers.
• Smaller floating-point numbers are spaced closer together than larger floating-point numbers - the "gap" width between adjacent floating-point numbers varies with the size of the exponent used.
• Adding floating-point numbers of different magnitudes is subject to loss of precision. Where possible, only add numbers of similar magnitude. (Add up all your small numbers before you tackle the big ones.)
• Avoid subtracting numbers that are nearly equal - the cancellation of the significant digits leaves only the unreliable less-significant digits in the answer.
• Read the literature on floating-point before you try to use it!