Encoding non-integer values requires the use of scientific or floating-point notation. Many floating-point systems exist, but the most common is the IEEE 754 standard.
A value encoded in floating-point format is composed of two major components: a mantissa and an exponent. The structure of these components is examined below for the IEEE 754 standard.
Floating-point values may be handled completely through software, by means of an extra "math co-processor" chip, or by complex instructions built into the main processor. Most modern desktop and laptop computers have floating point built into the CPU chip (e.g. AMD Athlon, Intel Pentium). Floating point emulated in software is much, much slower than hardware floating point, but the software can be written to retain greater precision or range than the native hardware.
1-bit sign | 8-bit exponent | 23-bit mantissa    (sign bit = 1 for negative values)

For example: 43 4D 40 00 (hex)

re-written in binary:   0100 0011 0100 1101 0100 0000 0000 0000

re-grouped:

    sign:        exponent(+127):   mantissa:
    0            10000110          (1.)100110101000...
    (positive)   134(dec)

exponent: 134-127 = 7

moving the decimal point 7 positions to the right:

      1    1    0    0    1    1    0    1   .   0    1
    128   64   32   16    8    4    2    1      1/2  1/4   (weights)

    = +205.25
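This decoding can be double-checked in Java, which exposes the raw IEEE 754 single-precision bit pattern of a float through Float.intBitsToFloat() and Float.floatToIntBits(). A minimal sketch (not part of the original example):

    // Verify the worked example: the bit pattern 0x434D4000 decodes to +205.25
    int bits = 0x434D4000;
    float value = Float.intBitsToFloat(bits);
    System.out.println(value);                                   // prints 205.25
    // ...and encoding 205.25 gives back the same bit pattern:
    System.out.println(Integer.toHexString(Float.floatToIntBits(205.25f)));   // prints 434d4000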
How floating point arithmetic is actually performed varies among different computer systems.
Floating point number systems in computers are a huge compromise. Mathematically, there are an infinite number of real values between any two numbers, but a finite computer can only represent a finite number of those values. (A 32-bit floating-point number could, at best, represent only 2**32 different numbers out of the infinite number of numbers possible.) The computer will also have limits on the largest and smallest floating-point numbers that can be represented. What are the chances that the computer can accurately represent the floating-point values you need in your calculations? (Hint: It's not very likely, unless you're extremely careful.)
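In Java, the largest and smallest positive single-precision values are available as constants; a quick sketch for reference (the values in the comments are the standard IEEE 754 single-precision limits):

    System.out.println(Float.MAX_VALUE);   // largest finite float, about 3.4E38
    System.out.println(Float.MIN_VALUE);   // smallest positive float, about 1.4E-45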
Even staying well within the range of floating-point values possible, floating-point sacrifices some precision (bits) in exchange for being able to use those bits for an exponent to increase range. A 32-bit integer has more precision than a 32-bit floating-point value, since some of the floating-point bits are used for the exponent and sign. The lack of precision and the separate exponent field mean that there are large "gaps" between floating-point values. For example, consider the binary and decimal forms of two adjacent IEEE 754 single-precision 32-bit floating-point values with an exponent multiplier of 2**50 (note: IEEE mantissa size is 23+1 bits):
    1.00000000000000000000000(2) times 2**50 = 1125899906842624.0 decimal
    1.00000000000000000000001(2) times 2**50 = 1125900041060352.0 decimal

    FLOATING-POINT ADJACENT NUMBER GAP:              134217728.0 decimal !

The difference between the above two numbers is just one bit in the mantissa.
We can get the same GAP number by looking at that one bit:

    0.00000000000000000000001(2) times 2**50
        = 2**(-23) times 2**50
        = 2**27
        = 134,217,728 decimal, the same as above GAP number
The above two numbers differ by only one least-significant bit in the 23-bit mantissa - they are as close together as possible - yet they represent very different values, with a difference between them of 2**(50-23)=2**27, or 134,217,728 decimal. There is a huge number "gap" between these floating point numbers, caused by the small difference in the mantissa being amplified hugely when multiplied by the 2**50 exponent.
Even though we made the smallest possible change in the mantissa value - just one least-significant bit - the change was multiplied by the exponent and resulted in a change of 134,217,728 (2**27) in the actual value of the number. Since this is the smallest change we could make to the mantissa, we see that there is no way for the computer to represent any of the floating-point numbers in the huge 2**27 number gap between 1125899906842624.0 and 1125900041060352.0. If your calculation answer falls in this gap, the computer will either truncate down to 1125899906842624.0 or round up to 1125900041060352.0 - the nearest actually representable floating-point number - a potential error of up to half the 134,217,728 gap width, or 67,108,864!
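Java can demonstrate this gap directly. A small sketch, using Math.nextUp() to find the adjacent floating-point number just above A and confirm the 2**27 gap:

    float a = 1125899906842624.0f;               // 2**50
    float b = Math.nextUp(a);                    // the adjacent float just above a
    System.out.println(b - a == 134217728.0f);   // true: the gap is exactly 2**27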
All IEEE 754 single-precision floating-point numbers that have 2**50 as their exponent have gaps of 2**(50-23)=2**27 between them. Bigger exponents cause bigger gaps between bigger numbers; smaller exponents cause smaller gaps between smaller numbers. For example, all floating-point numbers with exponents of 2**30 have gaps of 2**(30-23)=2**7 between them. Floating-point numbers with exponents of 2**23 (from 8,388,608 to 16,777,215) have gaps of 2**(23-23)=2**0 (1) between them - they behave like 24-bit integers, because you cannot express any non-integer fractional values in this range. Numbers with exponents of 2**0 have gaps of 2**(0-23)=2**(-23) between them. The gaps get smaller as the magnitude of the number shrinks, but the gap never goes to zero. (It can't - there are an infinite number of real values between every pair of numbers, and a finite computer cannot represent them all.)
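Java's Math.ulp() returns exactly this gap (the "unit in the last place") for any given value, so the shrinking gaps can be observed directly; a brief sketch:

    System.out.println(Math.ulp(1125899906842624.0f));  // 2**27  = 134217728.0   (gap at 2**50)
    System.out.println(Math.ulp(16777216.0f));          // 2**1   = 2.0           (gap at 2**24)
    System.out.println(Math.ulp(8388608.0f));           // 2**0   = 1.0           (gap at 2**23)
    System.out.println(Math.ulp(1.0f));                 // 2**-23, about 0.00000012 (gap at 2**0)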
The larger the exponent, the bigger the gaps are between adjacent floating-point numbers. At the upper limit of IEEE 754 single-precision floating point, with an exponent value of approximately 2**128, one-bit changes in the mantissa result in actual numeric value changes of approximately 2**(128-23)=2**105 or 40564819207303340847894502572032 decimal! Adjacent IEEE floating point numbers with large exponents are very far apart!
Because of the gaps between adjacent floating-point numbers, math that tries to add numbers smaller than the gap width fails. If A and B are adjacent floating-point numbers (e.g. one least-significant mantissa bit different), then the computer has no way of representing any numbers in the gap between A and B. If you do math A+X, and X is smaller than half the gap between A and B, then X won't be big enough to change A into B and nothing will happen to the value of A; the answer is wrong (too small). You can't go part-way into the gap; you can only go from A to B. Adding values of X that are more than half the gap width between A and B will trigger a jump from A all the way up to B; this answer is also wrong (too big).
Refer to our example above, where our adjacent floating-point numbers are A=1125899906842624.0 and B=1125900041060352.0 with a gap of 134,217,728 (2**27) between them. Attempts to add any value X that is half the gap (2**26 = 67,108,864) or less will have no effect on numbers this large - the value won't be large enough to bridge the gap between the numbers and change one floating-point number A into its adjacent neighbour B.
Adding any numbers to A that are larger than half the gap size will push the sum across the gap up to the next available floating-point number, B. Because of the rounding up, the sum will be wrong - too big - but it will be less wrong than staying with the original number A. For example:
    1125899906842624.0 + 134217728.0/2 = 1125899906842624.0   [NO CHANGE!]
    1125899906842624.0 + 200000000.0/2 = 1125900041060352.0   [JUMPS THE GAP!]
The last answer above should be 1125900006842624.0, but that number lies in the gap between A=1125899906842624.0 and B=1125900041060352.0 and is therefore not representable. Since the desired answer 1125900006842624.0 is closer to B=1125900041060352.0 than to A=1125899906842624.0, the computer rounds up to B and jumps the gap to pick the larger value (though it's wrong).
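A short Java sketch reproduces this behaviour with single-precision float values (the same A and B used above); boolean comparisons make the rounding visible:

    float a = 1125899906842624.0f;                // A = 2**50
    float b = 1125900041060352.0f;                // B = A + 2**27, the adjacent float
    System.out.println(a + 67108864.0f  == a);    // true: half the gap or less is lost
    System.out.println(a + 100000000.0f == b);    // true: more than half the gap rounds up to B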
Mathematically, A+(B+C) always equals (A+B)+C. This is not true for floating-point arithmetic. Below are some examples, using numbers from above. Recall that the gap between floating point numbers with exponent 2**50 is 2**(50-23)=2**27 and any number half the gap size or smaller will disappear (have no effect) when used in an addition:
    A = 1125899906842624.0   [2**50]
    B = 67108864.0           [2**26 - half the GAP size]
    C = 67108864.0           [2**26 - half the GAP size]

Mathematically, A+B+C should sum to 2**50 + 2*(2**26) or 2**50+2**27 and, mathematically, we should get that answer using (A+B)+C or A+(B+C). Watch how the order of operations changes the answer given by the computer:

    1. (A+B)+C = (2**50+2**26)+2**26 = (2**50)+2**26 = 2**50   [no change!]
       [wrong answer, because 2**26 is half the gap and A+B=A and A+C=A]

    2. A+(B+C) = 2**50+(2**26+2**26) = 2**50+2**27 = correct answer!
       [correct answer, because B+C=2**27 is large enough to bridge the gap]
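The same non-associativity can be confirmed in Java float arithmetic; a brief sketch using the A, B, C values above:

    float A = 1125899906842624.0f;            // 2**50
    float B = 67108864.0f;                    // 2**26
    float C = 67108864.0f;                    // 2**26
    float correct = 1125900041060352.0f;      // 2**50 + 2**27, the mathematically correct sum
    System.out.println((A + B) + C == A);         // true: wrong answer - B and C both vanished
    System.out.println(A + (B + C) == correct);   // true: B+C bridges the gap, correct answer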
For most accurate results, floating-point math should sum positive numbers of equivalent magnitude, i.e. similar exponents. Add up all your small numbers before you add the sum to the big numbers.
The opposite is true when subtracting floating-point numbers: Avoid subtracting floating-point numbers that are nearly equal - the cancellation of the significant digits leaves only the unreliable less-significant digits in the answer.
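A small Java sketch of such a cancellation (the values 2**24, 0.75, and 1.25 are chosen just for this illustration and are not from the notes above): the large leading digits cancel in the subtraction, and all that remains is the rounding error introduced when the small value was first added to the large one:

    float big = 16777216.0f;                   // 2**24: the gap between floats here is 2.0
    System.out.println((big + 0.75f) - big);   // prints 0.0, not 0.75 - the 0.75 was lost
    System.out.println((big + 1.25f) - big);   // prints 2.0, not 1.25 - rounded up to the gap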
Since floating-point numbers are only finite approximations of real mathematical values, it makes no sense to compare floating-point numbers for exact equality. Two floating-point approximations may never be exactly equal, yet they may be close enough to each other to be considered "approximately equal".
    if ( Math.abs(a - b) < 1.0e-5 ) {
        System.out.println("Close enough to be approximately equal.");
    }
Note that since the spacing between floating-point numbers gets larger as the size of the numbers gets larger, you may have to adjust your idea of "approximately equal" to suit the magnitude of the numbers you are comparing. For numbers such as those above, i.e. with an exponent of 2**50 and an inter-number gap of 2**27, the tolerance for "approximately equal" should be at least the width of the gap (2**27), and more likely something several times larger than the gap width, such as 2**29 or 2**30:
    // hexadecimal floating-point constant with power-of-two exponent (2**30):
    if ( Math.abs(a - b) < 0x1.0p30 ) {
        System.out.println("Close enough to be approximately equal.");
    }
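An alternative sketch using a relative tolerance (the helper name approxEqual is mine, not from these notes): instead of hard-coding a gap width, scale the tolerance by the magnitude of the values being compared:

    // Relative tolerance: "close enough" scales with the size of the operands.
    static boolean approxEqual(float a, float b, float relTol) {
        return Math.abs(a - b) <= relTol * Math.max(Math.abs(a), Math.abs(b));
    }

    // e.g. approxEqual(a, b, 1.0e-5f) tolerates a difference in roughly the last
    //      two decimal digits of precision, whether a and b are near 1.0 or near 2**50.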