=================================================
Assignment #05 - Number Systems with Floating Point Numbers
=================================================
- Ian! D. Allen - idallen@idallen.ca - www.idallen.com
Sources for many of these answers (thank you!):
Ian Allen
Jim Khuu
Terence Christie
William Jarvis
Corrections by (thank you!):

1. Conversions from different bases and representations to decimal:
Reference: http://en.wikipedia.org/wiki/Signed_number_representations
a) Given the digits 010101(16), convert to decimal from base "16" =
16^5 16^4 16^3 16^2 16^1 16^0
1048576 65536 4096 256 16 1
0 1 0 1 0 1

0 + 65536 +0 + 256 +0 + 1
= 65793 base 10
b) Given the digits 010101(10), convert to decimal from base "10" =
10^5 10^4 10^3 10^2 10^1 10^0
100000 10000 1000 100 10 1
0 1 0 1 0 1

0 +10000 +0 +100 +0 +1
= 10101 base 10 [ of course! ]
c) Given the digits 010101(8), convert to decimal from base "8" =
8^5 8^4 8^3 8^2 8^1 8^0
32768 4096 512 64 8 1
0 1 0 1 0 1

0 + 4096 +0 +64 +0 + 1
= 4161 base 10
d) Given the digits 010101(2), convert to decimal from base "2" =
2^5 2^4 2^3 2^2 2^1 2^0
32 16 8 4 2 1
0 1 0 1 0 1

0 + 16 + 0 + 4 + 0 + 1
= 21 base 10
e) Given the digits 010101(2), convert to decimal from base "2" =
2^5 2^4 2^3 2^2 2^1 2^0
32 16 8 4 2 1
0 1 0 1 0 1

0 + 16 + 0 + 4 + 0 + 1
= 21 base 10
f) Given the digits 010101(2), convert to decimal from bias-127 =
Convert binary to unsigned decimal, then subtract 127:
2^5 2^4 2^3 2^2 2^1 2^0
32 16 8 4 2 1
0 1 0 1 0 1

0 + 16 + 0 + 4 + 0 + 1
= 21 - 127 = -106 base 10
g) Given the digits 010101(2), convert to decimal from bias-63 =
Convert binary to unsigned decimal, then subtract 63:
2^5 2^4 2^3 2^2 2^1 2^0
32 16 8 4 2 1
0 1 0 1 0 1

0 + 16 + 0 + 4 + 0 + 1
= 21 - 63 = -42 base 10
h) Given the digits 010101(2), convert to decimal from bias-31 =
Convert binary to unsigned decimal, then subtract 31:
2^5 2^4 2^3 2^2 2^1 2^0
32 16 8 4 2 1
0 1 0 1 0 1

0 + 16 + 0 + 4 + 0 + 1
= 21 - 31 = -10 base 10
i) Given the digits 010101(2), convert to decimal from bias-16 =
Convert binary to unsigned decimal, then subtract 16:
2^5 2^4 2^3 2^2 2^1 2^0
32 16 8 4 2 1
0 1 0 1 0 1

0 + 16 + 0 + 4 + 0 + 1
= 21 - 16 = 5 base 10
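The conversions in question 1 can be cross-checked with a short Python sketch. The helper name `from_bias` is hypothetical, introduced here only for illustration; `int(text, base)` is Python's built-in positional conversion.

```python
def from_bias(bits: str, bias: int) -> int:
    """Read the bits as unsigned binary, then subtract the bias."""
    return int(bits, 2) - bias

digits = "010101"

# Positional conversion: the same digits mean different values in each base.
for base in (16, 10, 8, 2):
    print(f"base {base:2}: {int(digits, base)}")

# Biased (excess-N) conversion: unsigned value minus the bias.
for bias in (127, 63, 31, 16):
    print(f"bias-{bias}: {from_bias(digits, bias)}")
```

The bias conversions print -106, -42, -10 and 5, which agrees with the answers worked out by hand above.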
2. Perform the following additions and subtractions in binary, assuming
a 6-bit word. Show the Result value plus the Zero, Sign, Carry, and
Overflow flag values for each (five answers for each).
(The Zero flag is on iff the Result is zero.)
The "Carry" flag indicates a "Borrow" when doing subtraction
of a big number from a smaller number.
CARRY:    1111
          011010          011010
        + 001111        - 001111
ANS:      101001          001011
Zero     = 0            Zero     = 0
Sign     = 1            Sign     = 0
Carry    = 0            Carry    = 0 (no borrow needed)
Overflow = 1            Overflow = 0

CARRY:   111111
          010111          010110
        + 101001        - 010110
ANS:      000000          000000
Zero     = 1            Zero     = 1
Sign     = 0            Sign     = 0
Carry    = 1            Carry    = 0 (no borrow needed)
Overflow = 0            Overflow = 0
The overflow flag comes on when the answer is wrong for two's
complement. The simple rule to remember is that overflow only
happens when pos+pos=neg or neg+neg=pos. For subtraction, note
that subtracting a positive is the same math as adding a negative,
and subtracting a negative is the same math as adding a positive.
In the above examples, we have:
a) pos+pos=neg is wrong = overflow
b) pos-pos=pos is fine = no overflow
c) pos+neg=pos is fine = no overflow
d) pos-pos=pos is fine = no overflow
Above, "pos-pos" is the same math as "pos+neg" (no overflow possible).
"neg-neg" would be the same math as "neg+pos" (no overflow possible).
3. What is the minimum number of binary bits needed to represent the number
of each day in the year (the Julian day number)?
Need to represent 1 through 366. 2**8=256 and 2**9=512, so choose 9 bits.
4. What is the minimum number of binary bits needed to represent the
number of each day in the year, if the number of days can be positive
or negative (e.g. "minus 300 days" or "today - 300")?
Need to represent -366 through +366. Using two's complement (and the
formula from the last assignment), we choose 10 bits with range:
-(2**9) to +((2**9)-1) or -512 to +511
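The bit counts in questions 3 and 4 can be checked with a small sketch. The function names `unsigned_bits` and `twos_complement_bits` are hypothetical helpers for this illustration, not part of any library.

```python
def unsigned_bits(max_value: int) -> int:
    """Bits needed to hold 0..max_value as an unsigned integer."""
    return max(1, max_value.bit_length())

def twos_complement_bits(lo: int, hi: int) -> int:
    """Smallest n whose range -(2**(n-1)) .. (2**(n-1))-1 covers lo..hi."""
    n = 1
    while lo < -(1 << (n - 1)) or hi > (1 << (n - 1)) - 1:
        n += 1
    return n

print(unsigned_bits(366))               # 9
print(twos_complement_bits(-366, 366))  # 10
```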
5. Unix/Linux has traditionally used a 32-bit signed integer to
store the number of seconds since midnight on January 1, 1970, UTC.
Calculate roughly in what year/month/day this value overflows and
time starts going negative.
32 bits signed means the max positive seconds is +((2**31)-1) or
2,147,483,647 seconds.
A year is about 365.25 days or 31,557,600 seconds.
2,147,483,647 / 31,557,600 = 68.049650385 years.
year 1970 + 68 = year 2038
The remainder 0.049650385 years is about 18.134803121 days.
January 1 + 18 = January 19 (2038)
A Wikipedia search confirms the actual date as 03:14:07 UTC on Tuesday,
19 January 2038: http://en.wikipedia.org/wiki/Year_2038_problem
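The rough estimate above can be confirmed exactly with Python's standard `datetime` module. This is only a sketch of the arithmetic; it adds the largest 32-bit signed value, in seconds, to the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch: midnight on January 1, 1970, UTC.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# The largest value a 32-bit signed integer can hold.
max_seconds = 2**31 - 1

last_moment = epoch + timedelta(seconds=max_seconds)
print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00
```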
6. If possible, convert the following decimal values into 2's complement
form, assuming a 12-bit word. Show your results in both binary
and hexadecimal.
a) -1
-1 is negative, so treat as positive and bit flip later
1(10) = 001h as 12-bit hex
-> FFEh (flip the bits using bit-flip table)
-> FFFh (add one for two's complement)
FFFh = 1111 1111 1111(2)
(or - just remember that -1 is always "all bits on")
b) +693
693(10) = ??? hex -> use powers-of-16 table: 1,16,256,4096
693 / 256 = 2 rem 181
181 / 16 = 11 rem 5 [ and we write 11(10) = B(16) ]
5 / 1 = 5
+693 = 2B5h = 0010 1011 0101(2)
c) -693
-693 is negative so treat as positive and bit flip later
+693(10) = 2B5h (from above)
-> D4Ah (flip the bits using bit-flip table)
-> D4Bh (add one for two's complement)
-693 = D4Bh = 1101 0100 1011(2)
d) +2048
2048(10) = ??? hex -> use powers-of-16 table: 1,16,256,4096
2048 / 256 = 8
0 / 16 = 0
0 / 1 = 0
= 800h which is negative in 12 bits!
-> 2048 is too big to fit in 12 bits two's complement
-> max (using formula) is +((2**11)-1) = +2047
-> too big!
e) -2048
-2048 is negative so treat as positive and bit flip later
+2048(10) = 800h (from above)
-> 7FFh (flip the bits using bit-flip table)
-> 800h (add one for two's complement)
-2048 = 800h = 1000 0000 0000(2)
(or - just remember that the most negative number has the
sign bit on and nothing else in two's complement)
f) +4097
-> doesn't fit in 12 bits (use the formula to know this)
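As a cross-check on question 6, here is a minimal sketch of two's complement encoding. The function `to_twos_complement` is a hypothetical helper; it assumes the word width is a multiple of four so the hexadecimal form has a whole number of digits, and it returns `None` when the value does not fit.

```python
def to_twos_complement(value: int, width: int = 12):
    """Return (binary, hex) strings for value in a width-bit word, or None."""
    lo = -(1 << (width - 1))       # most negative value, e.g. -2048
    hi = (1 << (width - 1)) - 1    # most positive value, e.g. +2047
    if not lo <= value <= hi:
        return None                # does not fit in this word size
    # Masking a negative Python int yields its two's complement bit pattern.
    word = value & ((1 << width) - 1)
    return f"{word:0{width}b}", f"{word:0{width // 4}X}"

for v in (-1, 693, -693, 2048, -2048, 4097):
    print(v, to_twos_complement(v))
```

Masking gives the same bits as the "flip the bits and add one" procedure used in the hand calculation.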
7. Perform the indicated arithmetic in hexadecimal, assuming a 12-bit word.
Show the hexadecimal result plus the states of the Zero, Sign, Carry
and Overflow flags (five answers for each problem).
The "Carry" flag indicates a "Borrow" when doing subtraction
of a big number from a smaller number.
CARRY: 111 111
D8A 948 C8B ACE
+276 -35A +839 -BDF

000 5EE 4C4 EEF
Zero: on off off off
Sign: off off off on
Carry: on off on on
Overflow: off on* on off
(*) Subtracting a positive is the same as adding a negative,
and adding two negatives must give a negative, not a positive.
Or, consider that subtracting a positive from a negative must
generate a more negative number, not a positive number.
The overflow flag comes on when the answer is wrong for two's
complement. The simple rule to remember is that overflow only
happens when pos+pos=neg or neg+neg=pos. For subtraction, note
that subtracting a positive is the same math as adding a negative,
and subtracting a negative is the same math as adding a positive.
In the above examples, we have:
a) neg+pos=pos is fine = no overflow
b) neg-pos=pos is wrong = overflow
c) neg+neg=pos is wrong = overflow
d) neg-neg=neg is fine = no overflow
Above, "neg-pos" is the same math as "neg+neg" and overflow is possible,
and "neg-neg" is the same math as "neg+pos" (no overflow possible).
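The flag rules used in questions 2 and 7 can be sketched as a small fixed-width adder. The function `alu` and its return layout are assumptions made for this illustration. It follows the convention stated in the questions: on subtraction, Carry means a borrow was needed.

```python
def alu(a: int, b: int, op: str, width: int):
    """Add or subtract two width-bit words.

    Returns (result, zero, sign, carry, overflow).
    """
    mask = (1 << width) - 1
    sign_bit = 1 << (width - 1)
    a &= mask
    b &= mask
    if op == "+":
        raw = a + b
        carry = int(raw > mask)    # carry out of the top bit
        result = raw & mask
        # Overflow: both operands have the same sign, the result does not.
        overflow = int(((a ^ result) & (b ^ result) & sign_bit) != 0)
    else:
        raw = a - b
        carry = int(a < b)         # borrow: unsigned a is smaller than b
        result = raw & mask
        # Overflow: operands differ in sign and the result's sign differs from a.
        overflow = int(((a ^ b) & (a ^ result) & sign_bit) != 0)
    zero = int(result == 0)
    sign = int((result & sign_bit) != 0)
    return result, zero, sign, carry, overflow

# Question 7 (12-bit words), printed as hex result plus Z, S, C, V.
for a, op, b in ((0xD8A, "+", 0x276), (0x948, "-", 0x35A),
                 (0xC8B, "+", 0x839), (0xACE, "-", 0xBDF)):
    result, *flags = alu(a, b, op, 12)
    print(f"{a:03X} {op} {b:03X} = {result:03X}  Z,S,C,V = {flags}")
```

Calling the same function with `width=6` reproduces the binary answers in question 2.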
8. Express floating-point 123.456 as a normalized decimal number using
scientific notation with four digits of precision.
Normalized: 1.23456 x 10**2
Four digits: 1.234 x 10**2 (or 1.235 with rounding)
9. Add floating-point decimal 1234000.0 to 1.5 and express the result as
a normalized decimal number using scientific notation with four
digits of precision.
1234000.0 + 1.5 = 1234001.5
Normalized: 1.2340015 x 10**6
Four digits: 1.234 x 10**6
10. Add *binary* floating-point 1111000.0 to 1.1 and express the result
as a normalized binary number using (binary) scientific notation
with four (binary) digits of precision.
1111000.0 + 1.1 = 1111001.1
Normalized: 1.1110011 x 2**6
Four digits: 1.111 x 2**6
11. Looking at the two previous questions, is it possible in a computer
to add a number to a floating-point number without having any effect,
i.e. is it true that A+B=B for certain floating-point values of A and B?
Yes. Both previous questions show cases where A+B=A when A is big
and B is small. Because the number of bits of precision is fixed
in ordinary floating-point arithmetic inside a computer, there will
be some arithmetic that will not have enough precision to represent
the true answer. Adding two values of greatly differing magnitudes
(e.g. add 3 to 10**99) usually leaves the larger number unchanged,
because there are not enough bits of precision to represent the
small number being added to it.
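The absorption effect described above is easy to observe. The sketch below assumes Python's `float` is a 64-bit IEEE 754 double, which holds on common platforms. The helper `to_float32` is hypothetical; it rounds a value to single precision by packing it with the standard `struct` module.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Double precision: adding 3 to 10**99 leaves the big number unchanged.
big = 1e99
print(big + 3 == big)                     # True

# Single precision: adding 1 to 2**25 is lost when the sum is stored.
a = 2.0 ** 25
print(to_float32(a + 1.0) == a)           # True

# Decimal analogue of question 9: keep four significant digits.
print(float(f"{1234000.0 + 1.5:.3e}"))    # 1234000.0
```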
12. Encode the decimal value +274.5625 as a 32-bit IEEE 754 floating
point field and show your final answer in hexadecimal.
274.5625(10) = 100010010.1001(2) [ see previous labs for how ]
Normalized: 1.000100101001 x 2**8
Mantissa part: .000100101001 (drop the leading 1.)
- pad on the right with zeroes to fill up 23 bits:
00010010100100000000000
Exponent part: 8
- excess-127 notation means add 127 before we convert to binary:
8+127 = 135 = 128+7 = 10000111(2)
Sign: 0 (positive)
In IEEE 754 single-precision (32-bit) format (1+8+23 bits):
= 0 10000111 00010010100100000000000
= 0100 0011 1000 1001 0100 1000 0000 0000
= 4 3 8 9 4 8 0 0
= 43894800h
13. Encode the decimal value -12.1875 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
1. Number is negative so the sign bit will be 1
2. Convert 12.1875 to binary 1100.0011(2)
3. Normalize the binary number 1.1000011 * 2**3
4. The binary digits to the right of the decimal become the mantissa.
Pad to the right with zeroes to fill up 23 bits:
10000110000000000000000
5. The exponent is 3. Bias it with 127 and it becomes 3+127 = 130.
Convert 130 to binary becomes 10000010 (128+2)
6. Put it all together in 1+8+23=32 bits like this:
= 1 10000010 10000110000000000000000
= 1100 0001 0100 0011 0000 0000 0000 0000
= C 1 4 3 0 0 0 0
= C1430000h
14. Encode the decimal value +0.0 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
+Zero is a special number with all-zero bits: 00000000h
15. Encode the decimal value -0.0 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
-Zero is a special number with all-zero bits except the sign: 80000000h
16. Encode the decimal value +1.0 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
1. 1.0(10) = 1.0(2)
2. Normalized = 1.0 x 2**0 (exponent is zero)
3. Exponent is 0.
Bias it with 127 -> 0+127 = 127 = (128-1) = 01111111(2)
4. The binary digits to the right of the decimal become the mantissa.
Pad to the right with zeroes to fill up 23 bits:
00000000000000000000000
5. Sign is 0
Exponent is 01111111
Mantissa is 00000000000000000000000
Result 0 01111111 00000000000000000000000
6. Grouping 0011 1111 1000 0000 0000 0000 0000 0000
= 3 F 8 0 0 0 0 0
= 3F800000h
17. Encode the decimal value -1.0 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
As for +1 (3F800000h), except turn on the negative sign bit:
= BF800000h
18. Encode the decimal value +2.0 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
Compare with +1.0 (from above):
+1.0 = 1.0(2) x 2**0
+2.0 = 10.0(2)
= 1.0(2) x 2**1
So do as above for +1, except with an exponent one greater.
Previous exponent plus one = 01111111+1 = 10000000(2)
5. Sign is 0
Exponent is 10000000
Mantissa is 00000000000000000000000
Result 0 10000000 00000000000000000000000
6. Grouping 0100 0000 0000 0000 0000 0000 0000 0000
= 4 0 0 0 0 0 0 0
= 40000000h
19. Encode the decimal value -2.0 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
As for +2 (40000000h), except turn on the negative sign bit:
= C0000000h
20. Encode the decimal value +4.0 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
Compare with +2.0 (from above):
+2.0 = 1.0(2) x 2**1
+4.0 = 100.0(2)
= 1.0(2) x 2**2
So do as above for +2, except with an exponent one greater.
Previous exponent plus one = 10000000+1 = 10000001(2)
5. Sign is 0
Exponent is 10000001
Mantissa is 00000000000000000000000
Result 0 10000001 00000000000000000000000
6. Grouping 0100 0000 1000 0000 0000 0000 0000 0000
= 4 0 8 0 0 0 0 0
= 40800000h
21. Encode the decimal value -4.0 as a 32-bit IEEE 754 floating point
field and show your answer in hexadecimal.
As for +4 (40800000h), except turn on the negative sign bit:
= C0800000h
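The encodings in questions 12 through 21 can be cross-checked with Python's standard `struct` module, which packs a value into IEEE 754 single precision with the `"f"` format. The helper names `float32_hex` and `float32_fields` are hypothetical, introduced only for this sketch.

```python
import struct

def float32_hex(x: float) -> str:
    """Big-endian IEEE 754 single-precision encoding of x, as hex digits."""
    return struct.pack(">f", x).hex().upper()

def float32_fields(x: float):
    """Split the encoding into (sign, 8-bit exponent, 23-bit mantissa)."""
    word = int.from_bytes(struct.pack(">f", x), "big")
    sign = word >> 31
    exponent = (word >> 23) & 0xFF     # still biased by 127
    mantissa = word & 0x7FFFFF         # hidden leading 1 is not stored
    return sign, f"{exponent:08b}", f"{mantissa:023b}"

for value in (274.5625, -12.1875, 0.0, -0.0, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0):
    print(f"{value:10} -> {float32_hex(value)}h  {float32_fields(value)}")
```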
22. Assuming the following eight-byte hex dump contains two Big-Endian,
32-bit, IEEE 754 encoded values: C2 2D C0 00 3F 60 00 00
decode both values shown in this dump as separate decimal values.
The two numbers are C22DC000h and 3F600000h
1. Write out C22DC000h in binary:
C 2 2 D C 0 0 0
1100 0010 0010 1101 1100 0000 0000 0000
2. Regroup as 1,8,23 bit pieces:
1 10000100 01011011100000000000000
sign is negative
exponent is 10000100
mantissa is 01011011100000000000000
3. Add back the hidden 1. to the left of the mantissa:
1.01011011100000000000000
4. Convert the exponent to decimal.
10000100(2) = 128+4 = 132.
Unbias the exponent by removing the excess 127:
132-127 = 5
Thus the original exponent factor was 2**5
5. Denormalize the mantissa using the exponent:
1.01011011100000000000000 x 2**5
= 101011.0111 x 2**0
6. Convert the denormalized binary fraction to decimal.
101011.0111(2) = 32+8+2+1 + 0.250+0.125+0.0625 = 43.4375(10)
Add the minus sign: -43.4375(10)
1. Write out 3F600000h in binary:
3 F 6 0 0 0 0 0
0011 1111 0110 0000 0000 0000 0000 0000
2. Regroup as 1,8,23 bit pieces:
0 01111110 11000000000000000000000
sign is positive
Exponent is 01111110
Mantissa is 11000000000000000000000
3. Add back the hidden 1. to the left of the mantissa:
1.11000000000000000000000
4. Convert the exponent to decimal.
01111110(2) = 128-2 = 126.
Unbias the exponent by removing the excess 127:
126-127 = -1
Thus the original exponent factor was 2**(-1)
5. Denormalize the mantissa using the exponent:
1.11000000000000000000000 x 2**(-1)
= 0.11100000000000000000000 x 2**0
6. Convert the denormalized binary fraction to decimal.
0.111 = 0.5+0.25+0.125 = 0.875
Number is positive: +0.875
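The decoding steps in question 22 can be sketched two ways: letting the standard `struct` module do the work, and following the manual steps (sign, unbiased exponent, hidden 1). The function `decode_normalized` is a hypothetical helper that handles only normalized numbers; it does not treat zero, denormalized values, infinity or NaN.

```python
import struct

# The eight-byte dump from the question, as two Big-Endian 32-bit floats.
dump = bytes.fromhex("C2 2D C0 00 3F 60 00 00")
first, second = struct.unpack(">ff", dump)
print(first, second)                   # -43.4375 0.875

def decode_normalized(word: int) -> float:
    """Manual decode of a normalized single-precision bit pattern."""
    sign = -1.0 if word >> 31 else 1.0
    exponent = ((word >> 23) & 0xFF) - 127        # remove the excess 127
    fraction = 1 + (word & 0x7FFFFF) / 2**23      # add back the hidden 1.
    return sign * fraction * 2.0**exponent

print(decode_normalized(0xC22DC000))   # -43.4375
print(decode_normalized(0x3F600000))   # 0.875
```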
23. The IEEE 754 floating-point number 81234567h is negative. Without
converting, give the hexadecimal for the same number, only positive.
Turn off the sign bit: 81234567h -> 01234567h
24. The IEEE 754 floating-point number 7EDCBA98h is positive. Without
converting, give the hexadecimal for the same number, only negative.
Turn on the sign bit: 7EDCBA98h -> FEDCBA98h
25. Without converting, cross out or delete all the IEEE 754 negative numbers,
leaving only the positive numbers:
1837A654h 7A6A3B65h 87B5CDE2h 90A5B5EFh A0000037h D1B8765Ah F0000000h
1837A654h 7A6A3B65h XXXXXXXXX XXXXXXXX XXXXXXXXX XXXXXXXXX XXXXXXXXX
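Questions 23 to 25 only touch the sign bit, so they reduce to bit masking on the 32-bit pattern. A minimal sketch, with hypothetical helper names:

```python
SIGN_BIT = 0x80000000   # top bit of a 32-bit single-precision word

def make_positive(word: int) -> int:
    """Turn the sign bit off."""
    return word & ~SIGN_BIT & 0xFFFFFFFF

def make_negative(word: int) -> int:
    """Turn the sign bit on."""
    return word | SIGN_BIT

def is_negative(word: int) -> bool:
    return bool(word & SIGN_BIT)

print(f"{make_positive(0x81234567):08X}h")   # 01234567h
print(f"{make_negative(0x7EDCBA98):08X}h")   # FEDCBA98h

words = [0x1837A654, 0x7A6A3B65, 0x87B5CDE2, 0x90A5B5EF,
         0xA0000037, 0xD1B8765A, 0xF0000000]
positives = [w for w in words if not is_negative(w)]
print([f"{w:08X}h" for w in positives])
```

A pattern is negative exactly when its first hex digit is 8 or higher.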
26. How does the answer to the previous question change if you are told that
all the bit patterns are really IEEE 754 double-precision numbers?
If 87B5CDE2h is double-precision, it is a 64-bit number, i.e. it should
be written as 0000000087B5CDE2h. The sign bit is off. All the given
bit patterns are positive numbers if taken as double-precision 64-bit,
because they all start with a zero sign bit (in 64 bits).
27. Which has more *Precision* available - a 32-bit integer or a 32-bit
floating-point number?
Integers have 32 bits of precision. Single-precision floats only
have about 24 bits of mantissa. Integers have more precision.
28. Which has more *Range* available - a 32-bit integer or a 32-bit
floating-point number?
32-bit integers only range plus or minus 2**31 (approximately).
32-bit floating point can range plus or minus 2**126 (approximately).
Floating point has more range.
29. IEEE 754 single-precision floating-point can store numbers in the
approximate range of -2**127 to +2**+127. Look up or use a calculator
to express this range (approximately) as powers of ten (decimal).
"Approximately plus/minus 10**38"
30. True/False: Decimal 1234.0 x 10**37 fits in IEEE 754 single precision
floating point.
1234.0 x 10**37 = 1.234 x 10**40 which exceeds 10**38 - FALSE.
31. True/False: Decimal 0.00001 x 10**40 fits in IEEE 754 single
precision floating point.
0.00001 x 10**40 = 1.0 x 10**35 which fits in 10**38 - TRUE.
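The range answers in questions 29 to 31 can be checked numerically. This sketch decodes the largest finite single-precision bit pattern (7F7FFFFFh) and compares magnitudes against it. The names `FLOAT32_MAX` and `in_float32_range` are assumptions for the illustration, and the check looks at range only, not precision.

```python
import math
import struct

# Largest finite single-precision value: exponent 11111110, mantissa all ones.
FLOAT32_MAX = struct.unpack(">f", bytes.fromhex("7F7FFFFF"))[0]

def in_float32_range(x: float) -> bool:
    """True if the magnitude of x does not exceed the largest finite float32."""
    return abs(x) <= FLOAT32_MAX

print(math.log10(2.0 ** 127))         # a little over 38, so 2**127 is about 10**38
print(in_float32_range(1234.0e37))    # False
print(in_float32_range(0.00001e40))   # True
```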
32. Cross out or delete the values that fit in IEEE 754 single precision
floating point with no loss of range or precision, leaving only the
values that do *not* fit completely accurately:
2**30-3 2**30-1 2**30 2**30+1 2**30+3 2**30+2**29
All the numbers fit within the 2**126 exponent range of IEEE 754.
Cross out anything that fits within 23 bits of precision:
2**30-3 2**30-1 XXXXX 2**30+1 2**30+3 XXXXXXXXXXX
For example, 2**30-1 is 111111111111111111111111111111(2) which
is 1.11111111111111111111111111111 x 2**29 and needs 30 bits of precision
(29 of them after the binary point).
Doesn't fit in a 23-bit IEEE mantissa.
For example, 2**30+2**29 is 1100000000000000000000000000000(2) = 1.1 x 2**30
which only needs two bits of precision (1.1 x 2**30). Fits.
33. Without converting, cross out or delete the sums that fit in IEEE 754
single-precision floating-point with no loss of range or precision,
leaving only the sums that do *not* fit accurately:
2**29+2**10+2**9+2**0 2**26+2**0 2**29+2**28+2**27+2**26
2**27+2**23+2**1 2**29+2**28+2**2+2**1
All the numbers fit within the 2**126 exponent range of IEEE 754.
Cross out anything that fits within 23 bits of precision:
2**29+2**10+2**9+2**0 2**26+2**0 XXXXXXXXXXXXXXXXXXXXXXX
2**27+2**23+2**1 2**29+2**28+2**2+2**1
For example, 2**26+2**0 needs 27 bits of precision:
= 100000000000000000000000001(2) = 1.00000000000000000000000001 x 2**26
For example, 2**29+2**28+2**27+2**26 only needs four bits of precision:
= 1.111 x 2**29
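Whether an integer survives the trip into single precision can be tested by a round trip through the standard `struct` module. The helper `fits_float32` is hypothetical. It assumes the integer is small enough to convert exactly to a Python double first, which holds for every value in questions 32 and 33.

```python
import struct

def fits_float32(n: int) -> bool:
    """True if n is exactly representable as an IEEE 754 single."""
    stored = struct.unpack(">f", struct.pack(">f", float(n)))[0]
    return stored == n

question_32 = {
    "2**30-3": 2**30 - 3, "2**30-1": 2**30 - 1, "2**30": 2**30,
    "2**30+1": 2**30 + 1, "2**30+3": 2**30 + 3, "2**30+2**29": 2**30 + 2**29,
}
question_33 = {
    "2**29+2**10+2**9+2**0": 2**29 + 2**10 + 2**9 + 2**0,
    "2**26+2**0": 2**26 + 2**0,
    "2**29+2**28+2**27+2**26": 2**29 + 2**28 + 2**27 + 2**26,
    "2**27+2**23+2**1": 2**27 + 2**23 + 2**1,
    "2**29+2**28+2**2+2**1": 2**29 + 2**28 + 2**2 + 2**1,
}
for name, value in {**question_32, **question_33}.items():
    print(f"{name:26} {'fits' if fits_float32(value) else 'does NOT fit'}")
```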
34. Why do the decimal numbers 2147483775 (0x8000007F) and 2147483648
(0x80000000) both convert to the same IEEE 754 single-precision
floating-point number 0x4F000000 that has decimal value 2147483648.0?
The number 2147483775 (0x8000007F) requires 32 bits of precision.
The last (rightmost) 8 bits of precision are thrown away when
converting to IEEE 754 single-precision, so the "7F" part of
0x8000007F disappears and it looks just like 0x80000000, which converts
back to 2147483648.0 decimal, not to 2147483775. You lose precision
when converting a 32-bit integer into a 23-bit mantissa.
In other words:
2147483775 in binary is 10000000000000000000000001111111
2147483648 in binary is 10000000000000000000000000000000
When you convert both of these numbers to IEEE 754 single precision
floating point number, the mantissa only holds 23 of those bits.
What is stored in the mantissa for 2147483775 is
(1.)00000000000000000000000 and the extra 01111111 gets discarded.
What is stored in the mantissa for 2147483648 is
(1.)00000000000000000000000 which is exactly the same as for 2147483775.
Even though the numbers are different, since an IEEE 754 single-precision
floating-point number only has a precision of 23 bits, both of these
numbers end up being the same when they are converted, because anything
beyond 23 bits is discarded.
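The collision in question 34 can be reproduced directly. This sketch packs both integers as single-precision values with the standard `struct` module. The third value, 2**31 + 256, is an extra example added here to show the next representable number above 2**31.

```python
import struct

def float32_hex(n: int) -> str:
    """Hex of the single-precision encoding of integer n."""
    return "0x" + struct.pack(">f", float(n)).hex().upper()

for n in (2147483775, 2147483648, 2**31 + 256):
    stored = struct.unpack(">f", struct.pack(">f", float(n)))[0]
    print(f"{n} ({n:#010X}) -> {float32_hex(n)} -> {stored}")
```

The first two lines show the same encoding, 0x4F000000. At this magnitude the lowest mantissa bit is worth 2**8 = 256, so anything smaller than that is lost.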
35. Explain why, in a computer, floating point mathematics may not be
associative or distributive, i.e. (A+B)+C may not equal A+(B+C).
Floating point arithmetic can lose precision when small numbers
are added to or subtracted from big numbers. If you arrange your
mathematics so that the small numbers are added to each other
first, they stand a better chance of affecting the bigger number.
Order matters.
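A concrete case makes the ordering effect visible. The values below are chosen for Python's `float`, assumed here to be a 64-bit IEEE 754 double, where representable numbers near 10**16 are spaced 2 apart.

```python
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c    # each small number is lost on its own
right = a + (b + c)   # the small numbers combine first, then register

print(left)           # 1e+16
print(right)          # 1.0000000000000002e+16
print(left == right)  # False
```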
36. How close to zero can you get with IEEE 754 32-bit floating point?
(What is the non-zero value that is closest to zero?) Express the
answer in both approximate power-of-two notation and in approximate
power-of-ten notation.
Approximately 2**(-126) (normalized IEEE) which is approximately 10**(-38).
Denormalized IEEE 754 numbers can get closer to zero at the expense
of some precision. Remember 10**(-38) and you won't be far wrong.
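Both limits can be read straight from their bit patterns. In this sketch, 00800000h is the smallest normalized single-precision value (exponent field 00000001, mantissa zero) and 00000001h is the smallest denormalized one. The variable names are illustrative only.

```python
import struct

smallest_normal = struct.unpack(">f", bytes.fromhex("00800000"))[0]
smallest_denormal = struct.unpack(">f", bytes.fromhex("00000001"))[0]

print(smallest_normal, smallest_normal == 2.0 ** -126)
print(smallest_denormal, smallest_denormal == 2.0 ** -149)
```

The denormalized value is 2**(-149), roughly 1.4 x 10**(-45), but it carries only one significant bit.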

 Ian! D. Allen  idallen@idallen.ca  Ottawa, Ontario, Canada
 Home Page: http://idallen.com/ Contact Improv: http://contactimprov.ca/
 College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
 Defend digital freedom: http://eff.org/ and have fun: http://fools.ca/