Basic Character Encoding : ASCII
Students should know from memory the basic layout of the ASCII character
encoding table.
What region of the table contains unprintable control characters?
What is the ASCII value of a space? Of the letters "a" and "A"?
What is the lowest standard-ASCII (7-bit) character's name and bit pattern?
What is the highest standard-ASCII (7-bit) character's name and bit pattern?
A 7-bit code provides enough different patterns to encode all the
characters found on a standard English language keyboard (including both upper and lower
case letters). The ASCII coding scheme was developed as such a 7-bit code. Today ASCII
is normally stored in 8-bit bytes, but only those codes with the left-most bit
set to 0, and the remaining 7 bits as in the original coding scheme, are standard.
The vast majority of ASCII encoding/decoding can be performed by knowing a few "base" codes:
the blank, the letter "A", the digit "0", the carriage-return, and the line-feed.
ASCII encoded files are usually composed of variable length "lines", each terminated with
a "carriage-return" (classic Macintosh), a "line-feed" (Unix), or a "carriage-return"-"line-feed" pair (MS-DOS/Windows).
Minimal Sizes for Codes Representing Characters
- Bytes
A byte is the group of bits that a particular computer uses for its most common character encoding scheme. The most common byte size is 8 bits, but 6-, 7-, 9-, and 12-bit bytes have been used by some computer systems.
- How Many Characters Are There?
- 10 decimal digit characters
- 26 letters (or 52 if different codes required for upper and lower case)
- between 10 and 30 "special" characters
- between 2 and 30 non-display "control" codes
- a minimal scheme would require at least 48 codes (10 + 26 + 10 + 2)
- a complete system would require at least 122 codes (10 + 52 + 30 + 30)
- a minimal scheme could be handled with a 6-bit code, since 6-bits provides 64 different patterns
- a "complete" scheme would require (at least) a 7-bit code; 7-bits give 128 different patterns
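The bit counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `bits_needed` is made up for this example.

```python
def bits_needed(code_count: int) -> int:
    """Smallest number of bits whose patterns cover code_count distinct codes."""
    # n bits give 2**n patterns, so we need the smallest n with 2**n >= code_count.
    return max(1, (code_count - 1).bit_length())

# Counts taken from the list above.
minimal_codes = 10 + 26 + 10 + 2     # digits, one-case letters, specials, controls
complete_codes = 10 + 52 + 30 + 30   # digits, two-case letters, specials, controls

print(minimal_codes, bits_needed(minimal_codes))     # 48 6
print(complete_codes, bits_needed(complete_codes))   # 122 7
```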
The Major ASCII Codes and Rules
- 7-Bit Standard - The standard ASCII encoding scheme is defined for 7-bit values. For computer systems using a larger than 7-bit byte, high order bits are zero for characters defined by the ASCII standard. Often high order bits (beyond the 7th bit) are used to provide additional "special" characters, but there is no standardization of the meaning of these codes.
- Blank / Space - 20h; codes below this value are used only for non-display, control codes
- Decimal Digits - 30h to 39h for "0" to "9" respectively
- Upper-case Alphabetic - 41h to 5Ah for "A" to "Z" respectively
- Lower-case Alphabetic - the corresponding upper-case code plus the value of a space (20h), i.e. 61h to 7Ah for "a" to "z" respectively
- Special Display Characters - in the "gaps" between 20h and 7Fh not assigned to digits or letters
- Control Characters - 00h to 1Fh, plus 7Fh (DEL); includes 0Dh, the "carriage return", and 0Ah, the "line feed"
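As a rough illustration of how far the "base" codes go, the sketch below derives other codes from them and sorts a byte value into the regions listed above. The function and constant names are invented for this example.

```python
# The "base" codes worth memorizing.
SPACE = 0x20
DIGIT_ZERO = 0x30
UPPER_A = 0x41
CR = 0x0D
LF = 0x0A

def digit_code(d: int) -> int:
    """Code for the decimal digit d (0..9)."""
    return DIGIT_ZERO + d

def upper_code(n: int) -> int:
    """Code for the n-th upper-case letter (1 = 'A', 26 = 'Z')."""
    return UPPER_A + (n - 1)

def lower_code(n: int) -> int:
    """Lower case is the upper-case code plus the value of a space."""
    return upper_code(n) + SPACE

def classify(code: int) -> str:
    """Name the region of the table that a byte value falls in."""
    if code > 0x7F:
        return "non-standard"      # high-order bit set: not defined by ASCII
    if code < SPACE or code == 0x7F:
        return "control"           # 00h-1Fh, plus DEL at 7Fh
    if code == SPACE:
        return "space"
    if 0x30 <= code <= 0x39:
        return "digit"
    if 0x41 <= code <= 0x5A:
        return "upper"
    if 0x61 <= code <= 0x7A:
        return "lower"
    return "special"               # the "gaps" between the other ranges

print(hex(digit_code(9)), hex(upper_code(26)), hex(lower_code(1)))  # 0x39 0x5a 0x61
print(classify(CR), classify(ord("!")), classify(0xE9))  # control special non-standard
```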
The Full ASCII Table
Low\Hi Nybbles | 0   | 1   | 2     | 3 | 4 | 5 | 6 | 7
0              | NUL | DLE | SPACE | 0 | @ | P | ` | p
1              | SOH | DC1 | !     | 1 | A | Q | a | q
2              | STX | DC2 | "     | 2 | B | R | b | r
3              | ETX | DC3 | #     | 3 | C | S | c | s
4              | EOT | DC4 | $     | 4 | D | T | d | t
5              | ENQ | NAK | %     | 5 | E | U | e | u
6              | ACK | SYN | &     | 6 | F | V | f | v
7              | BEL | ETB | '     | 7 | G | W | g | w
8              | BS  | CAN | (     | 8 | H | X | h | x
9              | HT  | EM  | )     | 9 | I | Y | i | y
A              | LF  | SUB | *     | : | J | Z | j | z
B              | VT  | ESC | +     | ; | K | [ | k | {
C              | FF  | FS  | ,     | < | L | \ | l | |
D              | CR  | GS  | -     | = | M | ] | m | }
E              | SO  | RS  | .     | > | N | ^ | n | ~
F              | SI  | US  | /     | ? | O | _ | o | DEL
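The table is indexed by the two hexadecimal digits (nybbles) of a code: the high nybble picks the column and the low nybble picks the row. A small sketch of that lookup, with helper names invented for this example:

```python
def split_nybbles(code: int) -> tuple:
    """Return (high nybble, low nybble), i.e. (column, row) in the table."""
    return code >> 4, code & 0x0F

def join_nybbles(hi: int, low: int) -> int:
    """Rebuild the code from its column (hi) and row (low)."""
    return (hi << 4) | low

hi, low = split_nybbles(ord("J"))
print(f"{hi:X} {low:X}")              # 4 A  -> column 4, row A
print(hex(join_nybbles(0x7, 0xF)))    # 0x7f -> DEL, bottom-right corner
```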
ASCII Files
- Line Terminated Files (vs. Fixed-length and Run-length)
- Standard ASCII files are "line terminated" files; that is, the lines (or "records") that make up a standard ASCII file can be separated from each other by means of one or more special control codes appended to each line.
- Fixed-length encoded files have records which are all the same length; for character files, this means that short lines need to be "padded" to the full line or "record" length with additional blank characters, and long lines need to be truncated or "wrapped around" into another line; nothing in the actual file indicates how the file is to be divided into lines or records.
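- Run-length encoded files, as the term is used here, have a character-count field at the beginning of each line (or "record") giving the length of that record; this is unrelated to run-length compression.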
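The three layouts can be compared by splitting the same two lines stored each way. This is a sketch only: the function names are invented, and the one-byte count field is an assumption made for the example, since the notes do not say how large the count field is.

```python
def split_line_terminated(data: bytes, terminator: bytes = b"\n") -> list:
    """Records are separated by a control code appended to each line."""
    parts = data.split(terminator)
    if parts and parts[-1] == b"":
        parts.pop()                      # drop the empty piece after the final terminator
    return parts

def split_fixed_length(data: bytes, record_length: int) -> list:
    """Nothing in the file marks a record; the length must be known in advance."""
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]

def split_count_prefixed(data: bytes) -> list:
    """Each record starts with a count field (assumed here to be one byte)."""
    records, pos = [], 0
    while pos < len(data):
        count = data[pos]
        records.append(data[pos + 1:pos + 1 + count])
        pos += 1 + count
    return records

print(split_line_terminated(b"ab\ncde\n"))     # [b'ab', b'cde']
print(split_fixed_length(b"ab   cde  ", 5))    # [b'ab   ', b'cde  ']
print(split_count_prefixed(b"\x02ab\x03cde"))  # [b'ab', b'cde']
```

Note that only the fixed-length form returns the padding blanks as part of each record.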
- Unix vs. MS-DOS Line Terminators - both Unix and MS-DOS (and Windows) make use of ASCII encoded files; however, the standard used for line termination is slightly different. For Unix, each line is terminated with a single "line feed" (0Ah) code. For MS-DOS (and Windows), each line is terminated with a "carriage return" (0Dh) and "line feed" (0Ah) pair of codes.
Why it matters - If you use a package like FTP to download a web-based "Text" (i.e. ASCII) file (where Unix is the assumed operating system) to an MS-DOS based system, the FTP process will automatically scan the file and replace each "line feed" character with a "carriage return"-"line feed" pair. If the file originated from an MS-DOS based system (and was uploaded to the web unchanged), then each line already ends with a "carriage return"-"line feed" pair. As a result, every line of the downloaded file would end with two "carriage return" codes and one "line feed" code; in many editors this will cause the file to appear to be double spaced. (One way around this is to transfer your files as "Binary" instead of "Text" so that FTP does not attempt any character expansion.)
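The double "carriage return" effect is easy to reproduce. The sketch below only models the expansion described above with a plain byte replacement; it is not how any particular FTP client is implemented, and the function names are invented.

```python
def unix_to_dos(data: bytes) -> bytes:
    """Naive text-mode expansion: every LF becomes CR LF."""
    return data.replace(b"\n", b"\r\n")

def to_dos_safely(data: bytes) -> bytes:
    """Normalize any existing CR LF pairs to LF first, then expand."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

unix_file = b"one\ntwo\n"
dos_file = b"one\r\ntwo\r\n"

print(unix_to_dos(unix_file))   # b'one\r\ntwo\r\n'      (what was intended)
converted = unix_to_dos(dos_file)
print(converted)                # b'one\r\r\ntwo\r\r\n'  (two CRs per line)
print(to_dos_safely(dos_file))  # b'one\r\ntwo\r\n'      (unchanged)
```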
Example of ASCII File Decoding
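As one possible decoding exercise, the sketch below turns a hexadecimal dump back into text. The byte values in `HEX_DUMP` are made up for illustration, and the function name is invented; only CR and LF are given names here, while other unprintable codes are shown as hexadecimal.

```python
# Hypothetical file contents, one byte per hexadecimal pair.
HEX_DUMP = "48 65 6C 6C 6F 20 31 32 33 0D 0A 42 79 65 0A"

CONTROL_NAMES = {0x0A: "LF", 0x0D: "CR"}

def decode_ascii_dump(hex_dump: str) -> str:
    """Decode a hex dump, showing control codes by name in angle brackets."""
    out = []
    for token in hex_dump.split():
        code = int(token, 16)
        if code in CONTROL_NAMES:
            out.append("<" + CONTROL_NAMES[code] + ">")
        elif 0x20 <= code <= 0x7E:
            out.append(chr(code))            # printable 7-bit ASCII
        else:
            out.append("<%02Xh>" % code)     # other control or non-standard code
    return "".join(out)

print(decode_ascii_dump(HEX_DUMP))   # Hello 123<CR><LF>Bye<LF>
```

The first line ends with a "carriage return"-"line feed" pair and the second with a lone "line feed", so this made-up file mixes the MS-DOS and Unix conventions.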
Basic Character Encoding : EBCDIC
EBCDIC material does not need to be memorized
You may need to decode/encode EBCDIC in an assignment
Character encoded data on IBM mainframe computers is normally based on a
scheme called EBCDIC.
The EBCDIC character encoding grew out of punched-card codes that preceded the ASCII encoding.
EBCDIC was developed from a basis that involved the computer punched card and has
features that, to be properly understood, require a knowledge of that
historical medium.
EBCDIC encoded files normally contain fixed-length records.
The Punched Card and Hollerith Codes
- Numeric Requirements: Digits and Signs
- Initially data processing was limited to numbers
- 10 digit symbols were required (0 to 9)
- 2 possible "signs" were required (+ or -)
- Punched Card Structure
- 80 columns wide
- each column could "hold" one of the numeric symbols required as a hole punched in the column, different distances from the top of the card for each symbol; there were 12 possible punch locations in each column resulting in 12 rows of possible punch locations across the entire card
- a punch in the top-most row of a column represented a plus sign (+)
- a punch in the second row from the top represented a minus sign (-)
- a punch in the third row from the top represented a zero (0) and so on from there until the bottom row was used to represent a nine (9)
- multi-digit values were normally punched with the sign of the number and the last digit of the number punched in the same column; this reduced the number of columns required for numbers and could be used as a marker for the end of a numeric value, separating contiguous values on the card
- Alphabetic Requirements: Digits and Zones
- using two punched holes in a single column (one in the top 3 rows, called the "zone" rows, and one in the bottom 9 rows, called the "digit" rows) provided enough codes to support a 26 letter alphabet (with an "extra" punch combination left over)
- a punch in the (top) plus(+) zone row and a punch in one of the 1 to 9 rows was used to represent the (upper case) letters from "A" to "I" respectively
- a punch in the minus(-) zone row and a punch in one of the 1 to 9 rows was used to represent the (upper case) letters from "J" to "R" respectively
- a punch in the zero(0) zone row and a punch in one of the 2 to 9 rows was used to represent the (upper case) letters from "S" to "Z" respectively; notice the zero-one punch combination was not used (perhaps because the card tended to tear if you had two holes that close together)
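The zone-plus-digit rules above amount to a small lookup. The sketch below follows those three rules; the function name and the use of "+", "-", and "0" as zone labels are choices made for this example.

```python
def punch_to_letter(zone: str, digit: int) -> str:
    """Map a (zone punch, digit punch) pair to an upper-case letter."""
    if zone == "+" and 1 <= digit <= 9:
        return chr(ord("A") + digit - 1)     # A..I
    if zone == "-" and 1 <= digit <= 9:
        return chr(ord("J") + digit - 1)     # J..R
    if zone == "0" and 2 <= digit <= 9:
        return chr(ord("S") + digit - 2)     # S..Z (zero-one is not used)
    raise ValueError("no letter for zone %r, digit %r" % (zone, digit))

print(punch_to_letter("+", 1), punch_to_letter("-", 9), punch_to_letter("0", 2))  # A R S
```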
EBCDIC Codes (Basic Codes)
- The Space / Blank
- hexadecimal code : 40h (twice as good as ASCII?)
- Digit Codes
- hexadecimal codes : F0h (for "0") to F9h (for "9")
- Alphabetic Codes
- the zone punch was encoded in the first hexadecimal digit of a byte with Ch being used for the plus-punch, Dh being used for the minus-punch, and Eh being used for the zero-punch
- the digit punch was encoded in the second hexadecimal digit of the byte using the decimal value of the punch location
- "A" to "I" were encoded as (hexadecimal) C1h to C9h
- "J" to "R" were encoded as (hexadecimal) D1h to D9h
- "S" to "Z" were encoded as (hexadecimal) E2h to E9h
- notice this produces an encoding scheme with "gaps" in the code values that are not used for alphabetic characters (namely CAh to D0h and DAh to E1h inclusive)
- lower case letters were defined as the corresponding upper case letter minus the code value for a space (similar to, but in fact the reverse of, the ASCII system)
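The rules in this section are enough to compute the basic EBCDIC codes without a table. The sketch below does so; the function names are invented for this example, and it covers only the space, digits, and letters described above.

```python
EBCDIC_SPACE = 0x40

def ebcdic_digit(d: int) -> int:
    """Code for the decimal digit d (0..9): F0h to F9h."""
    return 0xF0 + d

def ebcdic_upper(letter: str) -> int:
    """Zone punch goes in the first hex digit, digit punch in the second."""
    index = ord(letter) - ord("A")           # 0 for "A" .. 25 for "Z"
    if index < 9:
        zone, digit = 0xC, index + 1         # plus zone:  A..I -> C1h..C9h
    elif index < 18:
        zone, digit = 0xD, index - 9 + 1     # minus zone: J..R -> D1h..D9h
    else:
        zone, digit = 0xE, index - 18 + 2    # zero zone:  S..Z -> E2h..E9h
    return (zone << 4) | digit

def ebcdic_lower(letter: str) -> int:
    """Lower case is the upper-case code minus the code for a space."""
    return ebcdic_upper(letter.upper()) - EBCDIC_SPACE

print(hex(ebcdic_upper("A")), hex(ebcdic_upper("S")), hex(ebcdic_lower("a")))  # 0xc1 0xe2 0x81
```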
Standard EBCDIC Files
- Fixed-length Records
- standard EBCDIC encoded files are composed of fixed length records (with 80 characters, just like a punched card, still being the most common length)
- short lines are "padded" on the right with blanks to fill out to the fixed length for a specific file (again, this is most often 80 characters)
- lines longer than the file's fixed record size must be split to form (at least) two lines (records)
- "carriage return" and "line feed" codes are not used
- Record Length Information
- the record length used when the EBCDIC file was created is saved (on an IBM mainframe system) in the VTOC, Volume Table Of Contents, entry for the file (the VTOC is equivalent to an MS-DOS directory); there is no indication of the record size anywhere in the actual file itself.
- it is the programmer's responsibility to code programs so that they use or ask for the proper record length; if a file were created with 80 character EBCDIC records and the programmer's code read 60 character records, the first "read" would get the first 60 characters of the first record, the second "read" would get the last 20 characters of the first record followed by the first 40 characters of the second record, the third "read" would get the last 40 characters of the second record plus the first 20 characters of the third record, and so on...
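The 80-versus-60 mismatch described above can be simulated in a few lines. In this sketch each 80-character record is filled with a single marker character ("1", "2", or "3") so the slippage is easy to see; the markers are ordinary ASCII bytes used only for readability, and the function name is invented.

```python
def read_records(data: bytes, record_length: int) -> list:
    """Cut the file into consecutive records of the given length."""
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]

# Three 80-character records, filled with "1", "2", and "3" respectively.
file_bytes = b"1" * 80 + b"2" * 80 + b"3" * 80

for number, record in enumerate(read_records(file_bytes, 60), start=1):
    counts = {chr(marker): record.count(marker) for marker in sorted(set(record))}
    print(number, counts)
# 1 {'1': 60}
# 2 {'1': 20, '2': 40}
# 3 {'2': 40, '3': 20}
# 4 {'3': 60}
```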
EBCDIC vs. ASCII Character Sequences
- in EBCDIC: lower-case letters precede (are less than) upper-case letters which, in turn, precede digits
- in ASCII: digits precede (are less than) upper-case letters which, in turn, precede lower-case letters
- For example, if telephone directory listings were created from two different systems, an ASCII-based system and an EBCDIC-based system, certain company names would occur in different places in the listing, if the listing were sorted by numeric character code:
ASCII sorted directory listing
- 1-for-All Rentals
- A1 Movers
- alpha-1 Insurance
EBCDIC sorted directory listing
- alpha-1 Insurance
- A1 Movers
- 1-for-All Rentals
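The two orderings can be reproduced by sorting on the encoded bytes. This sketch assumes Python's built-in "cp037" codec as the EBCDIC variant; it is one of several EBCDIC code pages, and the notes do not say which one they have in mind.

```python
names = ["A1 Movers", "alpha-1 Insurance", "1-for-All Rentals"]

# Sort by the numeric character codes each system would store.
ascii_order = sorted(names, key=lambda name: name.encode("ascii"))
ebcdic_order = sorted(names, key=lambda name: name.encode("cp037"))

print(ascii_order)   # ['1-for-All Rentals', 'A1 Movers', 'alpha-1 Insurance']
print(ebcdic_order)  # ['alpha-1 Insurance', 'A1 Movers', '1-for-All Rentals']
```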
For a side-by-side comparison, see: http://www.natural-innovations.com/computing/asciiebcdic.html
Q: If you examine an EBCDIC text file copied byte-for-byte onto an ASCII
system such as Unix/Linux or DOS/Windows/Macintosh, what will you see on
your ASCII screen? (Hints: [1] Do the EBCDIC letters and numbers match
any printable 7-bit ASCII characters? [2] Do EBCDIC sentence punctuation
and space characters match any printable ASCII characters?)