Character Encoding / Line Ends

Computers use binary bit patterns to represent not only numbers, but also characters such as digits, letters, and punctuation. A text file contains binary bit patterns that map to printable characters according to some mapping table. While the all-zeroes bit pattern, 00000000, usually represents zero when encoded as an integer, a printable zero digit character, i.e. '0', is not usually encoded in a text file using that same all-zeroes bit pattern.

For example, the ASCII character code uses the 7-bit pattern 0x30 to encode a printable zero digit character. EBCDIC uses the 8-bit pattern 0xF0 to encode the same digit. These are clearly not the same bit patterns used to represent an integer value of zero. To display the number 12 on the screen using the ASCII code, the character encoding for the digit '1' would be sent first (0x31) followed by the character encoding for the digit '2' (0x32).
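The difference between a numeric zero and a printable '0' can be checked directly. The following is a minimal sketch; it assumes Python's "cp037" codec as a stand-in for EBCDIC, since the text above does not name a particular EBCDIC code page.

```python
# Compare the integer zero with the printable digit '0' in two encodings.
# Assumption: "cp037" is used as a representative EBCDIC code page.
integer_zero = (0).to_bytes(1, "big")   # the all-zeroes bit pattern
ascii_zero = "0".encode("ascii")        # the printable digit in ASCII
ebcdic_zero = "0".encode("cp037")       # the printable digit in EBCDIC

print(integer_zero.hex(), ascii_zero.hex(), ebcdic_zero.hex())

# The number 12 goes to the screen as two separate character codes.
ascii_twelve = "12".encode("ascii")
print([hex(b) for b in ascii_twelve])
```

The three bit patterns printed on the first line are all different, even though each of them "means zero" in some sense.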

Originally, here in North America, the bit patterns being used to encode characters only handled English. English only needed a 7-bit or 8-bit character size to include all the letters, digits, and common punctuation, so these character bit patterns all fit nicely into one computer "byte".

The mapping of bit pattern to printable character has gotten complex over the past decades due to the introduction of more and more different mappings to include more and more of the world's languages, current and past. (Not all the world speaks English!) For many years, the industry was reluctant to break the rule "one-character, one-byte", so many mutually incompatible 8-bit character mappings were developed to handle different languages in different parts of the world. The same 8-bit pattern might map to one printable character in Norway, and a different character in France or Greece. Creating a file containing both French and Greek characters was impossible.
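As a small illustration of this incompatibility, the sketch below decodes one and the same byte under two 8-bit mappings. The choice of ISO-8859-1 (Western European) and ISO-8859-7 (Greek) is an assumption made for the example; the text above does not name specific mappings.

```python
# One 8-bit pattern, two different characters.
# Assumption: ISO-8859-1 and ISO-8859-7 stand in for "France" and "Greece".
raw = bytes([0xE9])

as_western = raw.decode("iso8859_1")   # Latin small letter e with acute
as_greek = raw.decode("iso8859_7")     # Greek small letter iota

# Print code points rather than glyphs so the output is terminal-independent.
print(f"U+{ord(as_western):04X}", f"U+{ord(as_greek):04X}")

# A single 8-bit mapping cannot hold both characters at once.
try:
    as_greek.encode("iso8859_1")
    greek_fits_in_latin1 = True
except UnicodeEncodeError:
    greek_fits_in_latin1 = False
```

The failed encode at the end is the reason a mixed French and Greek file could not be written with any one of these mappings.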

In 1991, a universal character set, Unicode, was introduced, using a 16-bit (two-byte) character format, with later updates to permit extensions to 32-bit characters as needed. This 16-bit character set broke the "one-character, one-byte" rule. It was incompatible with all one-byte character systems used to date, and thus much existing text manipulation software (sorting, indexing, etc.) could not process it.
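To see why byte-oriented software struggled, compare the same two characters in a one-byte and a two-byte format. This sketch assumes Python's "utf-16-be" codec as a modern stand-in for the original 16-bit format.

```python
# The same text in a one-byte and a two-byte character format.
# Assumption: "utf-16-be" stands in for the original 16-bit Unicode format.
text = "A0"

one_byte = text.encode("ascii")       # one byte per character
two_byte = text.encode("utf-16-be")   # two bytes per character

print(one_byte.hex(), two_byte.hex())

# Software that treats every byte as a character now sees zero bytes
# between the letters, which ASCII defines as the NUL control character.
contains_nul = 0 in two_byte
```

Any program that scans text one byte at a time would misread the two-byte form, which is one way the incompatibility shows up in practice.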

In 1993, UTF-8 was introduced: a variable-length encoding of Unicode, built from 8-bit units, that was backwards-compatible with ASCII, and thus with English. If you didn't use any non-English characters in your file, the file format was plain 7-bit ASCII. Only when you needed a non-ASCII Unicode character did you have to resort to multi-byte sequences in which every byte has its top bit set. Most software that expected ASCII could handle UTF-8 equally well. UTF-8 has become very popular in North America, since it treats ASCII as ASCII with no complications.
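The backwards compatibility can be demonstrated in a few lines. This is a minimal sketch; the sample words are arbitrary choices for the example.

```python
# Pure-ASCII text is byte-for-byte identical in ASCII and UTF-8.
plain = "cafe"
same_as_ascii = plain.encode("utf-8") == plain.encode("ascii")

# A non-ASCII character (e with acute, U+00E9) becomes a two-byte sequence.
accented = "caf\u00e9"
utf8 = accented.encode("utf-8")
print(utf8.hex())

# Every byte of a multi-byte sequence has its top bit set, so it can never
# be mistaken for a 7-bit ASCII character.
extra_bytes_have_top_bit = all(b & 0x80 for b in utf8[3:])
```

Four characters produce five bytes in the second case, which is what "variable-length" means here.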

Character encodings:

Character Encoding : ASCII

Students should know from memory the basic layout of the letters and digits in the 7-bit ASCII character encoding table. What region of the table contains unprintable control characters? What are the ASCII values of a space and of the letters "a" and "A"? What is the lowest standard-ASCII (7-bit) character's name and bit pattern? What is the highest standard-ASCII (7-bit) character's name and bit pattern?


The American Standard Code for Information Interchange (ASCII) coding scheme was developed as a 7-bit code. A 7-bit code provides enough different bit patterns (128) to permit a coding scheme for all the upper- and lower-case letters and the digits found on a standard English language keyboard, plus punctuation and some unprintable device control characters (e.g. Newline, Carriage Return, Bell, etc.).

Seven-bit ASCII encoding is normally stored in 8-bit bytes with the top (leftmost) bit set to zero. Some extended encodings based on ASCII set the top bit to one to include an additional 128 characters, e.g. ISO-8859-1 (Latin-1) is an 8-bit standard that includes the accented letters needed for Western European languages (including French).
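Testing the top bit is enough to tell standard ASCII bytes from extended ones. The helper below is a hypothetical name introduced only for this sketch.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte has its top (leftmost) bit set to zero."""
    return all((b & 0x80) == 0 for b in data)


ascii_bytes = "hello".encode("ascii")
latin1_bytes = "\u00e9".encode("iso8859_1")   # e with acute in Latin-1

print(is_seven_bit(ascii_bytes), is_seven_bit(latin1_bytes))
print(hex(latin1_bytes[0]))   # the extended character uses the top bit
```

The mask 0x80 is binary 10000000, so the test isolates exactly the bit that 7-bit ASCII leaves at zero.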

Before the development of standards for extended-ASCII encodings, each manufacturer of computer equipment made different, incompatible choices for what the extended characters represented. Files written on one machine often did not display properly on another.

Minimal Sizes for Codes Representing Characters

The Major ASCII Codes and Rules

The Full ASCII Table

Most ASCII encoding/decoding can be performed without tables by knowing a few base codes: the blank, the letter "A", the digit "0", the carriage-return, and the line-feed. The rest of the letters and digits can be figured out from these base codes.
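The arithmetic from base codes can be sketched as follows. The function names are hypothetical, and the lower-case rule (upper-case code plus 0x20) is an addition to the base codes listed above.

```python
# Base codes worth memorizing.
SPACE = 0x20
DIGIT_ZERO = 0x30
UPPER_A = 0x41
CR = 0x0D
LF = 0x0A


def digit_code(d: int) -> int:
    """ASCII code of the digit d (0..9), counted up from '0'."""
    return DIGIT_ZERO + d


def upper_code(n: int) -> int:
    """ASCII code of the n-th upper-case letter (0 for 'A', 25 for 'Z')."""
    return UPPER_A + n


def lower_code(n: int) -> int:
    """Lower-case letters sit exactly 0x20 above their upper-case forms."""
    return upper_code(n) + 0x20


print(hex(digit_code(7)), hex(upper_code(2)), hex(lower_code(2)))
```

For example, the letter "C" is the third letter, so its code is 0x41 + 2 = 0x43, and "c" is 0x43 + 0x20 = 0x63.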

Low\Hi Nybbles (rows: low nybble; columns: high nybble)

    0    1    2      3  4  5  6  7
0   NUL  DLE  SPACE  0  @  P  `  p
1   SOH  DC1  !      1  A  Q  a  q
2   STX  DC2  "      2  B  R  b  r
3   ETX  DC3  #      3  C  S  c  s
4   EOT  DC4  $      4  D  T  d  t
5   ENQ  NAK  %      5  E  U  e  u
6   ACK  SYN  &      6  F  V  f  v
7   BEL  ETB  '      7  G  W  g  w
8   BS   CAN  (      8  H  X  h  x
9   HT   EM   )      9  I  Y  i  y
A   LF   SUB  *      :  J  Z  j  z
B   VT   ESC  +      ;  K  [  k  {
C   FF   FS   ,      <  L  \  l  |
D   CR   GS   -      =  M  ]  m  }
E   SO   RS   .      >  N  ^  n  ~
F   SI   US   /      ?  O  _  o  DEL
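Reading the table amounts to joining a high nybble (the column) and a low nybble (the row) into one 7-bit code. A minimal sketch, with hypothetical function names:

```python
def ascii_code(hi: int, lo: int) -> int:
    """Combine a column (high nybble 0..7) and a row (low nybble 0..F)."""
    if not (0 <= hi <= 0x7 and 0 <= lo <= 0xF):
        raise ValueError("nybble out of range for 7-bit ASCII")
    return (hi << 4) | lo


def nybbles(code: int) -> tuple:
    """Split a 7-bit code back into (high nybble, low nybble)."""
    return divmod(code, 16)


print(hex(ascii_code(4, 0x1)))   # column 4, row 1
print(nybbles(ord("z")))         # where the letter z sits in the table
```

Column 4, row 1 gives 0x41, the letter "A"; the last cell, column 7, row F, gives 0x7F, which is DEL.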

ASCII Files

Summary of line terminators

ASCII-encoded files are usually composed of variable length lines of characters. Each line is terminated with one or more unprintable characters. The exact character or characters used at the end of each line depends on what computer system you are using.

a single carriage-return character (CR, 0x0D):
Used to end lines on pre-OS X Apple Macintosh systems.
a single line-feed character (LF, 0x0A):
Used to end lines on Unix/Linux/BSD systems and modern OS X systems.
a carriage-return followed by a line-feed (CR LF, 0x0D 0x0A):
Used to end lines on Microsoft MS-DOS and Windows systems.

When using a file-transfer program to move text (not binary) files between machines, you must know the consequences of incompatible line-end terminators.
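The three conventions can be detected and converted with simple byte operations. The sketch below uses hypothetical function names and assumes a file uses one convention throughout; files with mixed line ends would need more care.

```python
def detect_line_end(data: bytes) -> str:
    """Guess which line terminator a block of text bytes uses."""
    if b"\r\n" in data:
        return "CRLF"   # MS-DOS and Windows
    if b"\r" in data:
        return "CR"     # pre-OS X Macintosh
    if b"\n" in data:
        return "LF"     # Unix/Linux/BSD and OS X
    return "none"


def to_unix(data: bytes) -> bytes:
    """Convert CR LF or bare CR line ends to a single LF."""
    # Replace the two-byte pair first so its CR is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_text = b"one\r\ntwo\r\n"
print(detect_line_end(windows_text), to_unix(windows_text))
```

A conversion like this must only be applied to text files; applying it to a binary file would corrupt any byte that happens to equal 0x0D.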

Example of ASCII File Decoding



Basic Character Encoding : EBCDIC

EBCDIC material does not need to be memorized

You may need to decode/encode EBCDIC in an assignment


Character encoded data on IBM mainframe computers is normally based on a scheme called EBCDIC. EBCDIC grew out of IBM's earlier punched-card codes, which preceded ASCII, and it has features that, to be properly understood, require a knowledge of that historical medium.

EBCDIC encoded files normally contain fixed-length records.
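With fixed-length records there are no line terminators to look for; the reader simply cuts the data every N bytes. The sketch below assumes 80-byte records (the width of a punched card) and Python's "cp037" codec as the EBCDIC code page. Both are assumptions for illustration, and the function name is hypothetical.

```python
RECORD_LENGTH = 80   # assumption: card-image records of 80 bytes


def read_records(data: bytes, record_length: int = RECORD_LENGTH,
                 codec: str = "cp037") -> list:
    """Split EBCDIC data into fixed-length records and decode each one."""
    if len(data) % record_length != 0:
        raise ValueError("data is not a whole number of records")
    records = []
    for start in range(0, len(data), record_length):
        record = data[start:start + record_length]
        # Records are padded with EBCDIC spaces; strip them after decoding.
        records.append(record.decode(codec).rstrip(" "))
    return records


# Build two sample records by padding each line to the record length.
sample = b"".join(line.ljust(RECORD_LENGTH).encode("cp037")
                  for line in ["HELLO", "WORLD 42"])
print(read_records(sample))
```

The sample data is 160 bytes long and contains no carriage-return or line-feed bytes at all.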


The Punched Card and Hollerith Codes

EBCDIC Codes (Basic Codes)

Standard EBCDIC Files

EBCDIC vs. ASCII Character Sequences

For a side-by-side comparison, see: http://www.natural-innovations.com/computing/asciiebcdic.html

Q: If you examine an EBCDIC text file copied byte-for-byte onto an ASCII system such as Unix/Linux or DOS/Windows/Macintosh, what will you see on your ASCII screen? (Hints: [1] Do the EBCDIC letters and numbers match any printable 7-bit ASCII characters? [2] Do EBCDIC sentence punctuation and space characters match any printable ASCII characters?)