Character Encoding / Line Ends

Computers use binary bit patterns to represent, not only numbers, but also characters. A text file contains binary bit patterns that map to printable characters according to some mapping table. While the binary bit pattern all-zeroes, 00000000, usually represents an integer zero, a printable zero character, i.e. '0', usually is not encoded in a text file using the zero bit pattern. (For example, ASCII uses the 7-bit pattern 0x30 to encode a printable zero digit. EBCDIC uses the 8-bit pattern 0xF0. These are not the same bit patterns used to represent the integer value zero.)

Originally, here in North America, the bit patterns being used for characters only handled English. English only needed a 7-bit or 8-bit character size to include all the letters, digits, and common punctuation, so these character bit patterns all fit nicely into one computer "byte".

The mapping of bit pattern to printable character has gotten complex over the past decades due to the introduction of more and more different mappings to include more and more of the world's languages, current and past. (Not all the world speaks English!) For many years, the field was reluctant to break the rule "one-character, one-byte", so many mutually incompatible 8-bit character mappings were developed to handle different languages in different parts of the world. The same 8-bit pattern might map to one printable character in Norway, and a different character in France or Greece. Creating a file containing both French and Greek characters was impossible.

In 1991, a Universal Character set Unicode was introduced, using a 16-bit (two-byte) character format, with later updates to permit extensions to 32-bit characters as needed. This 16-bit character set broke the "one-character, one-byte" rule. It was incompatible with all one-byte character systems used to date, and thus rendered much text manipulation software (sorting, indexing, etc.) useless.

In 1993, some Americans introduced an UTF-8, an 8-bit version of Unicode that was backwards-compatible with ASCII, and thus with English. If you didn't use any non-English characters in your file, the file format was plain 7-bit ASCII. Only when you needed a foreign Unicode character did you have to resort to some 8-bit encoding sequences. Most software that expected ASCII could handle UTF-8 equally well. UTF-8 has become very popular in North America, since it treats ASCII as ASCII with no complications.

Character encodings:

ASCII - 1963 - 7 bit - English only
EBCDIC - 1963 - 8 bit - IBM Mainframe (e.g. Cobol) - English only
Latin-1 - 1985 - 8 bit - Western European (includes ASCII as a subset)
Unicode - 1991 - 16-bit - Universal - all languages
UTF-8 - 1993 - 8-bit version of Unicode that is backwards-compatible with 7-bit ASCII

Character Encoding : ASCII

Students should know from memory the basic layout of the letters and digits in the 7-bit ASCII character encoding table. What region of the table contains unprintable control characters? What is the ASCII value of a space? the letters "a" and "A"? What is the lowest standard-ASCII (7-bit) character's name and bit pattern? What is the highest standard-ASCII (7-bit) character's name and bit pattern?

The American Standard Code for Information Interchange (ASCII) coding scheme was developed as a 7-bit code. A 7-bit code provides enough different bit patterns (128) to permit a coding scheme for all the upper- and lower-case characters found on a standard English language keyboard, plus punctuation and some unprintable device control characters (e.g. Newline, Carriage Return, Bell, etc.).

Seven-bit ASCII encoding is normally used in 8-bit bytes with the top (leftmost) bit set to zero. Some extended encodings based on ASCII use the top bit set to include an additional 128 characters, e.g. ISO-8859-1 (Latin-1) is an 8-bit standard that includes the accented letters needed for Western European languages (including French).

Before the development of standards for extended-ASCII encodings, each manufacturer of computer equipment used different incompatible choices for what the extended characters represented. Files written on one machine don't display properly on another.

Minimal Sizes for Codes Representing Characters

Bytes
A byte is the collection of bits used by a particular computer for the most common character encoding scheme used by that computer. The most common byte size is 8 bits, but 6, 7, 9, and 12-bit bytes are used by some (different) computer systems.
How Many Characters Are There?
- 10 decimal digit characters
- 26 letters (or 52 if different codes required for upper and lower case)
- between 10 and 30 "special" characters
- between 2 and 30 non-display "control" codes
- a minimal scheme would require at least 48 codes
- a complete system would require at least 122 codes)
- a minimal scheme could be handled with a 6-bit code, since 6-bits provides 64 different patterns
- a "complete" scheme would require (at least) a 7-bit code; 7-bits give 128 different patterns

The Major ASCII Codes and Rules

7-Bit Standard - The standard ASCII encoding scheme is defined only for 7-bit values. For computer systems using a larger than 7-bit byte, high order bits must be zero for characters defined by the ASCII standard. (The high order bits may be used to provide additional extended "special" characters, but these are not part of the ASCII standard and may not display correctly on all computers.)
Blank / Space - 20h; codes below this value are used only for non-display, control codes
Decimal Digits - 30h to 39h for "0" to "9" respectively
Upper-case Alphabetic - 41h to 5Ah for "A" to "Z" respectively
Lower-case Alphabetic - calculated as the value of a space greater than the corresponding upper-case code i.e. 61h to 7Ah for "a" to "z" respectively
Special Display Characters - in the "gaps" between 20h and 7Fh not assigned to digits or letters
Control Characters - 00h to 1Fh; includes: 0Dh, the "carriage return" , and 0Ah, the "line feed"

The Full ASCII Table

Most ASCII encoding/decoding can be performed without tables by knowing a few base codes: the blank, the letter "A", the digit "0", the carriage-return, and the line-feed. The rest of the letters and digits can be figured out from these base codes.

Low\Hi Nybbles	0	1	2	3	4	5	6	7
0	NUL	DLE	SPACE	0	@	P	`	p
1	SOH	DC1	!	1	A	Q	a	q
2	STX	DC2	"	2	B	R	b	r
3	ETX	DC3	#	3	C	S	c	s
4	EOT	DC4	$	4	D	T	d	t
5	ENQ	NAK	%	5	E	U	e	u
6	ACK	SYN	&	6	F	V	f	v
7	BEL	ETB	'	7	G	W	g	w
8	BS	CAN	(	8	H	X	h	x
9	HT	EM	)	9	I	Y	i	y
A	LF	SUB	*	:	J	Z	j	z
B	VT	ESC	+	;	K	[	k	{
C	FF	FS	,	<	L	\	l	\|
D	CR	GS	-	=	M	]	m	}
E	SO	RS	.	>	N	^	n	~
F	SI	US	/	?	O	_	o	DEL

ASCII Files

Line Terminated Files (vs. Fixed-length and Run-length)
- Standard ASCII files are "line terminated" files; that is, the lines (or "records") that make up a standard ASCII file can be separated from each other by means of one or more special control codes appended to each line.
- Fixed-length encoded files have records which are all the same length; for character files, this means that short lines need to be "padded" to the full line or "record" length with additional blank characters, and long lines need to be truncated or "wrapped around" into another line; nothing in the actual file indicates how the file is to be divided into lines or records.
- Run-length encoded files have character-count fields at the beginning of each line (or "record").
Unix vs. MS-DOS Line Terminators - both Unix and MS-DOS (and Windows) make use of ASCII encoded files; however, the standard used for line termination is slightly different. For Unix, lines are terminated with a single "line feed" (0Ah) code. For MS-DOS (and Windows), lines are terminated with a "carriage return" (0Dh) and "line feed" (0Ah) pair of codes.
Why it matters - If you use a package like FTP to download a web-based "Text" (i.e. ASCII) file (where Unix is the assumed operating system) to an MS-DOS based system, the FTP process will automatically scan the file and replace each "line feed" character with a "carriage return"-"line feed" pair. If the file originated from an MS-DOS based system (and was uploaded to the web), then it already has a "carriage return" for each line. As a result, the downloaded file would now have two "carriage return" codes and a "line feed" code; for many editors this will cause the file to appear to be double spaced. (One way around this is to transfer your files as "Binary" instead of "Text" so FTP does not attempt any character expansion).

Summary of line terminators

ASCII-encoded files are usually composed of variable length lines of characters. Each line is terminated with one or more unprintable characters. The exact character or characters used at the end of each line depends on what computer system you are using.

a single carriage-return character:: Used to end lines on Apple Macintosh systems.
a single line-feed character:: Used to end lines on Unix/Linux systems.
a carriage-return followed by a line-feed:: Used to end lines on Microsoft MS-DOS and Windows systems.

When using a file-transfer program to move text (not binary) files between machines, you must know the consequences of incompatible line-end terminators.

Example of ASCII File Decoding

Hexadecimal Dump

41 20 73 69 6D 70 6C 65 0D 0A 33 20 6C 69 6E 65
0D 0A 66 69 6C 65 0D 0A

Decoded Text
```
A simple
3 line
file
```
Since this file contains CR+LF sequences (0D+0A), it is likely a DOS/Windows file.

Basic Character Encoding : EBCDIC

EBCDIC material does not need to be memorized

You may need to decode/encode EBCDIC in an assignment

Character encoded data on IBM mainframe computers is normally based on a scheme called EBCDIC. The EBCDIC character encoding preceded the ASCII encoding. EBCDIC was developed from a basis that involved the computer punched card and has features that, to be properly understood, require a knowledge of that historical medium.

EBCDIC encoded files normally contain fixed-length records.

The Punched Card and Hollerith Codes

Numeric Requirements: Digits and Signs
- Initially data processing was limited to numbers
- 10 digit symbols were required (0 to 9)
- 2 possible "signs" were required (+ or -)
Punched Card Structure
- 80 columns wide
- each column could "hold" one of the numeric symbols required as a hole punched in the column, different distances from the top of the card for each symbol; there were 12 possible punch locations in each column resulting in 12 rows of possible punch locations across the entire card
- a punch in the top-most row of a column represented a plus sign (+)
- a punch in the second row from the top represented a minus sign (-)
- a punch in the third row from the top represented a zero (0) and so on from there until the bottom row was used to represent a nine (9)
- multi-digit values were normally punched with the sign of the number and the last digit of the number punched in the same column; this reduced the number of columns required for numbers and could be used as a marker for the end of a numeric value, separating contiguous values on the card
Alphabetic Requirements: Digits and Zones
- using two punched holes in a single column (one in the top 3 rows, called the "zone" rows, and one in the bottom 9 rows, called the "digit" rows) provided enough codes to support a 26 letter alphabet (with an "extra" punch combination left over)
- a punch in the (top) plus(+) zone row and a punch in one of the 1 to 9 rows was used to represent the (upper case) letters from "A" to "I" respectively
- a punch in the minus(-) zone row and a punch in one of the 1 to 9 rows was used to represent the (upper case) letters from "J" to "R" respectively
- a punch in the zero(0) zone row and a punch in one of the 2 to 9 rows was used to represent the (upper case) letters from "S" to "Z" respectively; notice the zero-one punch combination was not used (perhaps because the card tended to tear if you had two holes that close together)

EBCDIC Codes (Basic Codes)

The Space / Blank
- hexadecimal code : 40h (twice as good as ASCII?)
Digit Codes
- hexadecimal codes : F0h (for "0") to F9h (for "9")
Alphabetic Codes
- the zone punch was encoded in the first hexadecimal digit of a byte with Ch being used for the plus-punch, Dh being used for the minus-punch, and Eh being used for the zero-punch
- the digit punch was encoded in the second hexadecimal digit of the byte using the decimal value of the punch location
- "A" to "I" were encoded as (hexadecimal) C1h to C9h
- "J" to "R" were encoded as (hexadecimal) D1h to D9h
- "S" to "Z" were encoded as (hexadecimal) E2h to E9h
- notice this produces an encoding scheme with "gaps" in the code values that are not used for alphabetic characters (namely CAh to D0h and DAh to E1h inclusive)
- lower case letters were defined as the corresponding upper case letter minus the code value for a space (similar to, but in fact the reverse of, the ASCII system)

Standard EBCDIC Files

Fixed-length Records
- standard EBCDIC encoded files are composed of fixed length records (with 80 characters, just like a punched card, still being the most common length)
- short lines are "padded" on the right with blanks to fill out to the fixed length for a specific file (again, this is most often 80 characters)
- lines longer than the file's fixed record size must be split to form (at least) two lines (records)
- "carriage return" and "line feed" codes are not used
Record Length Information
- the record length used when the EBCDIC file was created is saved (on an IBM mainframe system) in the VTOC, Volume Table Of Contents, entry for the file (the VTOC is equivalent to an MS-DOS directory); there is no indication of the record size anywhere in the actual file itself.
- it is the programmer's responsibility to code programs so that they use or ask for the proper record length; if a file were created with 80 character EBCDIC records and the programmer's code read 60 character records, the first "read" would get the first 60 characters of the first record, the second "read" would get the last 20 characters of the first record followed by the first 40 characters of the second record, the third "read" would get the last 40 characters of the second record plus the first 20 characters of the third record, and so on...

EBCDIC vs. ASCII Character Sequences

in EBCDIC: lower-case letters precede (are less than) upper-case letters which, in turn, precede digits
in ASCII: digits precede (are less than) upper-case letters which, in turn, precede lower-case letters
For example, if telephone directory listings were created from two different systems, an ASCII-based system and an EBCDIC-based system, certain company names would occur in different places in the listing, if the listing were sorted by numeric character code:
1. 1-for-All Rentals
2. A1 Movers
3. alpha-1 Insurance
1. alpha-1 Insurance
2. A1 Movers
3. 1-for-All Rentals

For a side-by-side comparison, see: http://www.natural-innovations.com/computing/asciiebcdic.html

Q: If you examine an EBCDIC text file copied byte-for-byte onto an ASCII system such as Unix/Linux or DOS/Windows/Macintosh, what will you see on your ASCII screen? (Hints: [1] Do the EBCDIC letters and numbers match any printable 7-bit ASCII characters? [2] Do EBCDIC sentence punctuation and space characters match any printable ASCII characters?)

Comments on this Document

From: ib1 at teksavvy dot com
To: idallen@idallen.ca
Subject: 120_CharacterEncoding
Date: 2010-10-16T15:21:27Z

I happened (goofing off, y'know) to wander into
   teaching.idallen.com/dat2343/10f/notes/120_CharacterEncoding.html and
   I saw some errors and oddities on that page.
In the "Character encodings:" overview table, EBCDIC is described as
   "English only".  EBCDIC was and is routinely used with
   more-than-ASCII character sets, using the extra code points for
   non-English letters, APL, line-drawing, etc.  (IBM majorly messed up
   by not imposing cross-company standards for assigning characters
   beyond the basic (ASCII-equivalent) code points, so customers had
   trouble when porting documents between various World Trade divisions,
   and in/out of Domestic.)
In the paragraph starting "Before the development of standards", you say
   that "each manufacturer of computer equipment used different
   incompatible choices".  The situation was actually worse than you
   say, since there were situations where one manufacturer had multiple
   conventions for mapping characters to code points, even within one
   device.
The discussion of text file transfer via FTP conflicts with my
   observations and perception.  If transfers of text files are always
   done in "text" mode, then any combination of FTP programs should do
   the right thing, yielding an instance of the text file with line
   terminators that are correct for the local system.  Trouble happens
   when someone transfers a file in "binary" mode between incompatible
   systems.  (I'm not denying the possibility of a server brokenly
   serving badly-formed data--just saying that "use binary" is
   ill-advised in general, since that's the cause of incompatibility
   problems.)
I don't have statistics on the distribution of the several record
   formats.  However, most of the system files (except assembler and JCL
   source libraries) are undefined (or variable), not fixed.
I don't think the speculation about avoiding the 0-1 punch to protect
   against tearing the card is well founded.  0-1 is the slash
   character, which was used early, long before EBCDIC was defined.  In
   the EBCDIC era, punches in adjacent rows were common--for example,
   object decks can be represented directly on cards; they are heavy
   with 0x00 characters, and 0x00 was punched 12-0-9-8-1.
The heading "Standard EBCDIC Files" and the following paragraph repeat
   the misleading "normally contain fixed-length records"
   assertion--what standard do you refer to?
The "must be split" paragraph is odd.  One would ordinarily define a
   record length suitable to contain all the data fields of a record.
   One would not define a too-small record length that would then
   require splitting a logical record's fields across several physical
   records.
The paragraphs under heading "Record Length Information" use "EBCDIC
   file" to mean "fixed-length-record file", which is consistent with
   the preceding paragraphs, but now even more misleading, since the
   non-fixed-length formats *do* have block structure / record length
   information embedded in the file.
As long as "it is the programmer's responsibility" is there to imply
   complication, you might go on to say that, if the programmer says
   nothing in his program and in his JCL about record characteristics,
   the system will merge in that information from the DSCB (file label),
   and the access methods will handle blocking transparently to the
   user's program.  So straightforward cases are simple.  The programmer
   would have to go out of his way to create the "read 60 character
   records" error.
Sorry to run on so...it was the non-comma in the ASCII chart that got me
   started.  :)

-- 
ib1 at teksavvy dot com  RSA/2048/476766B1 (E5A329D8 DC15385D 79B174E2 9BAB4638)