Obviously computers deal with numbers. But other types of data are encoded inside a computer's memory and processor circuits: sounds, colors, temperature levels, pictures, you name it. And all of these data are encoded using just 1s and 0s! In this chapter we learn the fundamentals of integer data representation and manipulation.
All digital computers use a variety of coding schemes to encode information, but almost all of them, except perhaps bitmapped pictures, depend upon storing numbers, usually integers. One example of a code that relies on integers is ASCII, the American Standard Code for Information Interchange, which encodes characters. The table later in this section gives the full ASCII code for values from 0 to 127. The values 128 to 255 are reserved for special graphics characters.
To translate text into its coded form, just replace each character with its code, written either as the raw binary bit pattern or as the decimal number that corresponds to that pattern. For example, "Java!" would be:
    J          a          v          a          !
    74         97         118        97         33
    01001010   01100001   01110110   01100001   00100001
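A minimal sketch of this translation in C (the helper function print_binary is ours, for illustration): the program walks the string and prints each character's decimal and binary ASCII code.

    #include <stdio.h>

    /* Print one character code as 8 binary digits. */
    static void print_binary(unsigned char c)
    {
        for (int bit = 7; bit >= 0; bit--)
            putchar(((c >> bit) & 1) ? '1' : '0');
    }

    int main(void)
    {
        const char *text = "Java!";   /* the example string from above */

        for (const char *p = text; *p != '\0'; p++) {
            printf("'%c' = %3d = ", *p, *p);
            print_binary((unsigned char)*p);
            putchar('\n');
        }
        return 0;
    }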
ASCII is just one of many codes developed for representing character sets; however, it is so widely used that it is now almost unthinkable to build a computer around any other code. EBCDIC, the Extended Binary Coded Decimal Interchange Code, is another; IBM mainframe computers use it. Other computer manufacturers had their own codes in the early days; CDC (Control Data Corporation), for example, had a 6-bit code.
A newer code is becoming quite popular -- Unicode, an internationalized extension of ASCII. All of the ASCII code is contained within Unicode. Since there are 16 bits in a single Unicode character, the total number of possible characters is 65,536, which leaves room for the special symbols of other languages' alphabets.
ASCII, EBCDIC, and other such codes use a fixed size for their codewords. Today's ASCII uses 8 bits per codeword, although its predecessor used only 7; EBCDIC was always an 8-bit code. A codeword is the smallest chunk of information that is encoded under one of these systems, usually a single character or digit. Codewords are not broken up, nor do they have any internal structure.
Here is the complete ASCII character set for the first 128 characters.
    ASCII    Decimal   Binary         ASCII    Decimal   Binary
    -------------------------------------------------------------
    NUL         0      0000 0000      @           64     0100 0000
    SOH         1      0000 0001      A           65     0100 0001
    STX         2      0000 0010      B           66     0100 0010
    ETX         3      0000 0011      C           67     0100 0011
    EOT         4      0000 0100      D           68     0100 0100
    ENQ         5      0000 0101      E           69     0100 0101
    ACK         6      0000 0110      F           70     0100 0110
    BEL         7      0000 0111      G           71     0100 0111
    BS          8      0000 1000      H           72     0100 1000
    HT          9      0000 1001      I           73     0100 1001
    LF         10      0000 1010      J           74     0100 1010
    VT         11      0000 1011      K           75     0100 1011
    FF         12      0000 1100      L           76     0100 1100
    CR         13      0000 1101      M           77     0100 1101
    SO         14      0000 1110      N           78     0100 1110
    SI         15      0000 1111      O           79     0100 1111
    DLE        16      0001 0000      P           80     0101 0000
    DC1        17      0001 0001      Q           81     0101 0001
    DC2        18      0001 0010      R           82     0101 0010
    DC3        19      0001 0011      S           83     0101 0011
    DC4        20      0001 0100      T           84     0101 0100
    NAK        21      0001 0101      U           85     0101 0101
    SYN        22      0001 0110      V           86     0101 0110
    ETB        23      0001 0111      W           87     0101 0111
    CAN        24      0001 1000      X           88     0101 1000
    EM         25      0001 1001      Y           89     0101 1001
    SUB        26      0001 1010      Z           90     0101 1010
    ESC        27      0001 1011      [           91     0101 1011
    FS         28      0001 1100      \           92     0101 1100
    GS         29      0001 1101      ]           93     0101 1101
    RS         30      0001 1110      ^           94     0101 1110
    US         31      0001 1111      _           95     0101 1111
    SPACE      32      0010 0000      `           96     0110 0000
    !          33      0010 0001      a           97     0110 0001
    "          34      0010 0010      b           98     0110 0010
    #          35      0010 0011      c           99     0110 0011
    $          36      0010 0100      d          100     0110 0100
    %          37      0010 0101      e          101     0110 0101
    &          38      0010 0110      f          102     0110 0110
    '          39      0010 0111      g          103     0110 0111
    (          40      0010 1000      h          104     0110 1000
    )          41      0010 1001      i          105     0110 1001
    *          42      0010 1010      j          106     0110 1010
    +          43      0010 1011      k          107     0110 1011
    ,          44      0010 1100      l          108     0110 1100
    -          45      0010 1101      m          109     0110 1101
    .          46      0010 1110      n          110     0110 1110
    /          47      0010 1111      o          111     0110 1111
    0          48      0011 0000      p          112     0111 0000
    1          49      0011 0001      q          113     0111 0001
    2          50      0011 0010      r          114     0111 0010
    3          51      0011 0011      s          115     0111 0011
    4          52      0011 0100      t          116     0111 0100
    5          53      0011 0101      u          117     0111 0101
    6          54      0011 0110      v          118     0111 0110
    7          55      0011 0111      w          119     0111 0111
    8          56      0011 1000      x          120     0111 1000
    9          57      0011 1001      y          121     0111 1001
    :          58      0011 1010      z          122     0111 1010
    ;          59      0011 1011      {          123     0111 1011
    <          60      0011 1100      |          124     0111 1100
    =          61      0011 1101      }          125     0111 1101
    >          62      0011 1110      ~          126     0111 1110
    ?          63      0011 1111      DEL        127     0111 1111
In today's computers, 8-bit ASCII is used very widely, and since most computer manufacturers treat the 8-bit byte as the smallest addressable unit of memory, bytes and codewords are virtually synonymous: a piece of text measures the same in characters as in bytes. For example, a standard page has around 80 characters per line (including blanks, which are valid characters) and usually 66 lines per page. This gives 80×66 = 5280 characters per page. Since each character requires one byte, this is 5280 bytes, or about 5.2 kilobytes. (Remember that a kilobyte is 1024 bytes, so 5280÷1024 gives 5.15625, which rounds to about 5.2K.)
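The arithmetic is easy to check in a few lines of C, using the page dimensions assumed above:

    #include <stdio.h>

    int main(void)
    {
        int chars_per_line = 80;       /* includes blanks */
        int lines_per_page = 66;
        int bytes = chars_per_line * lines_per_page;  /* one byte per character */

        printf("%d bytes = %.5f K\n", bytes, bytes / 1024.0);  /* 5280 bytes = 5.15625 K */
        return 0;
    }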
There is nothing magical about how ASCII or EBCDIC were developed. Somebody just assigned small integers to the printing symbols, such as "A" or "%". A logical choice would have been to assign 1 to A, 2 to B, and so forth, then 27 and up to the characters that are not in the alphabet but are nonetheless important, such as ?, * and @.
Before the 8-bit byte became standard, there was no consensus on how large codewords should be. Using only the capital letters of the English alphabet and the digits, we get 26+10 = 36 codewords. Since 36 symbols do not fit into 5 bits (2^5 = 32), 6 bits are needed anyway, so the remaining unused 6-bit patterns might as well be made useful. Various punctuation marks were assigned, such as ?, @, ", and '. 2^6 is 64, which means there was room for 28 additional symbols in the coded alphabet (64-36 = 28). This is still not enough to include all the punctuation symbols commonly found on modern keyboards, so some things were left out, with occasionally hilarious consequences. For example, IBM 026 keypunch machines had no < or > signs, so FORTRAN, developed in 1957, used .LT. and .GT. instead, and those forms survive to this day.
Expanding the code to 7 bits yielded 128 combinations. Now both uppercase and lowercase letters could be included, a big leap forward! Moreover, special control characters crept into the system. One of these is kind of obvious but strange nonetheless: the space, or blank. It is the only character code that instructs printers and monitors to display nothing! Other control characters, such as STX, EOT, BEL, and NUL, had meanings specific to terminals and modems. When these I/O devices received such characters from the computer, they did various things, such as sound the "bell" (make a beep) or go into reverse video display mode. Later, networking systems used these same characters to signal the beginning and end of a transmission, or to disconnect. ASCII and EBCDIC both have a plethora of these non-printing characters, and they can cause problems when inadvertently sent to a modem or printer, making it hang up or print weird junk.
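A few of these control characters survive as escape sequences in C. Here is a tiny sketch that sends BEL (code 7) to the terminal, which on most terminals still produces a beep:

    #include <stdio.h>

    int main(void)
    {
        /* '\a' is the C escape for BEL (ASCII 7); printing it beeps
           on most terminals instead of displaying a glyph. */
        printf("BEL is code %d\a\n", '\a');
        return 0;
    }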
Though the assignment of characters to codes could be totally arbitrary, some assignments make programming much easier. For instance, all codes use an alphabetic ordering on letters: the code for B is numerically greater than that for A, the code for C is greater than that for B, and so forth. This makes sorting names and addresses easier than it would be if the codes were arbitrary. However, which comes first in the code, lowercase or uppercase letters? And what is the ordering of the punctuation? There is no obvious preferred ordering, so the codes diverge on this. They do, however, impose an ordering on the printable digits that mirrors their numerical relationships, so '1' is greater than '0', and so forth. Moreover, the code for '1' is exactly 1 greater than that for '0'. In ASCII '0' is 48, '1' is 49, '2' is 50, and so on.
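This contiguity is what lets a program convert a digit character to its numeric value with a single subtraction, a standard C idiom:

    #include <stdio.h>

    int main(void)
    {
        char ch = '7';            /* ASCII code 55 */
        int value = ch - '0';     /* 55 - 48 = 7; works in any code with contiguous digits */

        printf("'%c' is code %d and has numeric value %d\n", ch, ch, value);
        return 0;
    }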
An oddity exists in the EBCDIC code -- there is a break in the letter codes between 'i' and 'j' and again between 'r' and 's'. A similar break happens in the uppercase region. Though the ordering is still in effect, the code for 'j' is not just 1 greater than that for 'i': there are other codes assigned in between! This makes programming a bit inconvenient. For example, a C programmer can't just say:
    if (ch >= 'a' && ch <= 'z') { ...
and expect it to work on an IBM mainframe, because other codes fall in between some of the letters. But like all seemingly irrational features of the world, this one harkens back to the way the letter codes were arranged on old IBM punch cards. A punch card had 12 rows: rows 0 through 9 for the digits, plus two zone rows whose punches combined with digit punches to encode letters and special characters. When the arrangement of punched holes was translated into binary, the numbers came out noncontiguous -- there were gaps in between -- and so the odd gaps in the EBCDIC code exist so that the code would remain compatible with ancient keypunches and cards!
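The portable fix is to let the standard library do the classification: islower() from <ctype.h> is defined by each C implementation in terms of its own character set, so it gives the right answer under both ASCII and EBCDIC.

    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        int ch = 'j';

        /* islower() knows the implementation's character set,
           so it skips the EBCDIC gaps correctly. */
        if (islower(ch))
            printf("'%c' is a lowercase letter\n", ch);
        return 0;
    }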