Section 6.1
Data Representation

Obviously computers deal with numbers. But other types of data are encoded inside a computer's memory and processor circuits: sounds, colors, temperature levels, pictures, you name it. And all of these data are encoded using just 1s and 0s! In this chapter we learn the fundamentals of integer data representation and manipulation.

All digital computers use a variety of coding schemes to encode information, but almost all of these, except perhaps bitmapped pictures, depend upon storing numbers, usually integers. One example of a code that relies on integers is ASCII, the American Standard Code for Information Interchange, which encodes characters. Below is the full ASCII code for values from 0 to 127. The values 128 to 255 are reserved for special graphics characters.

To translate, just replace each character with its code, using either the raw binary bit patterns or the decimal numbers that correspond to those patterns. For example, "Java!" would be:

   J           a           v           a           !
   74          97          118         97          33
   01001010    01100001    01110110    01100001    00100001

ASCII is just one of many codes developed for representing character sets; however, it is so widely used that it is almost unthinkable to build a computer using any other code now. EBCDIC, the Extended Binary Coded Decimal Interchange Code, is another code, used by IBM mainframe computers. Other computer manufacturers had their own codes in the early days; CDC (Control Data Corporation), for example, used a 6-bit code.

A newer code has become quite popular -- Unicode, an internationalized extension of ASCII. All of the ASCII code is contained in Unicode. Since the original Unicode design uses 16 bits per character, the total number of possible characters is 65,536. This allows other languages to include the special symbols of their alphabets.

ASCII, EBCDIC and other codes use a fixed size for their codewords. Today's ASCII uses 8 bits per codeword, although its predecessor used only 7 bits. EBCDIC was always an 8-bit code. A codeword is the smallest chunk of information that is encoded using one of these systems, usually a single character or digit. Codewords are not broken up, nor do they have any internal structure.

Here is the complete ASCII character set for the first 128 characters.

ASCII     Decimal       Binary       ASCII     Decimal         Binary
-----------------------------------------------------------------------
 NUL        0          0000 0000        @          64         0100 0000
 SOH        1          0000 0001        A          65         0100 0001
 STX        2          0000 0010        B          66         0100 0010
 ETX        3          0000 0011        C          67         0100 0011
 EOT        4          0000 0100        D          68         0100 0100
 ENQ        5          0000 0101        E          69         0100 0101
 ACK        6          0000 0110        F          70         0100 0110
 BEL        7          0000 0111        G          71         0100 0111
 BS         8          0000 1000        H          72         0100 1000
 HT         9          0000 1001        I          73         0100 1001
 LF        10          0000 1010        J          74         0100 1010
 VT        11          0000 1011        K          75         0100 1011
 FF        12          0000 1100        L          76         0100 1100
 CR        13          0000 1101        M          77         0100 1101
 SO        14          0000 1110        N          78         0100 1110
 SI        15          0000 1111        O          79         0100 1111
 DLE       16          0001 0000        P          80         0101 0000
 DC1       17          0001 0001        Q          81         0101 0001
 DC2       18          0001 0010        R          82         0101 0010
 DC3       19          0001 0011        S          83         0101 0011
 DC4       20          0001 0100        T          84         0101 0100
 NAK       21          0001 0101        U          85         0101 0101
 SYN       22          0001 0110        V          86         0101 0110
 ETB       23          0001 0111        W          87         0101 0111
 CAN       24          0001 1000        X          88         0101 1000
 EM        25          0001 1001        Y          89         0101 1001
 SUB       26          0001 1010        Z          90         0101 1010
 ESC       27          0001 1011        [          91         0101 1011
 FS        28          0001 1100        \          92         0101 1100
 GS        29          0001 1101        ]          93         0101 1101
 RS        30          0001 1110        ^          94         0101 1110
 US        31          0001 1111        _          95         0101 1111
 SPACE     32          0010 0000        '          96         0110 0000
 !         33          0010 0001        a          97         0110 0001
 "         34          0010 0010        b          98         0110 0010
 #         35          0010 0011        c          99         0110 0011
 $         36          0010 0100        d         100         0110 0100
 %         37          0010 0101        e         101         0110 0101
 &         38          0010 0110        f         102         0110 0110
 '         39          0010 0111        g         103         0110 0111
 (         40          0010 1000        h         104         0110 1000
 )         41          0010 1001        i         105         0110 1001
 *         42          0010 1010        j         106         0110 1010
 +         43          0010 1011        k         107         0110 1011
 ,         44          0010 1100        l         108         0110 1100
 -         45          0010 1101        m         109         0110 1101
 .         46          0010 1110        n         110         0110 1110
 /         47          0010 1111        o         111         0110 1111
 0         48          0011 0000        p         112         0111 0000
 1         49          0011 0001        q         113         0111 0001
 2         50          0011 0010        r         114         0111 0010
 3         51          0011 0011        s         115         0111 0011
 4         52          0011 0100        t         116         0111 0100
 5         53          0011 0101        u         117         0111 0101
 6         54          0011 0110        v         118         0111 0110
 7         55          0011 0111        w         119         0111 0111
 8         56          0011 1000        x         120         0111 1000
 9         57          0011 1001        y         121         0111 1001
 :         58          0011 1010        z         122         0111 1010
 ;         59          0011 1011        {         123         0111 1011
 <         60          0011 1100        |         124         0111 1100
 =         61          0011 1101        }         125         0111 1101
 >         62          0011 1110        ~         126         0111 1110
 ?         63          0011 1111        DEL       127         0111 1111

In today's computers, 8-bit ASCII is used very widely, and since most computer manufacturers treat the 8-bit byte as the smallest addressable unit of memory, bytes and codewords are virtually synonymous. The measure of a piece of text in characters and bytes will be the same. For example, a standard page usually has around 80 characters per line (including blanks which are valid characters), and there are usually 66 lines per page. This gives 80×66 = 5280 characters per page. Since each character requires one byte, this would be 5280 bytes, or about 5.2 Kilobytes. (Remember that a kilobyte is 1024 bytes, so 5280÷1024 gives 5.15625, which, rounded, is about 5.2K.)

There is nothing magical about how ASCII or EBCDIC were developed. Somebody just assigned small integers to the printing symbols, such as "A" or "%". A logical choice would have been to assign 1 to A, 2 to B, and so forth through 26 for Z, then 27 onward to the other characters that are not in the alphabet but are nonetheless important, such as ?, * and @.

Before the 8-bit byte became standard, there was no consensus on how large codewords should be. Using only the capital letters of the English alphabet and the digits, we get 26+10 = 36 codewords. Since 2^5 = 32 patterns are too few, 6 bits would be needed anyway, so the remaining unused 6-bit patterns might as well be made useful. Various punctuation marks were assigned, such as ?, @, ", ', etc. 2^6 is 64, which means there would be 28 additional symbols in the coded alphabet (64-36 = 28). This is still not enough to include all the punctuation symbols commonly found on modern keyboards, so some things were left out, sometimes with amusing consequences. For example, IBM 026 keypunch machines had no < or > signs, so FORTRAN, developed in 1957, used .LT. and .GT., and still does to this day.

Expanding the code to 7 bits yielded 128 combinations. Now both uppercase and lowercase letters could be used, a big leap forward! Moreover, special control characters crept into the system. One of these is kind of obvious but strange nonetheless: the space, or blank. It is the only character code which instructs printers and monitors not to display anything! Other control characters, such as STX, EOT, BEL, and NUL, had meanings specific to terminals and modems. When these I/O devices received these characters from the computer, they did various things, such as sound the "bell" (make a beep) or go into reverse video display mode. Later, networking systems used these same characters to signal the beginning and end of a transmission, or to disconnect. ASCII and EBCDIC both have a plethora of these non-printing characters, and they can cause problems when inadvertently sent to a modem or a printer, making it hang up or print weird junk.

Though the assignment of characters to their codes could be totally arbitrary, certain assignments make it much easier to program certain things. For instance, all codes use an alphabetic ordering on letters. That is, the code for B is always numerically greater than that for A, the code for C is greater than that for B, and so forth. This makes it easier to sort on names and addresses than if the codes were arbitrary. However, which comes first in the code, lowercase or uppercase letters? And what is the ordering of the punctuation? There is no obvious preferred ordering, so the codes diverge on this. They do, however, impose an ordering on the printable digits that mirrors their numerical relationships, so '1' is greater than '0', and so forth. Moreover, the code for '1' is exactly 1 greater than that for '0'. In ASCII, '0' is 48, '1' is 49, '2' is 50, and so on.

An oddity exists in the EBCDIC code -- there is a break in the letter codes between 'i' and 'j' and again between 'r' and 's'. A similar break happens in the upper case region. Though the ordering is still in effect, the code for 'j' is not just 1 greater than that for 'i', and there are codes assigned to punctuation in between! This makes programming a bit inconvenient. For example, a C programmer can't just say:

if (ch >= 'a' && ch <= 'z') { ....

and expect it to work on an IBM mainframe, because there are other codes in between some of the letters. But like all seemingly irrational features of the world, this one harks back to the way the letter codes were arranged on old IBM punch cards. A punch card had only 12 rows: 10 for the digits plus 2 "zone" rows (with the 0 row doing double duty as a third zone). When the arrangement of punched holes was translated into binary, the numbers came out noncontiguous -- there were gaps in between -- and so the odd gaps in the EBCDIC code exist so that the code could stay compatible with ancient keypunches and cards!