As mentioned previously, many I/O transfers proceed in blocks: whole chunks of data are read in from the device and stored at contiguous memory addresses. If interrupt-driven I/O is used, each byte causes an interrupt. Every interrupt makes the CPU execute an interrupt handler routine, which may be short or long in terms of the number of instructions executed. But even if the handler routine is short, each interrupt still costs the CPU a number of instructions.
Deferring the moment the CPU is bothered until the very end of the block transfer makes much better use of the CPU: it runs the interrupt handler routine once, when the block transfer is done, rather than once for every byte or word within the block. Devices that make this happen smoothly are called DMA controllers. DMA stands for Direct Memory Access, so called because the controller can access real memory directly, without having to go through the CPU.
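The saving is easy to quantify with a toy cost model. The handler length and block size below are hypothetical numbers chosen only for illustration; the point is that per-byte interrupts multiply the handler cost by the block size, while DMA pays it once.

```python
# Hypothetical cost model: every interrupt costs a fixed number of
# handler instructions, regardless of how little work is done per byte.
HANDLER_INSTRUCTIONS = 20   # assumed handler length, for illustration only

def interrupt_overhead(block_bytes, per_byte):
    """Instructions spent in interrupt handlers for one block transfer."""
    interrupts = block_bytes if per_byte else 1   # one per byte, or one total
    return interrupts * HANDLER_INSTRUCTIONS

block = 4096
print(interrupt_overhead(block, per_byte=True))   # 81920 instructions
print(interrupt_overhead(block, per_byte=False))  # 20 instructions
```

For a 4 KB block, the interrupt-per-byte scheme burns thousands of times more handler instructions than a single end-of-block interrupt.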
Fig. 1 shows the placement of the DMA controller on the main system bus, along with its associated peripheral. The DMA controller and the peripheral are not always separate devices, although they may be. Some DMA controllers can multiplex many peripherals at once, although high-speed peripherals such as video monitors and hard disk drives require dedicated DMA controllers. Another simplification in Fig. 1 is that there is usually still a peripheral controller interposed between the raw device and the various buses.
Fig. 1 shows that the controller is fully connected to the system bus, which is sometimes called the memory bus to distinguish it from other specialized buses, which we will see later. However, the peripheral is connected only to the data bus. Special wires that run between the DMA controller and the peripheral (actually its controller) are used to tell the peripheral when it can gate its data onto the data bus.
In Fig. 2 the internals of the DMA controller are shown. Several registers keep track of the progress of a block transfer, including a word count register and an address register.
The way this works is that the CPU grabs control of the memory bus (using arbitration) and writes a command into the DMA controller's command register: it puts the value on the data bus and, on the address bus, the memory address to which the command register is mapped. The usual device-select decoder, not shown in Fig. 2, sees that this write is meant for the DMA controller and copies the value into the appropriate register. The CPU also writes a word count and a starting address. Then it issues a start command.
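The setup sequence can be sketched as a few memory-mapped writes. The register addresses and command bits below are invented for illustration; real controllers define their own memory maps. The `bus_write` method plays the role of the device-select decoder latching the data-bus value into the selected register.

```python
# Hypothetical memory map: the DMA controller's registers are assumed
# to appear at these bus addresses (real hardware differs).
DMA_COMMAND = 0xFF00   # command register
DMA_COUNT   = 0xFF04   # word count register
DMA_ADDR    = 0xFF08   # starting memory address register

CMD_DEV_TO_MEM = 0x01  # transfer direction: peripheral -> memory
CMD_START      = 0x80  # begin the transfer

class DmaRegisters:
    """Stand-in for the device-select decoder plus register file."""
    def __init__(self):
        self.regs = {}

    def bus_write(self, address, value):
        # The decoder recognizes the address as one of its own and
        # copies the data-bus value into the selected register.
        self.regs[address] = value

dma = DmaRegisters()
dma.bus_write(DMA_COMMAND, CMD_DEV_TO_MEM)  # what to do
dma.bus_write(DMA_COUNT, 512)               # how many words
dma.bus_write(DMA_ADDR, 0x2000)             # where in memory
dma.bus_write(DMA_COMMAND, CMD_START)       # go
```

Notice that the CPU's only involvement is these few writes; from here on the transfer proceeds without it.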
The DMA controller is not smart. In fact, it really "runs" only one program, which is in essence a for loop that performs a block copy. When it starts, it grabs control of the bus and gates the value in its address register onto the address bus. Then it issues a "go ahead" signal to the peripheral, which gates its data value directly onto the data bus. If the transfer is from peripheral to memory, the DMA controller asserts WR on the control bus to tell memory to write the value on the data bus into the addressed word. If the transfer is the other way, from memory to the device, as it would be with output devices such as monitors, printers, and speakers, the controller asserts RD instead; memory, seeing RD asserted, fetches the addressed word and puts it on the data bus. The peripheral then absorbs the value from the data bus into its internal circuitry and copies it to the appropriate hardware.
When the peripheral is done, it asserts a signal directly to the DMA controller, such as "data ready." The DMA controller, in the meantime, has decremented the word count register and compared it to 0. If it is not 0, the controller increments the address register, gates that onto the address bus, and the process repeats. If the word count is 0, the DMA controller issues an interrupt to the CPU to alert it that the transfer is done, and then puts itself into a quiescent mode, awaiting future commands.
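The controller's one "program" can be sketched as the loop below, for the device-to-memory direction. The names are illustrative: `next_word()` stands in for the peripheral gating a value onto the data bus, the assignment into `memory[]` stands in for asserting WR, and the return value stands in for the end-of-transfer interrupt.

```python
def dma_block_transfer(memory, start_addr, words, next_word):
    """The DMA controller's only 'program': copy one block from the
    peripheral into contiguous memory, then interrupt the CPU."""
    addr = start_addr          # address register
    count = words              # word count register
    while True:
        memory[addr] = next_word()   # peripheral gates data; WR asserted
        count -= 1                   # decrement the word count register
        if count == 0:
            return "interrupt"       # alert the CPU: transfer done
        addr += 1                    # increment the address register

memory = [0] * 16
data = iter([10, 20, 30, 40])
result = dma_block_transfer(memory, 4, 4, lambda: next(data))
# memory[4:8] now holds [10, 20, 30, 40], and result == "interrupt"
```

The loop touches the CPU exactly once, at the end, which is the whole point of the scheme.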
Just to clarify the difference between this method and the previous one: in DMA transfers the peripheral communicates directly with main memory, bypassing the CPU entirely. In the previous scheme, the peripheral communicated only with the CPU, which then had to issue a separate memory command.
In order for DMA to work well, the CPU must have something else to do while the transfer is occurring, such as working on another user's program or performing system maintenance. However, if the CPU is executing instructions at the same time a DMA transfer is taking place, it will be contending with the peripheral for the system bus. Every CPU instruction must be fetched from memory, and so must its operands; results must then be stored back to memory. It seems as though the memory and the main system bus will become a bottleneck and DMA will not work.
However, peripherals are often much slower than the CPU, so they rarely saturate the main bus. If too many peripherals hang off the bus and all are simultaneously active, there can be problems, but this seldom happens on personal computers, and mainframes use much faster buses, often more than one. Some high-speed peripherals, such as fast hard disk drives, need to grab control of the main bus for a stretch of time, so their DMA controllers assert a control line telling the CPU to wait until the transfer is done. However, such transfers are relatively rare in a program, so the user is not likely to see a terrible slowdown. (Of course, relatively rare doesn't mean it might not happen many times every minute, but in relation to the billions of CPU instructions performed in between, it is relatively rare.)
A more common situation is for the DMA controller to grab control of the bus on behalf of the peripheral, causing the CPU to wait for one memory cycle while the peripheral proceeds. The bus is then released for the CPU's use, or for other peripherals, while the peripheral continues with its task of producing (or consuming) the next byte. This method is called cycle stealing because the DMA controller and its peripheral steal a memory cycle from the CPU. The CPU can always afford to wait a bit before continuing with its instruction processing, because nothing in the outside world is really dependent on it. In the case of a tape or a hard disk, the physical medium is moving at a constant speed and cannot be slowed down or stopped, due to inertia. If the data flies by without being caught, it is just gone. Not so with the CPU, which can stop electrons dead in their tracks for a while (electrons are extremely light and have little inertia). Thus priority is always given to DMA devices, and the CPU always allows its memory cycles to be stolen.
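A toy bus arbiter makes the interleaving concrete. This is a deliberately simplified model under invented assumptions: one transaction per cycle, the DMA request always wins its cycle (device priority), and the CPU takes every remaining cycle; it only shows that the CPU finishes a little later rather than losing anything.

```python
def run_bus(cpu_cycles_needed, dma_requests):
    """Toy bus arbiter for cycle stealing. dma_requests is the set of
    cycle numbers at which the peripheral has a word ready; the DMA
    controller always wins those cycles, and the CPU gets the rest."""
    timeline = []
    cpu_done = 0
    cycle = 0
    while cpu_done < cpu_cycles_needed:
        if cycle in dma_requests:
            timeline.append("DMA")   # this cycle is stolen from the CPU
        else:
            timeline.append("CPU")   # CPU makes progress
            cpu_done += 1
        cycle += 1
    return timeline

# A slow peripheral ready only every 4th cycle: the CPU still gets all
# 6 of its cycles, just stretched out by the 2 stolen ones.
t = run_bus(6, dma_requests={0, 4, 8})
print(t)  # ['DMA', 'CPU', 'CPU', 'CPU', 'DMA', 'CPU', 'CPU', 'CPU']
```

Because the peripheral is slow relative to the bus, the stolen cycles are sparse and the CPU's slowdown is modest.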
Another way in which DMA's impact on the CPU can be lessened is by caching the CPU's instructions in a separate cache, either on the processor chip or on a separate chip which is attached to the processor through a private local bus. In fact, both data and program instructions are cached, and if the hit ratio can be kept relatively high, the CPU will need to hit main memory very infrequently, leaving the main bus free for the I/O devices.
As can be seen, all the techniques we have studied fit together into one huge, fast system. Caching is beneficial to all concerned, DMA is handy and economical, and all sorts of fine tuning comes into play. For instance, a single memory read may transfer not just one byte but 4 or even 8 bytes over the main bus, which cuts down drastically on accesses to the main system bus. The peripheral's controller can be set up to hold 3 bytes as the actual device produces them, and then grab the data bus only when the 4th byte is ready, decreasing the time the peripheral needs on the system bus by a factor of 4. The only drawback is increased expense, because wider buses have more physical wires and cost more.
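The buffering payoff reduces to simple arithmetic. The sketch below assumes a bus wide enough to carry the whole buffer in one transaction, and counts how many times the peripheral's controller must grab the bus per block.

```python
def bus_grabs(total_bytes, buffer_width):
    """Number of bus transactions needed to move total_bytes if the
    peripheral's controller buffers buffer_width bytes per grab.
    Assumes the bus can carry buffer_width bytes in one transaction."""
    # Ceiling division: a final partial buffer still costs one grab.
    return -(-total_bytes // buffer_width)

print(bus_grabs(4096, 1))  # 4096 grabs: one per byte
print(bus_grabs(4096, 4))  # 1024 grabs: a 4x reduction
```

The factor-of-4 reduction comes directly from the 4-byte buffer; an 8-byte bus would halve it again, at the cost of more wires.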