ECE 385 Structure Lecture

This lecture covers:

Structure of this Course

The importance of Linux and open science was covered in the previous lecture. By now, you should have purchased or found an x86 computer that is usable in this course, and attempted to install Linux on it. You should be in the process of buying the parts (parallel port cables, LEDs, etc.) needed to complete the first lab.

Much of this course is about using micro-controllers to control other devices. At this point in your computer science student career, you have undoubtedly had a fair bit of experience programming. It is less likely that you have had a chance to use a computer to control other devices (like the LEDs in lab 0). This course will teach you how to do just that. We will also be writing simple device drivers which expand the kernel's ability to communicate with devices you build.

Embedded Systems

What you may come to realize after the first lab is that turning on LEDs under various conditions is a very simple task. Using a full computer with an operating system, fast processing, plenty of storage, etc., is unnecessary for it. We may want to slim down our resources so that we can deal with such problems without so much overhead and waste.

Embedded computing tailors the environment for a specific problem. In this course, we will be dealing extensively with AVR micro-controllers, produced by Atmel, and programming them on their STK500 development boards. Though the development boards are quite small, the AVR programs will still need to be developed on a host computer and downloaded to the AVR. Note that the computers you are using for the first three labs will be well suited for this task. However, if you happen to have a smaller computer (like a laptop or wearable computer) which is running Debian Linux, you may prefer to use this computer. The combination of a laptop with the STK500 would be a very portable and practical working environment. Just be sure that the computer has at least one serial port, as the STK500 uses a serial port for programming the AVR.

x86 Architecture

This course will start out using the x86 architecture rather than the AVR micro-controllers. The AVRs are easier to understand; however, they must be programmed using an external host, and this host is likely to be an x86 computer. As the x86 computer is necessary for this process anyway, the course will begin using just the x86 computer.

A (not too short) history of relevant microprocessors

Much of the following section was taken from an excellent web page written by John Bayko (john.bayko@sasktel.net). The web page may be found at http://www3.sk.sympatico.ca/jbayko/cpu.html

The Intel 4004, the first (Nov 1971) . .

The first single chip CPU was the Intel 4004, a 4-bit processor meant for a calculator. It processed data in 4 bits, but its instructions were 8 bits long. Program and Data memory were separate, 1K data memory and a 12-bit PC for 4K program memory (in the form of a 4 level stack, used for CALL and RET instructions). There were also sixteen 4-bit (or eight 8-bit) general purpose registers.

The 4004 had 46 instructions, using only 2,300 transistors in a 16-pin DIP. It ran at a clock rate of 740kHz (eight clock cycles per CPU cycle of 10.8 microseconds) - the original goal was 1MHz, to allow it to compute BCD arithmetic as fast (per digit) as a 1960's era IBM 1620.
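
The timing figures above can be checked with a little arithmetic. The following is only a sketch: the function and variable names are made up for illustration, and the inputs are the 740kHz clock and eight clocks per CPU cycle quoted in the text.

```python
def cpu_cycle_microseconds(clock_hz, clocks_per_cycle):
    """Length of one CPU cycle in microseconds."""
    return clocks_per_cycle / clock_hz * 1e6

# Figures quoted in the text for the 4004.
cycle_us = cpu_cycle_microseconds(740_000, 8)
print(f"{cycle_us:.1f} microseconds per CPU cycle")
```

The result rounds to the 10.8 microseconds given above.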

The 4040 (1972) was an enhanced version of the 4004, adding 14 instructions, larger (8 level) stack, 8K program space, and interrupt abilities (including shadows of the first 8 registers). Should Pioneer 10 and Pioneer 11 ever be found by an extraterrestrial species, the 4004 will represent an example of Earth's technology.

[for additional information, see Appendix E]


Some have suggested that the MP944 digital processor used for the F-14 Tomcat aircraft of the U.S. Navy qualifies as the "first microprocessor". Although interesting, it was not a single-chip processor, and was not general purpose - it was more like a set of parallel building blocks you could use to make a special purpose digital signal processor from (in the form of one or more data pipelines in parallel). It's only included here because at least two people asked me about it.

It was bit serial to reduce connections between chips, with highly parallel design and high clock rate to compensate. Words were 20 bits (required by the precision of the sensor and control values) and ALU units could perform operations on input bits as they were read in, while bits of the previous result were read out. "Steering Logic" (SL) units switched input signals to output lines (and added or subtracted, if two inputs went to one output), which could be directed to multiplication, division, and special logic units (which acted a little like a Transfer Triggered Architecture). Bits read serially from the ROMs (eight banks with 128 20-bit words, each with its own program counter) directed the data movement and unit operations, but had to be synchronized with data movement, making programming difficult (basically microcode). RAM (called Random Access Storage) consisted of units with sixteen 20-bit words. Programming consisted of using the SLs to direct instruction and data words to the function units, which could be hooked to other function units in a pipeline, along with other pipelines in parallel. A separate set of eight ROMs could be used for data.

It took until 1998 to declassify a paper on the 1970 design. Although impressively elegant, it probably didn't warrant that length of secrecy.


Intel Corporation:
http://www.intel.com/
Intel 25th Anniversary of the Microprocessor:
http://www.intel.com/intel/museum/25anniv/index.htm

F14 Links:
http://www.microcomputerhistory.com/f14patterson.htm


Part II: TMS 1000, First microcontroller (1974) .

Texas Instruments followed the Intel 4004/4040 closely with the 4-bit TMS 1000, which was the first microprocessor to include enough RAM, space for a program ROM, and I/O support on a single chip to allow it to operate without multiple external support chips, making it the first microcontroller. It also offered an innovative way to add custom instructions to the CPU.

It included a 4-bit accumulator, 4-bit Y register and 2 or 3-bit X register, which combined to create a 6 or 7 bit index register for the 64 or 128 nibbles of on chip RAM. A 1-bit status register was used for various purposes in different contexts. The 6-bit PC combined with a 4 bit page register and an optional 1 bit bank ('chapter') register to produce 10 or 11 address bits to 1KB or 2KB of on-chip program ROM. There was also a 6-bit subroutine return register and 4-bit page buffer, used as the destination on a branch, or exchanged with the PC and page registers for a subroutine (amounting to a 1-element stack, branches could not be performed within a subroutine).
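
The bit counts above can be made concrete by concatenating the register fields. This is a minimal sketch, not taken from TI documentation: the function names are hypothetical, and the field ordering (chapter above page above PC, and X above Y) is an assumption chosen to match the totals given in the text.

```python
def rom_address(pc, page, chapter=0):
    """Join chapter (1 bit), page (4 bits) and PC (6 bits) into one address.

    The ordering of the fields is an assumption made for illustration.
    """
    assert 0 <= pc < 64 and 0 <= page < 16 and chapter in (0, 1)
    return (chapter << 10) | (page << 6) | pc

def ram_index(x, y):
    """Join X (2 or 3 bits) and Y (4 bits) into an index of RAM nibbles."""
    assert 0 <= x < 8 and 0 <= y < 16
    return (x << 4) | y

# Highest addresses reachable without and with the chapter bit.
print(rom_address(63, 15), rom_address(63, 15, 1))
# Highest RAM nibble with a 2-bit and a 3-bit X register.
print(ram_index(3, 15), ram_index(7, 15))
```

With every field at its maximum, the ROM address tops out at 1023 or 2047 and the RAM index at 63 or 127, which matches the 1KB/2KB and 64/128 nibble figures.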

An interesting feature of the PC is that it was incremented using a feedback shift register, not a counter, so instructions were not consecutive in memory, but since all memory was internal, this was not a problem. Instructions were 8 bits, with twelve hardwired and a 31x16 element PLA allowing 31 custom microprogrammed instructions. All hardwired instructions were single cycle, and no interrupts were allowed.
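
To see why a feedback shift register produces non-consecutive addresses, here is a small sketch of a generic 6-bit linear feedback shift register. The feedback taps are an assumption picked so that the register cycles through 63 values; they are not claimed to be the taps TI actually used in the TMS 1000, and the function names are hypothetical.

```python
def lfsr_next(state):
    """Advance a 6-bit shift register by one step.

    The new low bit is the XOR of the two highest bits (illustrative taps,
    not necessarily those of the TMS 1000).
    """
    feedback = ((state >> 5) ^ (state >> 4)) & 1
    return ((state << 1) | feedback) & 0x3F

def pc_sequence(start=1):
    """List the values visited before the register returns to `start`."""
    sequence = [start]
    state = lfsr_next(start)
    while state != start:
        sequence.append(state)
        state = lfsr_next(state)
    return sequence

sequence = pc_sequence()
print(len(sequence), sequence[:8])
```

Each step needs only a shift and one XOR gate, which is cheaper in hardware than a full incrementer. The cost is that "the next instruction" lives at a scrambled address, which only the tool that lays out the ROM needs to know about.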

It gained fame in the movie "E.T. the Extra-Terrestrial" as the brains in the Texas Instruments "Speak and Spell" educational toy.


Texas Instruments:
http://www.ti.com/
TMS 1000 One-Chip Microcomputers:
http://www.ti.com/corp/docs/history/tms.htm

Part III: The Intel 8080 (April 1974) . . .

The 8080 was the successor to the 8008 (April 1972, intended as a terminal controller, and similar to the 4040). While the 8008 had 14 bit PC and addressing, the 8080 had a 16 bit address bus and an 8 bit data bus. Internally it had seven 8 bit registers (A-E, H, L - pairs BC, DE and HL could be combined as 16 bit registers), a 16 bit stack pointer to memory which replaced the 8 level internal stack of the 8008, and a 16 bit program counter. It also had several I/O ports - 256 of them, so I/O devices could be hooked up without taking away or interfering with the addressing space, and a signal pin that allowed the stack to occupy a separate bank of memory.
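
The register pairing described above amounts to treating one register as the high byte and the other as the low byte of a 16 bit value. A minimal sketch follows; the helper names and the sample register contents are made up for illustration.

```python
def pair(high, low):
    """Combine two 8 bit registers into one 16 bit value (high byte first)."""
    return ((high & 0xFF) << 8) | (low & 0xFF)

def split(value):
    """Split a 16 bit value back into (high, low) 8 bit halves."""
    return (value >> 8) & 0xFF, value & 0xFF

# Sample register contents, chosen arbitrarily.
registers = {"B": 0x12, "C": 0x34, "H": 0x80, "L": 0x01}
bc = pair(registers["B"], registers["C"])
hl = pair(registers["H"], registers["L"])
print(hex(bc), hex(hl))
```

In the BC, DE and HL pairs the first-named register holds the high byte, so HL can hold any address on the 16 bit address bus.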

The 8080 was used in the Altair 8800, the first widely-known personal computer (though the definition of 'first PC' is fuzzy. Some claim that the 12-bit LINC (Laboratory INstrument Computer) was the first 'personal computer'. Developed at MIT (Lincoln Labs) in 1963 using DEC components, it inspired DEC to design its own PDP-8 in 1965, also considered an early 'personal computer'. 'Home computer' would probably be a better term here, though).

Intel updated the design with the 8085 (1976), which added two instructions to enable/disable three added interrupt pins (and the serial I/O pins), and simplified hardware by only using +5V power, and adding clock generator and bus controller circuits on-chip.


Intel Corporation:
http://www.intel.com/
Intel 25th Anniversary of the Microprocessor:
http://www.intel.com/intel/museum/25anniv/index.htm

Laboratory Instrument Computer (LINC) - Exhibits and Galleries:
http://www.nih.gov/od/museum/exhibits/linc/

MITS/Pertec Altair 8800/680b/MITS 300
http://exo.com/~wts/wts10005.HTM


Part IV: The Zilog Z-80 - End of an 8-bit line (July 1976) . . . .

The Z-80 was intended to be an improved 8080 (designed by ex-Intel engineers), and it was - vastly improved. It also used 8 bit data and 16 bit addressing, and could execute all of the 8080 (but not 8085) op codes, but included 80 more instructions (1, 4, 8 and 16 bit operations and even block move and block I/O). The register set was doubled, with two banks of data registers (including A and F) that could be switched between. This allowed fast operating system or interrupt context switches. The Z-80 also added two index registers (IX and IY) and 2 types of relocatable vectored interrupts (direct or via the 8-bit I register).

Clock speeds ranged from the original Z-80 2.5MHz to the Z80-H (later called Z80-C) at 8MHz, and later a CMOS version at 10MHz.

Like many processors (including the 8085), the Z-80 featured many undocumented instructions. In some cases, they were a by-product of early designs (which did not trap invalid op codes, but tried to interpret them as best they could), and in other cases chip area near the edge was used for added instructions, but fabrication made the failure rate high. Instructions that often failed were just not documented, increasing chip yield. Later fabrication made these more reliable.

But the thing that really made the Z-80 popular in designs was the memory interface - the CPU generated its own RAM refresh signals, which meant easier design and lower system cost, the deciding factor in its selection for the TRS-80 Model 1. That and its 8080 compatibility, and CP/M, the first standard microprocessor operating system, made it the first choice of many systems.

Embedded variants of the Z-80 were also produced. Hitachi produced the 64180 (1984) with added components (two 16 bit timers, two DMA controllers, three serial ports, and a segmented MMU mapping a 20 bit (1M) address space to any three variable sized segments in the 16 bit (64K) Z-80 memory map), a design Zilog and Hitachi later refined to produce the Z-180 and HD64180Z (1987?) which were compatible with Z-80 peripheral chips, plus variants (Z-181, Z-182). The Z-280 was a 16 bit version introduced about July, 1987 (loosely based on the ill-fated Z-800), with a paged (like Z-180) 24 bit (16M) MMU (8 or 16 bit bus resizing), user/supervisor modes and features for multitasking, a 256 byte (4-way) cache, 4 channel DMA, and a huge number of new op codes tacked on (total of almost 3,500, including previously undocumented Z-80 instructions), though the size made some very slow. Internal clock could be run at twice the external clock (ex. 16MHz CPU with an 8MHz bus), and additional on-chip components were available. A 16/32 bit Z-380 version also exists (1994) with added 32-bit linear addressing mode (16-bit mode is Z-80 and Z-180 binary compatible, but not Z-280 compatible).

Rabbit Semiconductor's Rabbit 2000 (1999/2000?) has a Z-80 derived instruction set which drops some instructions (mostly I/O, some less useful instructions), and adds others (16-bit data, computed address). It also drops dynamic RAM support, because embedded systems more often use static RAM, and adds serial, parallel, and inter-processor communication units. Program space is extended to 20 bits using an 8-bit page register, rather than the Z-180's MMU.

The Z-8 (1979) was an embedded processor with on-chip RAM (actually a set of 124 general and 20 special purpose registers) and ROM (often a BASIC interpreter), and is available in a variety of custom configurations up to 20MHz. It is not actually related to the Z-80.


Zilog Corporation:
http://www.zilog.com/
Rabbit Semiconductor:
http://www.rabbitsemiconductor.com/
Rabbit 2000 Microprocessor:
http://www.rabbitsemiconductor.com/documentation/docs/RABBITMAN/rabbit.htm
Retrocomputing TRS-80's:
http://www.simology.com/smccoy/trs.html

Part V: The 650x, Another Direction (1975) . . .

Shortly after Intel's 8080, Motorola introduced the 6800. Some of the designers (notably Chuck Peddle) left to start MOS Technology (later bought by Commodore), which introduced the 650x series which included the 6501 (pin compatible with the 6800, taken off the market almost immediately for legal reasons) and the 6502 (used in early Commodores, Apples and Ataris). Like the 6800 series, variants were produced which added features like I/O ports (6510 in the Commodore 64) or reduced costs with smaller address buses (6507 13-bit 8K address bus in the Atari 2600). The 650x was little endian (lower address byte could be added to an index register while higher byte was fetched) and had a completely different instruction set from the big endian 6800. Apple designer Steve Wozniak described it as the first chip you could get for less than a hundred dollars (actually a quarter of the 6800 price) - it became the CPU of choice for many early home computers (8 bit Commodore and Atari products).

Unlike the 8080 and its kind, the 6502 (and 6800) had very few registers. It was an 8 bit processor, with a 16 bit address bus. Inside was one 8 bit data register, two 8 bit index registers, and an 8 bit stack pointer (the stack was preset from address 256 ($100 hex) to 511 ($1FF)). It used these index and stack registers effectively, with more addressing modes, including a fast zero-page mode that accessed memory addresses from address 0 to 255 ($FF) with an 8-bit address, which sped up operations (it didn't have to fetch a second byte for the address).
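
The fixed stack page and the zero-page mode both come down to forming a 16 bit address from a single 8 bit value. The sketch below models just that much; the class and function names are hypothetical, and starting the stack pointer at $FF is an assumption made for the example rather than a statement about the chip's reset behaviour.

```python
STACK_PAGE = 0x0100  # the stack always lives in $0100-$01FF

class Stack6502:
    """Toy model of the 6502 stack: an 8 bit pointer into a fixed page."""

    def __init__(self):
        self.memory = bytearray(0x10000)  # 64K address space
        self.sp = 0xFF                    # assumed starting value

    def push(self, value):
        # Store at $0100 + SP, then move the pointer down (wrapping at 8 bits).
        self.memory[STACK_PAGE | self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF

    def pull(self):
        # Move the pointer up, then read the byte it now points at.
        self.sp = (self.sp + 1) & 0xFF
        return self.memory[STACK_PAGE | self.sp]

def zero_page_address(operand):
    """A zero-page operand is the whole address: the high byte is always 0."""
    return operand & 0xFF

cpu = Stack6502()
cpu.push(0x42)
print(hex(cpu.sp), hex(cpu.memory[0x01FF]), hex(zero_page_address(0x80)))
```

Because the high byte is implied in both cases, the instruction carries one address byte instead of two, which is where the speed advantage mentioned above comes from.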

Back when the 6502 was introduced, RAM was actually faster than microprocessors, so it made sense to optimize for RAM access rather than increase the number of registers on a chip. It also had a lower gate count (and cost) than its competitors.

The 650x also had undocumented instructions, including JAM, which simply causes the CPU to freeze, requiring a hardware reset or power cycle to restart.

The CMOS 65C02/65C02S fixed some original 6502 design flaws, and the 65816 (officially W65C816S, both designed by Bill Mensch of Western Design Center Inc.) extended the 650x to 16 bits internally, including index and stack registers, with a 16-bit direct page register (similar to the 6809), and 24-bit address bus (16 bit registers plus 8 bit data/program bank registers). It included an 8-bit emulation mode. Microcontroller versions of both exist, and a 32-bit version (the 65832) is planned. Various licensed versions are supplied by GTE (16 bit G65SC802 (pin compatible with 6502), and G65SC816 (support for VM, I/D cache, and multiprocessing)) and Rockwell (R65C40), and Mitsubishi has a redesigned compatible version. The 6502 remains surprisingly popular largely because of the variety of sources and support for it.

The 6502-based Apple II line (not backwards compatible with the Apple I) was among the first microcomputers introduced and became the longest running PC line, eventually including the 65816-based Apple IIgs. The 6502 was also used in the Nintendo Entertainment System (NES), and the 65816 was in the 16-bit successor, the Super NES, before Nintendo switched to MIPS embedded processors.


The Western Design Center, Inc.:
http://www.westerndesigncenter.com/
Apple II History Home:
http://apple2history.org/
The Secret Weapons of Commodore!:
http://www.floodgap.com/retrobits/ckb/secret/
www.6502.org
http://www.6502.org/

Part VI: The 6809, extending the 680x (1977) . . . . . . . .

Like the 6502, the 6809 was based on the Motorola 6800 (August 1974), though the 6809 expanded the design significantly. The 6809 had two 8 bit accumulators (A & B) and could combine them into a single 16 bit register (D). It also featured two index registers (X & Y) and two stack pointers (S & U), which allowed for some very advanced addressing modes (The 6800 had A & B (and D) accumulators, one index register and one stack register). The 6809 was source compatible with the 6800, even though the 6800 had 78 instructions and the 6809 only had around 59. Some instructions were replaced by more general ones which the assembler would translate, and some were even replaced by addressing modes. While the 6800 and 6502 both had a fast 8 bit mode to address the first 256 bytes of RAM, the 6809 had an 8 bit Direct Page register to locate this fast address page anywhere in the 64K address space.
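
The Direct Page register can be pictured as supplying the high byte of the address while the instruction supplies the low byte. A minimal sketch, with a hypothetical function name:

```python
def direct_address(dp, offset):
    """Form a 16 bit address from the Direct Page register and an 8 bit offset."""
    return ((dp & 0xFF) << 8) | (offset & 0xFF)

# With DP = 0 the fast page is the first 256 bytes, as on the 6800 and 6502.
print(hex(direct_address(0x00, 0x10)))
# Changing DP moves the same fast page elsewhere in the 64K space.
print(hex(direct_address(0x20, 0x10)))
```

Setting DP to zero reproduces the fixed fast page of the 6800 and 6502, so the 6809 scheme is a strict generalisation of theirs.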

Other features were one of the first multiplication instructions of the time, 16 bit arithmetic, and a special fast interrupt. But it was also highly optimized, gaining up to five times the speed of the 6800 series CPU. Like the 6800, it included the undocumented HCF (Halt Catch Fire) instruction to incrementally strobe the address lines for bus testing ("jump to accumulator (A or B)" in the 6800, implemented and documented as $00 in the 68HC11 which is described below).

The 6800 and 6809, like the 6502 series, used a single clock cycle to generate the timing for four internal execution stages by using the rising and falling edges of the base cycle (not just rising edges), and another clock 90 degrees out of phase (giving two rising and two falling edges per cycle) - this allowed instructions to execute in one external 'cycle' rather than four for most CPUs, such as the 8080, which used the external clock directly, so an equivalent instruction would take four cycles, meaning a 2MHz 6809 would be roughly equivalent to an 8MHz 8080. This is different from clock-doubling, which uses a phase-locked-loop to generate a faster internal clock (for the CPU) which is synchronized with an external clock (for the bus). Motorola later produced CPUs in this line with a standard four-cycle clock. The 680x and 650x only accessed memory every other cycle, allowing a peripheral (such as video, or even a second CPU) to access the same memory without conflict.

The 6800 lived on as well, becoming the 6801/3, which included ROM, some RAM, a serial I/O port, and other goodies on the chip (as an embedded controller, minimizing part counts - but expensive at 35,000 transistors. The 6805 was a cheaper 6801/3, dropping seldom used instructions and features). Later the 68HC11 version (two 8 bit/one 16 bit data register, two 16 bit index, and one 16 bit stack register, and an expanded instruction set with 16 bit multiply operations) was extended to 16 bits as the 68HC16 (additional 16-bit accumulator E, three index registers IX, IY, IZ, plus extension registers to add 4 bits to addresses and accumulator E for a 1M address space, plus 16-bit multiply registers HR and IR and 36-bit AM accumulator), and a lower cost 16 bit 68HC12 (May 1996). It remains a popular embedded processor (with over 2 billion 6800 variants sold), and radiation hardened versions of the 68HC11 have been used in communications satellites. But the 6809 was a very fast and flexible chip for its time, particularly with the addition of the OS-9 operating system.


As a note, Hitachi produced a version called the 6309. Compatible with the 6809, it added 2 new 8-bit registers (E and F) that could be combined to form a second 16 bit register (W), and all four 8-bit registers could form a 32 bit register (Q). It also featured hardware division, and some 32 bit arithmetic, a zero register (always 0 on read), block move, and was generally 30% faster in native mode. Also, unlike the 6809, the 6309 could trap on an illegal instruction. These enhancements, surprisingly, never appeared in official Hitachi documentation.

Motorola:
http://www.mot.com/
Motorola Micro-controllers:
http://www.mcu.motsps.com/mc.html
TRS-80 & Tandy Color Computer Homepage:
http://zeppelin.tzo.cc/coco/coco.jhtml

Part VII: Advanced Micro Devices Am2901, a few bits at a time . .

Bit slice processors were modular processors. Mostly, they consisted of an ALU of 1, 2, 4, or 8 bits, and control lines (including carry or overflow signals usually internal to the CPU). Two 4-bit ALUs could be arranged side by side, with control lines between them, to form an ALU of 8-bits, for example. A sequencer would execute a program to provide data and control signals.

The Am2901, from Advanced Micro Devices, was a popular 4-bit-slice processor. It featured sixteen 4-bit registers and a 4-bit ALU, and operation signals to allow carry/borrow or shift operations and such to operate across any number of other 2901s. An address sequencer (such as the 2910) could provide control signals with the use of custom microcode in ROM.
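
The way slices combine into a wider ALU can be sketched by chaining the carry out of each 4-bit slice into the next one. This models only addition and is not a description of the Am2901's actual signals; the function names are hypothetical.

```python
def add_slice(a, b, carry_in):
    """One 4-bit slice: add two nibbles plus carry, return (sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4

def add_wide(a, b, slices=2):
    """Add two values using `slices` 4-bit slices linked by their carry lines."""
    result, carry = 0, 0
    for i in range(slices):
        nibble, carry = add_slice(a >> (4 * i), b >> (4 * i), carry)
        result |= nibble << (4 * i)
    return result, carry

print(add_wide(0x3C, 0x55))               # two slices form an 8-bit adder
print(add_wide(0x0FFF, 0x0001, slices=4)) # four slices form a 16-bit adder
```

The word width is set purely by how many slices are wired together, which is what made bit-slice parts attractive for building custom processors.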

The Am2903 featured hardware multiply.

Legend holds that some Soviet clones of the PDP-11 were assembled from Soviet clones of the Am2901.


Since it doesn't fit anywhere else in this list, I'll mention it here...

AMD also produced what is probably the first floating point "coprocessor" for microprocessors, the AMD 9511 "arithmetic circuit" (1979), which performed 32 bit (23 + 7 bit floating point) RPN-style operations (4 element stack) under CPU control - the 64-bit 9512 (1980) lacked the transcendental functions. It was based on a 16-bit ALU, performed add, subtract, multiply, and divide (plus sine and cosine), and while faster than software on microprocessors of the time (about 4X speedup over a 4MHz Z-80), it was much slower (at 200+ cycles for 32*32->32 bit multiply) than more modern math coprocessors are.

It was used in some CP/M (Z-80) systems (I heard it was used on an S-100 bus math card for NorthStar systems, but that card in fact used a 74181 BCD (Binary Coded Decimal) ALU, and ten PROM chips for microcode). Calculator circuits (such as the National Semiconductor MM57109 (1980), actually a 4-bit NS COP400 processor with floating point routines in ROM) were also sometimes used, with emulated keypresses sent to it and results read back, to simplify programming rather than for speed.


Contents - Bit Slice Design - written using AMD's 2900 Series - D.E.White:
http://www.dacafe.com/DACafe/EDATools/EDAbooks/BitSlice/bitslcC.html

Part VIII: Intel 8051, Descendant of the 8048 (around 1977?). . . .

Initially similar to the Fairchild F8, the Intel 8048 was also designed as a microcontroller rather than a microprocessor - low cost and small size were the main goals. For this reason, data is stored on-chip, while program code is external (a true Harvard architecture, although program and data use the same address lines). The 8048 was eventually replaced by the very popular but bizarre 8051 and 8052, available with on-chip program ROMs (the 8031 version still used external ROMs).

While the 8048 used 1-byte instructions, the 8051 has a more flexible 2-byte instruction set. It has eight 8-bit registers, plus an accumulator A. Data space is 128 bytes accessed directly or indirectly by a register, plus another 128 above that in the 8052 which can only be accessed indirectly (usually for a stack). External memory occupies the same address space, and can be accessed directly (in a 256 byte page via I/O ports) or through the 16 bit DPTR address register much like in the RCA 1802. Direct data above location 32 is bit-addressable. Data and program memory share the address space (and address lines, when using external memory). Although complicated, these memory models allow flexibility in embedded designs, making the 8051 very popular (over 1 billion sold since 1988).
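
Bit addressing can be illustrated by mapping a bit address onto a byte and a bit position within it. The sketch below assumes the usual 8051 layout, in which bit addresses 0 to 127 cover the sixteen bytes starting at location 32; that range is not spelled out in the text above, and the function names are hypothetical.

```python
BIT_AREA_BASE = 0x20  # location 32, start of the bit-addressable bytes

def bit_location(bit_address):
    """Map a bit address (0-127) to (byte address, bit number)."""
    if not 0 <= bit_address <= 0x7F:
        raise ValueError("bit address out of range for internal RAM")
    return BIT_AREA_BASE + bit_address // 8, bit_address % 8

def set_bit(ram, bit_address):
    """Set a single bit in the internal RAM image."""
    byte, bit = bit_location(bit_address)
    ram[byte] |= 1 << bit

ram = bytearray(128)
set_bit(ram, 0x0A)
print(bit_location(0x0A), hex(ram[0x21]))
```

Single-bit access of this kind lets a program keep flags or drive individual control lines without a read, mask, and write sequence in software.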

The Siemens 80C517 adds a math coprocessor to the CPU which provides 16 and 32 bit integer support plus basic floating point assistance (32 bit normalize and shift), reminiscent of the old AMD 9511. The Texas Instruments TMS370 is similar to the 8051, adding a B accumulator and some 16 bit support.


Intel Corporation:
http://www.intel.com/
Embedded Intel(R) Architecture Micro-controllers:
http://developer.intel.com/design/embcontrol/


Part IX: Microchip Technology PIC 16x/17x, call it RISC (1975) . . .

The roots of the PIC lie in a design from Harvard University (see Harvard Architecture) for a Defense Department project, which was beaten by a simpler (and more reliable at the time) single memory design from Princeton. Harvard Architecture was first used in the Signetics 8x300, and was adapted by General Instruments for use as a peripheral interface controller (PIC) which was designed to compensate for poor I/O in its 16 bit CP1600 CPU. The micro-electronics division was eventually spun off into Arizona Microchip Technology (around 1985), with the PIC as its main product.

The PIC has a large register set (from 25 to 192 8-bit registers, compared to the Z-8's 144). There are up to 31 direct registers, plus an accumulator W, though R1 to R8 also have special functions - R2 is the PC (with implicit stack (2 to 16 level)), and R5 to R8 control I/O ports. R0 is mapped to whichever register R4 (FSR) points to (similar to the ISAR in the F8, it's the only way to access R32 or above).
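
The R0/R4 arrangement is a form of indirect addressing: an access to R0 is redirected to the register whose number is held in R4. A toy model follows, with a hypothetical class name and a file size chosen only for the example.

```python
class RegisterFile:
    """Toy PIC-style register file with indirect access through R0."""

    INDF = 0  # accessing R0 really accesses the register FSR points to
    FSR = 4   # R4 holds the number of that register

    def __init__(self, size=64):
        self.regs = [0] * size

    def _resolve(self, n):
        # Redirect R0 to the register selected by FSR (wrapped to file size).
        if n == self.INDF:
            return self.regs[self.FSR] % len(self.regs)
        return n

    def read(self, n):
        return self.regs[self._resolve(n)]

    def write(self, n, value):
        self.regs[self._resolve(n)] = value & 0xFF

rf = RegisterFile()
rf.write(4, 40)     # point FSR at R40, which is beyond the direct range
rf.write(0, 0x5A)   # a write to R0 lands in R40
print(rf.read(0), rf.regs[40])
```

This is how a chip whose instructions can only name a few dozen registers directly still reaches the rest of the register file.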

The 16x is very simple and RISC-like (but less so than the RCA 1802 or the more recent Atmel AVR microcontroller). It has only 33 fixed length 12-bit instructions, including several with a skip-on-condition flag to skip the next instruction (for loops and conditional branches), producing tight code important in embedded applications. It's marginally pipelined (2 stages - fetch and execute) - combined with single cycle execution (except for branches - 2 cycles), performance is very good for its processor category.

The 17x has more addressing modes (direct, indirect, and relative - indirect mode instructions take 2 execution cycles), more instructions (58 16-bit), more registers (232 to 454), plus up to 64K-word program space (2K to 8K on chip). The high end versions also have single cycle 8-bit unsigned multiply instructions.

The PIC 16x is an interesting look at an 8 bit design made with slightly newer design techniques than other 8 bit CPUs in this list - around 1978 by General Instruments (the 1650, a successor to the more general 1600). It lost out to more popular CPUs and was later sold to Microchip Technology, which still sells it for small embedded applications. An example of this microprocessor is a small PC board called the BASIC Stamp, consisting of 2 ICs - an 18-pin PIC 16C56 CPU (with a BASIC interpreter in 512 word ROM (yes, 512)) and 8-pin 256 byte serial EEPROM (also made by Microchip) on an I/O port where user programs (about 80 tokenized lines of BASIC) are stored.


Microchip Technology:
http://www.microchip.com/
PIC web server:
http://www-ccs.cs.umass.edu/~shri/iPic.html

Part X: Atmel AVR - RISC ridiculously small (June 1997) .

There's not much to say about the 8-bit Atmel AVR microcontroller, an attempt to bring RISC design down to 8-bit levels. It's a canonical simple load-store design - 16-bit instructions, 2-stage pipeline, thirty-two 8-bit data registers (six usable as three 16-bit X, Y, and Z address registers), load/store architecture (plus data/subroutine stack).
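
Since the AVR is the chip used in this course, it is worth seeing how six of the 8-bit registers double as three 16-bit address registers. The pairing used below (X is R27:R26, Y is R29:R28, Z is R31:R30, low byte in the lower-numbered register) comes from the AVR architecture rather than from the text above, and the helper names are hypothetical.

```python
# (low register, high register) for each 16-bit address register.
POINTER_PAIRS = {"X": (26, 27), "Y": (28, 29), "Z": (30, 31)}

def read_pointer(regs, name):
    """Read X, Y or Z as a 16-bit value from the 32-entry register file."""
    low, high = POINTER_PAIRS[name]
    return (regs[high] << 8) | regs[low]

def write_pointer(regs, name, value):
    """Store a 16-bit value into the register pair backing X, Y or Z."""
    low, high = POINTER_PAIRS[name]
    regs[low] = value & 0xFF
    regs[high] = (value >> 8) & 0xFF

regs = [0] * 32
write_pointer(regs, "Z", 0x1234)
print(hex(regs[30]), hex(regs[31]), hex(read_pointer(regs, "Z")))
```

The same registers remain usable as ordinary 8-bit data registers, so a pointer costs nothing when it is not needed.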


Atmel:
http://eu.atmel.com/

Part X: DEC PDP-11, benchmark for the first 16/32 bit generation. (1970) . . . .

The DEC PDP-11 was the most popular in the PDP (Programmed Data Processors) line of minicomputers, a successor to the previously popular PDP-8, designed in part by Gordon Bell. It remained in production until the decision to discontinue the line as of September 30, 1997 (over 25 years - see note on the DEC Alpha intended lifetime). Many of the PDP-11 features have been carried forward to newer processors because the PDP-11 was the basis for the C programming language, which became the most prolific programming language in the world (in terms of variety of applications, not number) and which includes several low level processor dependent features which were useful to replicate in newer CPUs for this reason.

The PDP-8 continued for a while in certain applications, while the PDP-10 (1967) was a higher capacity 36-bit mainframe-like system (sixteen general registers and floating point operations), much adored and rumored to have souls.

The PDP-11 had eight general purpose 16-bit registers (R0 to R7 - R6 was also the SP and R7 was the PC). It featured powerful register oriented (little-endian, byte addressable) addressing modes. Since the PC was treated as a general purpose register, constants were loaded using an indirect mode on R7 which had the effect of loading the 16 bit word following the current instruction, then incrementing the PC to the next instruction before fetching. The SP could be accessed the same way (and any register could be used for a user stack (useful for FORTH)). A CC (or PSW) register held results from every instruction that executed.
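
The trick of loading constants through R7 can be sketched as a read-then-increment access applied to the PC. This is a simplified model, not an emulator: the function name is hypothetical, memory is a plain byte array, and the instruction word used is the octal encoding of MOV #n, R0.

```python
PC = 7  # R7 is the program counter

def fetch_word(memory, regs, n):
    """Read the 16-bit word register n points to, then advance it by 2."""
    address = regs[n]
    word = memory[address] | (memory[address + 1] << 8)  # little-endian
    regs[n] = (address + 2) & 0xFFFF
    return word

memory = bytearray(8)
memory[0:2] = (0o012700).to_bytes(2, "little")  # MOV #n, R0
memory[2:4] = (0x1234).to_bytes(2, "little")    # the constant n
regs = [0] * 8

opcode = fetch_word(memory, regs, PC)    # instruction fetch moves PC to 2
constant = fetch_word(memory, regs, PC)  # operand fetch through PC moves it to 4
print(oct(opcode), hex(constant), regs[PC])
```

No dedicated "immediate" mode is needed: the ordinary read-and-advance access, applied to the PC, picks up the constant and leaves the PC at the next instruction. The same function applied to R6 would pop a word from the stack.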

Adjacent registers could be implicitly grouped into a 32 bit register for multiply and divide results (Multiply result stored in two registers if destination is an even register, not if it's odd. Divide source must be grouped - quotient is stored in high order (low number) register, remainder in low order).

A floating point unit could be added which contains six 64 bit accumulators (AC0 to AC5, can also be used as six 32-bit registers - values can only be loaded or stored using the first four registers).

PDP-11 addresses were 16 bits, limiting program space to 64K, though an MMU could be used to expand total address space (18-bits and 22-bits in different PDP-11 versions).

The LSI-11 (1975-ish) was a popular microprocessor implementation of the PDP-11 using the Western Digital MCP1600 microprogrammable CPU, and the architecture influenced the Motorola 68000, NS 320xx, and Zilog Z-8000 microprocessors in particular. There was also a 32-bit PDP-11 plan as far back as its 1969 introduction. The PDP-11 was finally replaced by the VAX architecture (early versions included a PDP-11 emulation mode, and were called VAX-11).


PDP-11 FAQ:
http://www.village.org/pdp11/faq.pages/faq.html

Part XI: Motorola 68000, a refined 16/32 bit CPU (September 1979) . . . . . . . . .

The initial 8MHz 68000 was actually a 32 bit architecture internally, but had only a 16 bit data bus and 24 bit address bus to fit in a 64 pin package (address and data shared a bus in the 40 pin packages of the 8086 and Z-8000). Later the 68008 reduced the data bus to 8 bits and address to 20 bits (very slow and not used for much - the cheap and quirky Sinclair QL being the most prominent), and the 68020 was fully 32 bit externally. Addresses were computed as 32 bits (without using segment registers) - unused upper bits in the 68000 or 68008 were ignored, but some programmers stored type tags in the upper 8 bits, causing compatibility problems with the 68020's 32 bit addresses. Lack of forced segments made programming the 68000 easier than some competing processors, without the 64K size limit on directly accessed arrays or data structures.
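
The compatibility problem with type tags can be shown by masking a 32 bit pointer down to the number of address lines each chip actually drives. A minimal sketch with hypothetical names:

```python
ADDRESS_MASK_68000 = 0x00FFFFFF  # only 24 address lines leave the chip

def bus_address_68000(pointer):
    """The upper 8 bits of a pointer never reach the 68000's address bus."""
    return pointer & ADDRESS_MASK_68000

def bus_address_68020(pointer):
    """The 68020 drives all 32 bits, so the upper bits now matter."""
    return pointer & 0xFFFFFFFF

def tag_pointer(address, tag):
    """Hide an 8-bit type tag in the otherwise unused top byte."""
    return ((tag & 0xFF) << 24) | (address & ADDRESS_MASK_68000)

p = tag_pointer(0x001000, 0x7F)
print(hex(bus_address_68000(p)), hex(bus_address_68020(p)))
```

On the 68000 the tagged pointer still reaches the intended location, while on the 68020 the same value selects a completely different address.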

Looking back it was a logical design decision, since most 8 bit processors featured direct 16 bit addressing without segments.

The 68000 had sixteen 32-bit registers, split into eight data and eight address registers. One address register was reserved for the Stack Pointer. Data registers could be used for any operation, including offset from an address register, but not as the source of an address itself. Operations on address registers were limited to move, add/subtract, or load effective address.

Like the Z-8000, the 68000 featured a supervisor and user mode (each with its own Stack Pointer). The Z-8000 and 68000 were similar in capabilities, but the 68000 was 32 bit internally (using 16 bit ALUs, which made some 32-bit operations slower than 16-bit ones - two in parallel for 32-bit data, one for addresses), making it faster and eliminating forced segments. It was designed for expansion, including specifications for floating point and string operations (floating point was added in the 68040 (1991), with eight 80 bit floating point registers compatible with the 68881/2 coprocessor). Like many other CPUs of the time, the 68000 could fetch the next instruction during execution (a 2 stage pipeline). An instruction prefix (0xF) indicated coprocessor instructions (similar to the 80x86), so the coprocessor could "listen" to the instruction stream, and execute instructions it recognized, without a coprocessor bus.

The 68010 (1982) added virtual memory support (the 68000 couldn't restart interrupted instructions) and a special loop mode - small decrement-and-branch loops could be executed from the instruction fetch buffer. The 68020 (1984) expanded the external data and address busses to 32 bits, and added a simple 3-stage pipeline and a 256 byte cache (loop buffer), with either a segmented (68451?) or paged (68851 - it supported two level pages (logical, physical) rather than the segment/page mapping of the Intel 80386 and IBM S/360 mainframe) memory management unit. The 68020 also added a coprocessor interface. The 68030 (1987) integrated the paged MMU onto the chip. The 68040 (January 1991) added fully cached Harvard busses (4K each for data and instructions, with new MMU), a 6 stage pipeline, and an on chip FPU (subset of the 68882, with some operations emulated).

Someone told me a Motorola techie indicated the 68000 was originally planned to use the IBM S/360 instruction set, but the MMU and architectural differences make this unlikely. The 68000 design was later involved in microprocessor versions of the IBM S/370.

The 68060 (April 1994) expanded the design to a superscalar version, like the Intel Pentium and NS320xx (Swordfish) series before it. Like the National Semiconductor Swordfish, and later the Nx586, AMD K5, and Intel's "Pentium Pro", the third stage of the 10-stage 68060 pipeline translates the 680x0 instructions to a decoded RISC-like form (stored in a 16 entry buffer in stage four). There is also a branch cache, and branches are folded into the decoded instruction stream like the AT&T Hobbit and other more recent processors, then dispatched to two pipelines (three stages: Decode, addr gen, operand fetch) and finally to two of three execution units (2 integer, 1 floating point) before reaching two 'writeback' stages. Cache sizes are doubled over the 68040.

The 68060 also includes many innovative power-saving features (3.3V operation, execution unit pipelines could actually be shut down, reducing power consumption at the expense of slower execution, and the clock could be reduced to zero) so power use is lower than the 68040 (3.9-4.9 watts vs. 4-6 for the 68040). Another innovation is that simple register-register instructions which don't generate addresses may use the address stage ALU to execute 2 cycles early.

The embedded market became the main market for the 680x0 series after workstation vendors (and the Apple Macintosh) turned to faster load-store processors, so a variety of embedded versions were introduced. Later, Motorola designed a successor called Coldfire (early 1995), in which complex instructions and addressing modes (added to the 68020) were removed and the instruction set was recoded, simplifying it at the expense of compatibility (source only, not binary) with the 680x0 line.

The Coldfire 52xx (version 2 - the 51xx version 1 was a 68040-based/compatible core) architecture resembles a stripped (single pipeline) 68060. The 5 stage pipeline is literally folded over itself - after two fetch stages and a 12-byte buffer, instructions pass through the decode and address generate stages, then loop back so the decode becomes the operand fetch stage, and the address generate becomes the execute stage (so only one ALU is required for address and execution calculations). Simple (non-memory) instructions don't need to loop back. There is no translator stage as in the 68060 because Coldfire instructions are already in RISC-like form. The 53xx added a multiply-accumulate (MAC) unit and internal clock doubling. The 54xx adds branch and assignment folding with other instructions for a cheap form of superscalar execution with little added complexity, and uses a Harvard architecture for faster memory access, plus enhancements to the instruction set to improve code density, performance, and to add flexibility to the MAC unit.

At a quarter the physical size and a fraction of the power consumption, Coldfire is about as fast as a 68040 at the same clock rate, but the smaller design allows a faster clock rate to be achieved.


Few people wonder why Apple chose the Motorola 68000 for the Macintosh, while IBM's decision to use Intel's 8088 for the IBM PC has baffled many. It wasn't a straightforward decision though. The Apple Lisa was the predecessor to the Macintosh, and also used a 68000 (eventually - 8086 and slower bitslice CPUs (which Steve Wozniak thought were neat) were initially considered before the 68000 was available). It also included a fully multitasking, GUI based operating system, highly integrated software, high capacity (but incompatible) 'twiggy' 5 1/4" disk drives, and a large workstation-like monitor. It was better than the Macintosh in almost every way, but was correspondingly more expensive.

The Macintosh was to include the best features of the Lisa, but at an affordable price - in fact the original Macintosh came with only 128K of RAM and no expansion slots. Cost was such a factor that the 8 bit Motorola 6809 was the original design choice, and some prototypes were built, but they quickly realized that it didn't have the power for a GUI based OS, and they used the Lisa's 68000, borrowing some of the Lisa low level functions (such as graphics toolkit routines) for the Macintosh.

Competing personal computers such as the Amiga and Atari ST, and early workstations by Sun, Apollo, NeXT and most others also used 680x0 CPUs, including one of the earliest workstations, the Tandy TRS-80 Model 16, which used a 68000 CPU and Z-80 for I/O and VM support - the 68000 could not restart an instruction stopped by a memory exception, so it was suspended while the Z-80 loaded the page. Early Apollo workstations used a similar solution with a second 68000 handling paging.


Motorola:
http://www.mot.com/
Amiga - So the World May Know:
http://www.amiga.com/
Atari compatible Milan computers:
http://www.milan-computer.de/de/index.php3
Atari compatible Medusa computers:
http://www.kingx.com/kingx/medusa/
Sinclair QL homepage:
http://www.uni-mainz.de/~roklein/ql/

Part XII: Intel 8086, IBM's choice (1978)
. . . . . . . . . . . . . . . . . . . .

The Intel 8086 was based on the design of the 8080/8085 (source compatible with the 8080) with a similar register set, but was expanded to 16 bits. The Bus Interface Unit fed the instruction stream to the Execution Unit through a 6 byte prefetch queue, so fetch and execution were concurrent - a primitive form of pipelining (8086 instructions varied from 1 to 4 bytes).

It featured four 16 bit general registers, which could also be accessed as eight 8 bit registers, and four 16 bit index registers (including the stack pointer). The data registers were often used implicitly by instructions, complicating register allocation for temporary values. It featured 64K 8-bit I/O (or 32K 16-bit) ports and fixed vectored interrupts. There were also four segment registers that could be set from index registers.
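
The way one 16 bit register doubles as two 8 bit registers (for example AX, made up of AH and AL) can be modelled with a short Python sketch. The class Reg16 is a made-up illustration, not a description of how any real emulator or assembler represents registers.

```python
class Reg16:
    """One 16-bit general register viewed as two 8-bit halves."""

    def __init__(self, value=0):
        self.value = value & 0xFFFF

    @property
    def high(self):
        # Upper byte, e.g. AH for AX.
        return self.value >> 8

    @high.setter
    def high(self, byte):
        self.value = ((byte & 0xFF) << 8) | (self.value & 0x00FF)

    @property
    def low(self):
        # Lower byte, e.g. AL for AX.
        return self.value & 0xFF

    @low.setter
    def low(self, byte):
        self.value = (self.value & 0xFF00) | (byte & 0xFF)

ax = Reg16(0x1234)
ax.low = 0xFF   # writing the low half leaves the high half untouched
```

Writing to one half changes only that byte of the 16 bit value, which is why the four general registers give eight independently usable 8 bit registers.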

The segment registers allowed the CPU to access 1 megabyte of memory through an odd process. Rather than just supplying the missing address bits, as most segmented processors do, the 8086 actually added the segment register (multiplied by 16, or shifted left 4 bits) to the address. As a strange result of this unsuccessful attempt at extending the address space without adding address bits, it was possible to have two pointers with the same value point to two different memory locations, or two pointers with different values pointing to the same location, and typical data structures were limited to less than 64K. Most people consider this a brain damaged design (a better method might have been that developed for the MIL-STD-1750 MMU).
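
The address arithmetic is easy to try out. The Python sketch below uses a made-up helper, physical_address, to combine a 16 bit segment and a 16 bit offset into a 20 bit address the way the paragraph above describes; the example segment and offset values are arbitrary.

```python
def physical_address(segment, offset):
    """8086 real-mode address: segment shifted left 4 bits plus offset."""
    # The 8086 has only 20 address lines, so the sum wraps at 1 MB.
    return ((segment << 4) + offset) & 0xFFFFF

# Two different segment:offset pairs that name the same byte.
a = physical_address(0x1234, 0x0010)
b = physical_address(0x1235, 0x0000)

# The same 16-bit offset under two segments names different bytes.
c = physical_address(0x1000, 0x0010)
d = physical_address(0x2000, 0x0010)
```

Here a and b are equal even though the pointers differ, while c and d differ even though the 16 bit pointer value is the same. This is the aliasing that made near and far pointers so confusing in higher level languages.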

Although this was largely acceptable for assembly language, where control of the segments was complete (it could even be useful then), in higher level languages it caused constant confusion (ex. near/far pointers). Even worse, this made expanding the address space to more than 1 MB difficult. The 80286 (1982?) expanded the address space only by adding a new mode (switching from 'Real' to 'Protected' mode was supported, but switching back required using a bug in the original 80286, which then had to be preserved). This mode greatly increased the number of segments by using a 16 bit selector for a 'segment descriptor', which contained the location within a 24 bit address space, size (still less than 64K), and attributes (for Virtual Memory support) of a segment.

But all memory access was still restricted to 64K segments until the 80386 (1985), which included much improved addressing: base reg + index reg * scale (1, 2, 4 or 8) + displacement (8 or 32 bit constant) = 32 bit address, in the form of paged segments (using six 16-bit segment registers), like the IBM S/360 series, and unlike the Motorola 68030. It also had several processor modes (including separate paged and segmented modes) for compatibility with the previous awkward design. In fact, with the right assembler, code written for the 8008 can still be run on the most recent Pentium Pro. The 80386 also added an MMU, security modes (called "rings" of privilege - kernel, system services, application services, applications) and new op codes in a fashion similar to the Z-80 (and Z-280).
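
As a worked example of the 80386 addressing form, the following Python sketch computes base + index * scale + displacement. The function name effective_address and its parameter names are chosen for this illustration only, and segment base and paging translation are deliberately left out.

```python
def effective_address(base=0, index=0, scale=1, displacement=0):
    """Compute a 32-bit effective address: base + index*scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    # The result is truncated to 32 bits, as in a 32-bit address calculation.
    return (base + index * scale + displacement) & 0xFFFFFFFF

# Element 3 of an array of 4-byte items that starts 8 bytes into a
# structure located at 0x1000.
addr = effective_address(base=0x1000, index=3, scale=4, displacement=8)
```

The scale factor matches common element sizes, so indexing an array of bytes, words, doublewords or 8 byte items needs no separate multiply instruction.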

The 8087 was a floating point coprocessor which helped define the IEEE-754 floating point format and standard operations (the main competition was the VAX floating point format), and was based on an eight element stack of 80-bit values. An instruction prefix (0xE0) indicated coprocessor instructions (similar to the 68000), so the coprocessor could "listen" to the instruction stream, and execute instructions it recognized, without a coprocessor bus.

The 80486 (1989) added full pipelines, single on chip 8K cache, FPU on-chip, and clock doubling versions (like the Z-280). Later, FPU-less 80486SX versions plus 80487 FPUs were introduced - initially these were normal 80486es where one unit or the other had failed testing, but versions with only one unit were produced later (smaller dies and reduced testing reduced costs).

The Pentium (late 1993) was superscalar (up to two instructions at once in dual integer units and single FPU) with separate 8K I/D caches. "Pentium" was the name Intel gave the 80586 version because it could not legally protect the name "586" to prevent other companies from using it - and in fact, the Pentium compatible CPU from NexGen is called the Nx586 (early 1995). Due to its popularity, the 80x86 line has been the most widely cloned processor family, from the NEC V20/V30 (slightly faster clones of the 8088/8086 (could also run 8085 code)), AMD and Cyrix clones of the 80386 and 80486, to versions of the Pentium within less than two years of its introduction.

MMX (initially reported as MultiMedia eXtension, but later said by Intel to mean Matrix Math eXtension) is very similar to the earlier SPARC VIS or HP-PA MAX, or later MIPS MDMX instructions - they perform integer operations on vectors of 8, 16, or 32 bit words, using the 80 bit FPU stack elements as eight 64 bit registers (switching between FPU and MMX modes as needed - it's very difficult to use them as a stack and as MMX registers at the same time). The P55C Pentium version (January 1997) is the first Intel CPU to include MMX instructions, followed by the AMD K6, and Pentium II. Cyrix also added these instructions in its M2 CPU (6x86MX, June 1997), as well as IDT with its C6.
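
The idea of treating one 64 bit register as a vector of smaller integers can be shown with a small Python sketch. The function packed_add is a made-up software model of a wraparound packed add (the kind of operation MMX instructions such as PADDB and PADDW perform), not a description of the hardware.

```python
def packed_add(a, b, lane_bits):
    """Add two 64-bit values lane by lane, wrapping within each lane."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lanes are 8, 16 or 32 bits wide")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # A carry out of a lane is discarded instead of spilling
        # into the neighbouring lane.
        result |= (lane_sum & mask) << shift
    return result

x = 0x00FF000100020003
y = 0x0001000100010001
as_words = packed_add(x, y, 16)   # four 16-bit lanes
as_bytes = packed_add(x, y, 8)    # eight 8-bit lanes
```

With 16 bit lanes the 0x00FF + 0x0001 lane becomes 0x0100. With 8 bit lanes the same bits are two separate bytes, so 0xFF + 0x01 wraps to 0x00 and the carry is lost.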

Interestingly, the old architecture is such a barrier to improvements that most of the Pentium compatible CPUs (NexGen Nx586/Nx686, AMD K5, IDT-C6), and even the "Pentium Pro" (Pentium's successor, late 1995) don't clone the Pentium, but emulate it with specialized hardware decoders like those introduced in the VAX 8700 and used in a simpler form by the National Semiconductor Swordfish, which convert Pentium instructions to RISC-like instructions which are executed on specially designed superscalar RISC-style cores faster than the Pentium itself. Intel also used BiCMOS in the Pentium and Pentium Pro to achieve clock rates competitive with CMOS load-store processors (the Pentium P55C (early 1997) version is a pure CMOS design).

IBM had been developing hardware or software to translate Pentium instructions for the PowerPC in a similar manner as part of the PowerPC 615 CPU (able to switch between 80x86, 32-bit PowerPC and 64-bit PowerPC instruction sets in five cycles (to drain the execution pipeline)), but the project was killed after significant development for marketing reasons. Rumor has it that engineers who worked on the project went on to Transmeta corporation.

The Cyrix 6x86 (early 1996), initially manufactured by IBM before Cyrix merged with National Semiconductor, still directly executes 80x86 instructions (in two integer and one FPU pipeline), but partly out of order, making it faster than a Pentium at the same clock speed. Cyrix also sold an integrated version with graphics and audio on-chip called the MediaGX. MMX instructions were added to the 6x86MX, and 3DNow! graphics instructions to the 6x86MXi. The M3 (mid 1998) turned to superpipelining (eleven stages compared to six (seven?) for the M2) for a higher clock rate (partly for marketing purposes, as MHz is often preferred to performance in the PC market), and was to provide dual floating point/MMX/3DNow! units. The Cyrix division of National Semiconductor was purchased by PC chipset maker Via, and the M3 was cancelled. National Semiconductor continued with the integrated Geode low-power/cost CPU.

The Pentium Pro (P6 execution core) is a 1 or 2-chip (CPU plus 256K or 512K L2 cache - I/D L1 cache (8K each) is on the CPU), 14-stage superpipelined processor. It uses extensive multiple branch prediction and speculative execution via register renaming. Three decoders (one for complex instructions (up to four micro-ops), two for simpler ones) each decode one 80x86 instruction into micro-ops (one per simple decoder + up to four from the complex decoder = three to six per cycle). Up to five (usually three) micro-ops can be issued in parallel and out of order (five units - integer+FPU ALU, integer ALU, two address, one load/store), but are held and retired (results written to registers or memory) as a group to prevent an inconsistent state (equivalent to half an instruction being executed when an interrupt occurs, for example). 80x86 instructions may produce several micro-ops in CPUs like this (and the Nx586 and AMD K5), so the actual instruction rate is lower. In fact, due to problems handling instruction alignment in the Pentium Pro, emulated 16-bit instructions execute slower than on a Pentium.

The Pentium II (April 1997) added MMX instructions to the P6 core (both ALUs), doubled cache to 32K, and was packaged in a processor card instead of an IC package.

The Pentium III added Streaming SIMD Extensions (SSE) to the P6 core (both ALUs), which included eight 128-bit registers which could be used as vectors of four 32-bit integer or floating point values (like the PowerPC AltiVec extensions, but with fewer operations or data types). Unlike MMX (and like AltiVec), the SSE registers need to be saved separately during context switches, requiring OS modifications.

In June 1998, Intel created two sub-brands of P6 CPUs, low cost (Celeron) and server oriented (Xeon). They differed in amount of cache and bus speeds.

The P7 was first released as the Pentium 4 in December 2000. This equivalent to AMD's K7 (see below) was late due to the decision to concentrate on the development of the IA-64 architecture. Intel used two teams for alternating 80x86 designs, the P5 team started work on the P7, originally a 64 bit version like the AMD K8, while the other team worked on the P6. When the 64-bit P7 was changed to the IA-64, the P6 team started on a scaled down P7 after the Pentium III was finished - meanwhile, Intel sold "overclocked" (small quantities able to run at a higher than designed clock rate) P6 CPUs to compete with the AMD K7, then later updated P6 designs.

The P7 extended the pipeline even further to over 20 stages (or 30 during cache misses), stressing clock speed over execution speed (for marketing reasons) - this led to some questionable design decisions. The three decoders are replaced by a single decoder and a trace cache - similar in concept to the decoded instruction cache of the AT&T Hobbit, but 80x86 instructions often decode into multiple micro-ops, so mapping the micro-ops to memory is more complex, and instructions are loaded ahead of time using branch prediction. This speeds execution within the cache, but the single decoder limits the external instruction stream to one at a time. Long micro-op sequences are stored in microcode ROM and fed to the dispatch unit without being stored in the cache.

There are seven execution units, one FPU/MMX/SSE, one FP register load/store unit, two add/subtract integer units, one logic (shift and rotate) unit, one load and one store unit. The add/subtract units run at double the clock rate, basically as a two stage pipe, allowing two results within a single clock cycle, meaning up to nine micro-ops could be dispatched each cycle to the seven units, but in practice the trace cache is limited to three per cycle. A slower logic unit replaces two faster address units in the P6, slowing most code. Since the stack-oriented FPU registers are difficult to use for superscalar or out-of-order execution, Intel added floating-point SSE instructions (called SSE2), so that floating point operations can use the flat SSE registers which will make future designs easier, and the old FPU design becomes less important.

The bottlenecks might have been a result of rushing the design, or due to cost. As a result, the P7 executing existing code is actually slower than a P6 at slightly lower clock speed, and much slower than the AMD K7, but the intent of the design was to allow clock speed to be increased enough to make up for the difference. Possibly the bottlenecks will be removed in a future version.

A server (Xeon) version of the P7 (March 2002) introduced vertical multithreading (called "Hyperthreading" by Intel), similar to the IBM Northstar CPU (or Sun MAJC) - the main difference being that the Northstar will wait for a cache miss delay before switching threads, while full multithreading used by Intel always interleaves a small number of threads in the normal execution pipeline. It was later expanded to 64 bits (see AMD K8 below).

AMD was a second source for Intel CPUs as far back as the AMD 9080 (AMD's version of the Intel 8080). The AMD K5 translates 80x86 code to ROPs (RISC OPerations), which execute on a RISC-style core based on the unproduced superscalar AMD 29K. Up to four ROPs can be dispatched to six units (two integer, one FPU, two load/store, one branch unit), and five can be retired at a time. The complexity led to low clock speeds for the K5, prompting AMD to buy NexGen and integrate its designs for the next generation K6.

The NexGen/AMD Nx586 (early 1995) is unique by being able to execute its micro-ops (called RISC86 code) directly, allowing optimized RISC86 programs to be written which are faster than an equivalent x86 program would be, but this feature is seldom used. It also features two 16K I/D L1 caches, a dedicated L2 cache bus (like that in the Pentium Pro 2-chip module) and an off-chip FPU (either separate chip, or later as in 2-chip module).

The Nx586 successor, the K6 (April 1997) actually has three caches - 32K each for data and instructions, and a half-size 16K cache containing instruction decode information. It also brings the FPU on-chip and eliminates the dedicated cache bus of the Nx586, allowing it to be pin-compatible with the P54C model Pentium. Another decoder is added (two complex decoders, compared to the Pentium Pro's one complex and two simple decoders) producing up to four micro-ops and issuing up to six (to seven units - load, store, complex/simple integer, FPU, branch, multimedia) and retiring four per cycle. It includes MMX instructions, licensed from Intel, and AMD has designed and added 3DNow! graphics extensions without waiting for Intel's SSE additions.

AMD aggressively pursued a superscalar (fourteen-stage pipeline) design for the Athlon (K7, mid 1999), decoding x86 instructions into 'MacroOps' (made up of one or two 'micro-ops', a process similar to the branch folding in the AT&T Hobbit or instruction grouping in the T9000 Transputer and the Motorola 54xx Coldfire CPU) in two decoders (one for simple and one for complex instructions) producing up to three MacroOps per cycle. Up to nine decoded operations per cycle can be issued in six MacroOps to six functional units (three integer, each able to execute one simple integer and one address op simultaneously, and three FPU/MMX/3DNow! instructions (FMUL mul/div/sqrt, FADD simple/comparisons, FSTORE load/store/move) with extensive stack and register renaming, and a separate integer multiply unit which follows integer ALU 0, and can forward results to either ALU 0 or 1). The K7 replaces the Intel-compatible bus of the K6 with the high speed Alpha EV6 bus because Intel decided to prevent competitors from using its own higher speed bus designs (Dirk Meyer was director of engineering for the K7, as well as co-architect of the Alpha EV4 and EV6). This makes it easier to use either Alpha or AMD K7 processors in a single design. At introduction, the K7 managed to out-perform Intel's fastest P6 CPU.

Centaur, a subsidiary of Integrated Device Technology, introduced the IDT-C6 WinChip (May 1997), which uses a much simpler (6-stage, 2 way integer/simple-FPU execution) design than Intel and AMD translation-based designs by using micro-ops more closely resembling 80x86 than RISC code, which allows for a higher clock rate and larger L1 (32K each I/D) and TLB caches in a lower cost, lower power consumption design. Simplifications include replacing branch prediction (less important with a short pipeline) with an eight entry call/return stack, depending more on caches. The FPU includes MMX support. The C6+ version adds a second FPU/MMX unit and 3D graphics enhancements.

Like Cyrix, Centaur opted for a superpipelined eleven-stage design for added performance, combined with sophisticated early branch prediction in its WinChip 4. The design also pays attention to supporting common code sequences - for example, loads occur earlier in the pipeline than stores, allowing load-alu-store sequences to be more efficient.

The Cyrix division of National Semiconductor and the Centaur division of IDT were bought by Taiwanese motherboard chipset maker Via. The Cyrix CPU was cancelled, and the Centaur design was given the "Cyrix III" brand instead.

Intel, with partner Hewlett-Packard, developed a next generation 64-bit processor architecture called IA-64 (the 80x86 design was renamed IA-32) - the first implementation was named Itanium. It was intended to be compatible in some way with both the PA-RISC and 80x86. This may produce the incentive to let the 80x86 architecture finally fade away.

On the other hand, the demand for compatibility will remain a strong market force. AMD announced its intention to extend the K7 design to produce an 80x86 compatible K8 (codenamed "Sledgehammer", then changed to just "Hammer" - variants indicate market segments, such as "Clawhammer" (keeping the Athlon brand name) for desktops, and "Sledgehammer" (named Opteron) for servers). It produced a 64-bit architecture called x86-64, in competition with the Intel IA-64.

When moving from the 80286 to the 80386 (IA-32), Intel took the opportunity to fix some of the least liked features remaining in the previous design. Moving to x86-64, AMD decided to further modernize the design, adding a cleaner 64-bit mode (selected by Code Segment Descriptor (CSD) register bits).

It's based on sixteen 64-bit integer and sixteen 128-bit vector/floating point (XMM) registers (the lower eight registers of each map to the original x86 integer and SSE/SSE2 registers) and the 8087 FPU/MMX registers, with a 64-bit program counter. In 64-bit mode, integer registers are uniform and can be 8-, 16-, 32-, or 64-bit. Address space is changed from mainly segmented to a flat space (keeping data segment registers in 32- or 16-bit sub-modes) with PC relative addressing, although code segments (within the address space) are still used to define the modes for each segment. Older 8086 modes are supported in a separate "legacy" mode. These changes give compilers a larger, more regular register set to use, making optimizations easier.

Rumors persisted that Intel was developing a CPU codenamed "Yamhill", originally based on the original 64-bit P7 plans dusted off, but then switching to the x86-64 architecture and instruction set (apparently under pressure from Microsoft to avoid creating yet another instruction set to support - ironically making Intel a follower of AMD, after driving 80x86 development from the beginning). Originally it was an unofficial project, then official when performance of the first Itanium disappointed, and K8 popularity exceeded expectations. It was finally released as an "enhanced" Pentium 4 Xeon (March 2004), despite being a new design. The 64-bit capability is designed as a 32-bit add-on (like old bit-slice processors) and is disabled in lower end versions (much like the FPU was disabled in the low-cost 80486SX). When enabled, the extended 32 bit pipeline operates 1/2 clock cycle later than the main pipeline.

It has the same registers, addressing modes and extensions as the AMD K8, but is otherwise similar to the 64-bit P7, with the same pipeline, including double-clocked add/subtract units, though fixing the bottlenecks and using larger caches, buffers, etc. In addition, there are separate SSE2 functional units, rather than using the older FPU units for SSE operations as the P7 did.


So why did IBM choose the 8-bit 8088 (1979) version of the 8086 for the IBM 5150 PC (1981) when most of the alternatives were so much better? Apparently IBM's own engineers wanted to use the 68000, and it was used later in the forgotten IBM Instruments 9000 Laboratory Computer, but IBM already had rights to manufacture the 8086, in exchange for giving Intel the rights to its bubble memory designs. IBM was using 8086s in the IBM Displaywriter word processor (the 8080 and 8085 were also used in other products).

Other factors were that the 8-bit 8088 could use existing low cost 8085-type components, and allowed the computer to be based on a modified 8085 design. 68000 components were not widely available, though it could use 6800 components to an extent. After the failure and expense of the IBM 5100 (1975, their first attempt at a personal computer - discrete random logic CPU with no bus, built in BASIC and APL as the OS, 16K RAM and 5 inch monochrome monitor - $10,000!), cost was a large factor in the design of the PC. Strategists were also not eager to have a microcomputer competing with IBM's low end minicomputers.

The availability of CP/M-86 was also likely a factor, since CP/M was the operating system standard for the computer industry at the time. However, Digital Research founder Gary Kildall was unhappy with the legal demands of IBM, so Microsoft, a programming language company, was hired instead to provide the operating system (initially known at varying times as QDOS, SCP-DOS, and finally 86-DOS, it was purchased by Microsoft from Seattle Computer Products and renamed MS-DOS).

Digital Research did eventually produce CP/M 68K for the 68000 series, making the operating system choice less relevant than other factors.

Intel bubble memory was on the market for a while, but faded away as better and cheaper memory technologies arrived.


Intel Corporation:
http://www.intel.com/
Intel Product Info:
http://www.intel.com/intel/product/index.htm
AMD, Inc.:
http://www.amd.com/
AMD PC Processors Plus:
http://www.amd.com/products/cpg/cpg.html
Cyrix:
http://www.cyrix.com/
Centaur:
http://www.centtech.com/
IBM's PC CPU decision (near beginning of page):
http://www.cs.mu.oz.au/313/stories
An Interview with the Old Man of Floating-Point:
http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

Busses used on x86 architectures

Much of this section is taken from Phil Storr's (pstorr@iweb.net.au) page on PC busses, which may be found at http://members.iweb.net.au/~pstorr/pcbook/book2/busses.htm

The PC external Bus slots

A brief history of the PC busses

In the early days of microprocessor chip based microcomputers, all microcomputers were built using their own proprietary bus designs. Before long someone had the bright idea that if designers used the same design specifications, you could build a computer out of "boards" from different companies. This idea created the S-100, a bus that is still in use today in some areas. About the middle of 1977 Apple used an expansion bus on their Apple II, and its success set the stage for the desk top computers that followed, including the IBM PC when it appeared in 1981. The original PC used Intel's 8088 processor, which was a 16-bit CPU that spoke to the world through an 8-bit data path. The original PC 8-bit bus slot is still used by some simple I/O cards today. The 8088 ran at 4.77MHz, which was fine for the expansion cards, and running the expansion slots at the same clock speed as the CPU made the system boards easier to design and cheaper to build.

The IBM AT introduced a 16 bit data bus and the expansion slots had to handle 16 data bits. The industry wanted to be able to use existing 8-bit cards, so the new "AT" slot had to be designed to be backward compatible with the PC slots. The AT extension connector was added to the end of the 62 pin edge connector of the original 8-bit bus slot. This extension is a 36 pin edge connector. This bus slot was later given the name Industry Standard Architecture (ISA) and has survived to this day. One important aspect of this bus was that IBM never made any specification about bus speeds.

In the original 6MHz IBM AT, and the subsequent 8MHz version, the bus simply ran along at the same speed as the CPU. It was not surprising that as clone vendors started looking for a marketing edge over IBM, they simply kept the bus running at the CPU speed as they boosted speeds to 10MHz, 12MHz, and even faster. This led to problems for users. Boards that ran fine in a 6 or 8 MHz computer were not reliable in faster ones. The problem was especially severe with network cards, which turned out to be unable to run at these higher clock speeds. The industry eventually settled on 8MHz as the standard maximum clock speed and the name Industry Standard Architecture.

Proprietary Bus problems

Everything was fine until Intel made the 80386 available. Here was a processor that could access the world in 32-bit chunks, so how should the industry provide for the wider data path? In particular, the data bus to the RAM needed to be 32 bits wide in order to take advantage of the 386 processor's wider data bus. Up to this point many DOS computer systems had some of their RAM on "expansion cards" plugged into the bus slots, and the ISA bus limited such RAM to being only 16 bits wide and to an access speed of 8MHz.

One answer was to put the system memory on a local bus with the processor on the system board. The memory could be connected directly to the processor's data bus, with no buffer devices between it and the processor. This way it could be 32 bits wide and accessed at the processor's clock speed. At this stage in the development of RAM technology the industry was still using DIL package RAM chips of 256 Kbit or 1 Mbit capacity, and it took a lot of system board real estate to fit in more than a few megabytes of RAM.

Many companies decided to make special 32-bit expansion slots for proprietary memory boards that could be added later. This is where we can learn a lesson: many owners of computers with these system boards soon discovered that they could not find these proprietary boards for their computers only a few months after they purchased the computer. Many manufacturers realized that a standard 32-bit bus was a better answer than many proprietary designs, and if it could run at the processor bus speed that would be even better.

Micro Channel Architecture (MCA) and Extended Industry Standard Architecture (EISA)

IBM tried to regain control of the PC market with its PS/2 range and the Micro Channel (MCA) bus. The response from a number of influential clone manufacturers was to get together and design the Extended Industry Standard Architecture (EISA) bus, providing a 32-bit data path. The advantage of the EISA design over Micro Channel was that it remained backward compatible with ISA boards, right back to 8-bit cards. The cost of computers using the MCA or EISA buses was high, and so these new buses failed to get much of a market share. Computer purchasers went on buying more ISA bus machines than anything else.


The bottom card is an MCA card, the top card is an ISA bus card


Here is an ISA bus card sitting on top of an EISA bus card

The need for speed

Eventually a point was reached where the ISA bus just was no longer fast or wide enough. Windows 3.1 raised user expectations for high-resolution graphics displays and more than VGA's original 16 colours. The images produced required far more data than a simple text screen, and so performance was unacceptable as the computer tried to squeeze megabytes of information per second through a 16-bit bus running at 8MHz.

The video system was not the only part that would have benefited from a faster, wider bus: faster hard drives, hard drive interfaces and network interface cards had also outgrown the ISA bus. One solution was to design a faster bus for video and other components. Bringing the bus slot speed up to the then typical bus clock speed of 33 MHz would provide a four-fold increase in data transfer rate. Double the width of the data bus from 16 to 32 bits, and the transfer rate could be up to eight times that of the ISA bus.
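
The four-fold and eight-fold figures can be checked with a few lines of arithmetic. The sketch below assumes the simplest possible model, one transfer per clock cycle, so the results are theoretical peaks and not measured throughput; the function name `peak_rate_mb_s` is made up for this illustration.

```python
def peak_rate_mb_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Theoretical peak in Mbytes/s: bytes per transfer times transfers per second."""
    return (width_bits / 8) * clock_mhz * transfers_per_clock

isa_like = peak_rate_mb_s(16, 8)     # 16-bit slot clocked at 8 MHz
faster = peak_rate_mb_s(16, 33)      # same width, 33 MHz clock
local_bus = peak_rate_mb_s(32, 33)   # 32-bit slot clocked at 33 MHz

print(faster / isa_like)     # about 4x from the clock speed alone
print(local_bus / isa_like)  # about 8x from clock speed and width together
```

A real ISA bus needs more than one clock cycle per transfer, which is why its quoted transfer rate is lower than this simple model suggests.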

Some designers started by simply wiring video circuitry into the CPU bus on system boards. The system board already had a "local bus" between the processor and its RAM, and this could be extended to include the video interface. This provided speed gains, but at the cost of flexibility. If you wished to upgrade the video system, all you could do was disable the video on the system board and resort to an ISA card in a bus slot.

The next solution was a throwback to the proprietary 32-bit memory cards of the early 386 systems. Designers created their own unique solutions for local bus video slots. This approach left the buyer dependent on the original vendor to develop and offer new video options as technologies changed and improved. At the rate PC hardware was developing, this rarely happened.

The VESA LOCAL BUS - a bus built by a committee


A VESA bus Video card

The problem was first solved by the Video Electronics Standards Association. This is the group that made sense out of the mayhem that occurred when vendors tried to go beyond IBM's original specification for VGA. When you wanted to run a system at higher than VGA's original specification of 640 by 480 you had to get drivers that worked with your application programs and hardware.

The VESA standards for Super VGA signal timing and resolutions sorted out much of this trouble. The committee set some basic goals for a local bus specification. It had to be low cost, based on existing technology and system chip sets as much as possible. It had to offer significantly higher performance, handling not only the present data transfer loads, but the additional traffic expected from even higher resolution displays and multimedia applications. It had to be an open standard, so anyone could use it, and it had to be software transparent, so you would not need to use any troublesome drivers. It also had to be extendible to handle future technology, such as the Pentium processor with its 64-bit data path.

The result was the VESA-Bus specification. This set forth the basic characteristics of the bus, such as mechanical, physical, timing, and protocol details. For maximum flexibility, it was designed in such a way that it could easily be added to ISA, EISA, or Micro Channel system boards. To keep the design simple, the committee designed the VESA-Bus as an extension of the internal bus used within the 80486 processor. As a result, the VESA-Bus could use the full address range of the 486 chip.

Many VESA slot equipped computer systems used a VESA, IDE/FDC/SPG interface card. The IDE interface on this card was the only part of the card that used the VESA-bus slot. The Floppy Disk Controller and the SPG functions still used the ISA portion of the slot.

Local Bus devices can be implemented with devices either integrated into the system board, or plugged into an expansion slot. The problem with integrated devices is the higher bus clock speeds push technology to its engineering limits. As the signals travel around the traces of printed circuit boards faster and faster, it is more and more difficult to maintain accurate timings. If an electrical signal is slowed too much on the way to its destination, then critical events may not take place at the correct instant, and processing crashes to a halt.

The faster the CPU runs, the smaller the load it can handle (the load on its outputs). Sending a signal through an expansion slot rather than to a device located on the system board adds to the load on the bus. The VESA committee recommended only two VESA-Bus slots and two VESA-Bus devices (system board mounted devices) with a 33MHz (or slower) bus speed, one slot at 40MHz, and no slots at all for a 50MHz bus speed.

The VESA Bus connector

A Micro Channel connector was used, placed in line with the existing ISA expansion slots. This layout meant the ISA slot was still available if the slot was not occupied by a VESA-Bus adaptor. The VESA-Bus devices did not make any use of the original ISA bus (except for the power connections) but instead got all the signals necessary from the VESA connector. This meant the original ISA bus connections were available to the expansion card for other purposes. Manufacturers were able to put multiple devices on a single expansion card, such as an IDE Hard Drive interface on the Local Bus and Floppy Disk Controller and SPG I/O functions on the ISA portion of the slot.

The VESA Bus was soon obsolete

In 1993 plans were under way to develop the VESA-Bus to include a 64 bit version for the Pentium and to increase the number of devices that could be put on the Bus. The success of the PCI-Bus put an end to any further development. The VESA-Bus served the PC industry well for about two years but it faded away as more and more system boards offered only PCI local bus slots.

How much faster is the VESA-Bus ?

The theoretical transfer rate for the ISA bus is about 5 Mbytes per second, and for the EISA-Bus about 32 Mbytes per second. A VESA Bus with a 33MHz CPU clock speed could provide transfer rates of 132 Mbytes per second (32 bits, or 4 bytes, on each of 33 million clock cycles per second), about 26 times the rate of the ISA-Bus. The VESA-Bus was simply referred to as the Local-Bus by some authors and vendors.

THE PCI LOCAL BUS - a bus built by INTEL


A typical PCI video interface card


A PCI video card in a PCI Bus slot on a 686 system board

The PCI-Bus (Peripheral Component Interconnect) was originally designed to speed up the display of graphics on Intel-based personal computers, but the standard itself is processor independent and suitable for other hardware add-ons that require high bandwidth, including network, video and SCSI adaptors. PCI was developed by INTEL but it did take some time to get it to work reliably. By the middle of 1993 the VESA-Bus had become firmly entrenched in the market place and almost all DOS computer systems had VESA-Bus slots as standard. The wide acceptance of local bus technology took only a few months and, by default, the VESA-Bus became the first Local Bus standard.

For a while, many people in the computer industry saw a local-bus war between the two competing local-bus standards (VESA-Bus and PCI-Bus) but in reality they were not in the same battlefield. The PCI and VESA Local-Busses did basically the same thing - both speed up PC computers by letting peripherals like graphics adaptors and hard disk controllers run at up to 33MHz, instead of the 8MHz that the ISA-Bus limited them to. The similarity breaks down when we start talking about how the two designs work.

The VESA-Bus bypassed the ISA bus by using the same bus that connects the CPU to its RAM, and so it was relatively cheap and easy for system and peripheral makers to implement. Intel's PCI-Bus, on the other hand, was a whole new bus, in much the same way the EISA and MCA busses were. The PCI bus gave only a slight speed improvement when used with 486 based systems, but it was far ahead when used with the Pentium chip.

Some more technical details of the PCI bus

The PCI-Bus has some attractive features, such as concurrent bus-mastering, a full burst mode, and a type of pipe lining queue that can reduce the number of potential wait states compared to the VESA-Bus design.

The PCI-Bus uses three elegant techniques to resolve local bus problems. The first, known as reflective wave signaling, reduces the amount of electrical amplification required on the signal paths and thus reduces noise and loading problems. The second is multiplexing. Multiplexing allows two different signals to use the same electrical path, reducing the number of pins required for peripheral chips and lowering manufacturing costs. The third is a protocol letting the PCI controller receive specific configuration information from the PCI devices themselves. Intel did not define a standard adaptor connector for the bus, leaving that job up to a PCI-Bus special-interest group who settled on the white 112 pin connector.

PCI the Universal Bus

PCI is platform independent and was soon used in computers built around the PowerPC chip. This is one of the few times a standard I/O bus has been used across platforms, which has to be a big feature in its favor. The various companies involved in the PowerPC development, including Apple and IBM, adopted the PCI-Bus for PowerPC based computers. Apple had been using the Macintosh NuBus for many years, but switched to the PCI-Bus for its PowerPC products. It is ironic that the largest user of Motorola based processors lined up to buy bus technology from Intel.

Other computer manufacturers are also using the PCI-Bus in their computer platforms: Digital Equipment Corp. (DEC) with its Alpha RISC-based systems, Hewlett-Packard and SUN Microsystems all include PCI-Bus slots in their products. Intel licensed its patents on the PCI Bus free of royalties to all who wished to use it.

By adopting an established industry standard the manufacturers of the other computer platforms are ensuring lower costs and more options for both users and developers who are no longer locked into their own proprietary options. The wide range of cards that have followed the use of the PCI-Bus on PC systems are available for the first time to users of other hardware. All that should be required is alternative driver software for the various platforms.

Multi Bus architecture

Multi Bus System boards helped to overcome problems in the continuing evolution of the DOS computer buses. System boards with VESA-Bus slots are dual bus boards as they have ISA and VESA-Bus slots by the design of the VESA-Bus standard.

Many combinations of the various buses that have been available over the years are possible and some system board manufacturers produced boards with combinations of ISA, EISA, MCA, VESA and PCI-Bus. This was to allow users to make use of older exotic cards such as SCSI controllers and hardware cache boards in upgraded equipment.

Most system boards available today still have two ISA bus slots but there are PCI bus slot only boards, and EISA and PCI only boards available.

What else is wrong with the good old ISA bus ?

The ISA bus has 24 address lines, which limits I/O cards to the first 16 Meg of addressable memory space for RAM or ROM on the card. Some specialized video cards and video capture cards look for a memory aperture (also known as a linear frame buffer), a hole in system memory, where they can insert and address their own continuous 1 or 2 Megabytes of video memory. This memory aperture overcomes the problem of page switching brought about by the assignment of only a 128 Kbyte area for the Video RAM. Remember that most VGA video cards have at least 1 Megabyte of Video RAM on the card, but they must access it by switching parts of the RAM in and out of the assigned memory range.
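
Both numbers in this paragraph follow from simple arithmetic, which the sketch below works through. It uses the sizes quoted in the text (24 address lines, a 128 Kbyte window, 1 Megabyte of Video RAM); the function name `banked_access` is invented for this illustration, and real cards select the bank through card-specific registers that are not modelled here.

```python
ISA_ADDRESS_LINES = 24
WINDOW_SIZE = 128 * 1024        # the 128 Kbyte area assigned to Video RAM
VIDEO_RAM_SIZE = 1024 * 1024    # 1 Megabyte of Video RAM on the card

# 24 address lines can select 2**24 distinct byte addresses, i.e. 16 Meg.
isa_address_space = 2 ** ISA_ADDRESS_LINES

# The whole Video RAM only fits through the window one bank at a time.
bank_count = VIDEO_RAM_SIZE // WINDOW_SIZE

def banked_access(video_offset):
    """Return (bank to switch in, offset inside the window) for a Video RAM offset."""
    if not 0 <= video_offset < VIDEO_RAM_SIZE:
        raise ValueError("offset is outside the Video RAM")
    return divmod(video_offset, WINDOW_SIZE)

print(isa_address_space // (1024 * 1024))  # size of the ISA address space in Meg
print(bank_count)                          # number of banks to page through
print(banked_access(300000))               # bank and in-window offset for one pixel
```

With a linear frame buffer the `divmod` step disappears: the offset is used directly, which is exactly the page switching that the memory aperture avoids.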

The Characteristics of the various busses
Bus type	Bus data width	Bus speed	Data transfer rate (Mbytes/s)
PC/XT	8 bits	4.77 - 8 MHz	3.25
ISA	16 bits	8 MHz	6.5
EISA	32 bits	8 MHz	32
MCA	32 bits	8 MHz	20
VESA	32 bits	33 MHz to 50 MHz	132 and above
PCI	32 bits	33 MHz	132

The Accelerated Graphics Port (AGP Bus)

AGP was announced, and started to find its way into "top end" System Boards, during the last few months of 1997. It is called a Port rather than a Bus because it is intended for one particular purpose, instead of being a universal bus slot. AGP is based on the latest PCI specification (ver 2.1), running at 66 MHz instead of 33 MHz like all existing PCI Bus cards, and having three extensions to the PCI specification. These extensions are:
  1. Pipelined memory read/write operations
  2. Demultiplexing of address and data on the bus
  3. Timing for data transfer rate as if clocked at 133 MHz
Due to the high data transfer rate between the graphics accelerator and main memory, AGP enables graphic accelerators to use main memory in addition to memory on the Video card.

This memory is referred to as AGP Memory. AGP in theory allows a peak data transfer rate of up to 528 Mbytes/second between the PC's main memory and the AGP graphics accelerator, compared to a transfer rate of only 132 Mbytes/second attainable by today's PCI bus. Doubts exist about this claim because this figure is the whole bandwidth of main memory and it has to be shared with CPU and other devices. AGP may never be able to get a throughput of 528 MB/s, but the trend to 100MHz bus speeds will speed main memory transfers and make this more likely.
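
The 132 and 528 Mbytes/second figures can be reproduced from the bus width and clock rate. The sketch below is only arithmetic on the nominal numbers given in the text (32 bits wide, 33 MHz or 66 MHz, and two transfers per clock for the "as if clocked at 133 MHz" timing); the function name is illustrative, and none of this says anything about the throughput a real system achieves.

```python
BUS_WIDTH_BYTES = 4  # PCI and AGP are both treated as 32 bits wide here

def peak_mb_s(clock_mhz, transfers_per_clock):
    """Theoretical peak transfer rate in Mbytes/s."""
    return BUS_WIDTH_BYTES * clock_mhz * transfers_per_clock

pci = peak_mb_s(33, 1)      # PCI: 33 MHz, one transfer per clock
agp_1x = peak_mb_s(66, 1)   # AGP clock alone doubles the PCI figure
agp_2x = peak_mb_s(66, 2)   # two transfers per clock, "as if" 133 MHz

print(pci, agp_1x, agp_2x)  # 132 264 528
```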

Like most other modern PC developments, the chipset has to provide services for the AGP bus, in particular, the function to map the 'AGP memory' to normal main memory. Intel calls this GART (Graphics Address Remapping Table). This means the Video Interface can use some of the System Memory rather than having dedicated Video RAM on the card.
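
The idea behind a remapping table can be shown with a toy model. This is a conceptual sketch only: the 4 Kbyte page size, the aperture base address, the page frame numbers and the class name `Gart` are all assumptions chosen for illustration, and the table format used by any real chipset is not shown here.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

class Gart:
    """Toy remapping table: contiguous AGP addresses -> scattered physical pages."""

    def __init__(self, aperture_base):
        self.aperture_base = aperture_base
        self.table = []  # entry i = physical page frame backing aperture page i

    def map_page(self, physical_frame):
        self.table.append(physical_frame)

    def translate(self, agp_address):
        offset = agp_address - self.aperture_base
        if offset < 0:
            raise ValueError("address is below the aperture")
        page, within_page = divmod(offset, PAGE_SIZE)
        if page >= len(self.table):
            raise ValueError("address is outside the mapped aperture")
        return self.table[page] * PAGE_SIZE + within_page

gart = Gart(aperture_base=0xE0000000)      # hypothetical aperture location
for frame in (0x1234, 0x0042, 0x0777):     # hypothetical scattered main-memory pages
    gart.map_page(frame)

# The graphics chip sees one contiguous block; main memory does not have to be.
print(hex(gart.translate(0xE0000000 + PAGE_SIZE + 0x10)))
```

The point of the design is that the operating system can hand the graphics accelerator whatever free pages it has, while the accelerator still addresses them as a single continuous region.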

The benefits AGP is offering:

  1. Higher bandwidth than PCI, up to 4 times as high
  2. No sharing of bandwidth with other components like the PCI bus
  3. DIME (direct memory execution) of textures
  4. CPU accesses to system RAM can proceed concurrently with the graphics chip's AGP RAM reads. This allows the CPU to write directly to shared system AGP memory when it needs to provide graphics data, such as commands or animated textures. Generally the CPU can access main memory more quickly than it can access graphics local memory via AGP, and certainly faster than via the PCI bus.

Software Considerations
Unfortunately, getting an AGP board plus an AGP graphic accelerator won't be enough to take advantage of AGP's new performance. The operating system has to take care of the DIME/GART part of the AGP benefits in particular: it has to provide main memory for the AGP RAM. This is achieved via DirectDraw in Windows98 and Windows NT 5.

Example of a Pentium II System Board with an AGP socket

Serial Ports

PC Hardware these days comes equipped with two Serial Ports. IBM originally called these Communications ports and so you will find them more often referred to as COM ports.

Serial Ports can be used for:

UART (Universal Asynchronous Receiver Transmitter)

This is the heart of the Serial Port with this device performing the parallel to serial and serial to parallel conversions and providing the Hand Shaking between the two devices connected together.

The history of the UART chip
Over the years, since the introduction of the DOS computer, three types of UARTS have been used in this hardware. The first was the 8250 chip, this was followed by the 16450 chip, and then the 16550 chip.
UART type	Max. Data Rate
8250	9600 bits/second
16450	30K bits/second
16550	>100K bits/second

The 8250 chip was used in the Serial Ports of PC or XT computers, and the 16450 in the Serial Ports of 286 (AT) and then 386, and 486 machines, until early 1995. Over the years the maximum data rate provided by devices connected to the Serial Ports has been steadily rising. Back in 1987 a 2.4Kbits/second Telephone Modem was considered fast. The most cost effective Telephone Modems today are transferring data at as fast as 56Kbits/sec with 33.6Kbits/sec modems being phased out rapidly. The Serial Ports must keep up with the modem and therefore the UART must be faster than the modem.

UART stands for UNIVERSAL ASYNCHRONOUS RECEIVER TRANSMITTER
The main job of a UART is to convert the computer's parallel data from the bus into a serial flow for transmission and when information is being received, the UART collects it into bytes (8 bits) and passes those bytes onto the bus. The UART provides the shift registers for parallel to serial and for serial to parallel conversions and all the Flow Control (hand shaking) required to control the flow of data to and from the computer and some other device.
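
The parallel to serial conversion can be sketched in a few lines. This is a software model of what the shift registers do, not a description of any particular chip: it assumes the usual asynchronous frame of one start bit (0), eight data bits sent least significant bit first, an optional parity bit and one stop bit (1), and the function names are made up for this illustration.

```python
def frame_byte(value, parity="N"):
    """Return the bits placed on the line for one byte, in transmission order."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in 8 bits")
    data = [(value >> i) & 1 for i in range(8)]  # least significant bit first
    bits = [0] + data                            # start bit, then the data bits
    if parity == "E":
        bits.append(sum(data) % 2)               # make the count of 1s even
    elif parity == "O":
        bits.append((sum(data) + 1) % 2)         # make the count of 1s odd
    elif parity != "N":
        raise ValueError("parity must be N, E or O")
    bits.append(1)                               # stop bit
    return bits

def unframe(bits):
    """Rebuild the byte from a 10-bit frame (8 data bits, no parity, 1 stop bit)."""
    if len(bits) != 10 or bits[0] != 0 or bits[-1] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))

frame = frame_byte(0x41)   # the letter 'A'
print(frame)
print(hex(unframe(frame)))
```

Each byte costs ten bit times on the line in this format, which is why a 9600 bits/second link carries at most 960 bytes per second.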

What does the UART chip look like ?
The 16550 is pin-compatible with its predecessors and so it can be used as a 16450 or 8250 replacement. The older I/O cards found in DOS computer hardware had a 40 pin DIL UART chip mounted in a socket, so upgrading was simply a matter of replacing the chip. However, a 40 pin 16550 chip usually costs more than a new I/O card, so it is not economical to replace the UART chip on old I/O cards. Some SPG cards had a 16450 chip soldered in, making it almost impossible to replace. Another chip you will sometimes find on old I/O cards is the 16451. This chip is a 16450 with a Parallel Interface as well as the UART.

Starting with the SPG and IDE/FDC-SPG cards used in 386 and 486 hardware, the UARTs were part of an ASIC chip. This was a custom VLSI chip that contained 2 UARTs, the Parallel Port, Games Port, and often the Floppy Disk Drive Controller.

Modern PC hardware has the UARTs built into the System Board's chip-set, and all of these provide 16550 type UARTs, capable of data rates in excess of 100Kbits/second. Diagnostic software is available that will detect the type of UARTs fitted.

The TTL logic levels from the UART device require Line Driver chips to convert the output signals to RS232 levels, and other Line Driver chips to convert the RS232 level input signals back to TTL logic levels. Today these drivers are often built into the ASIC chip or the chip-set, but the 1488 and 1489 line driver devices were used for many years. The line drivers require + and - 12 to 15 Volt supplies. Alternative Line Driver chips are available that generate the + and - 12 Volts inside the chip, and these are used in note-book type computers. The Line Driver chips often fail due to nearby lightning strikes and ground potential faults.

How many wires do we need for a serial connection ?

An RS232 connection can be made with as few as 2 wires and a ground wire, but a full implementation of the standard uses 9 wires. The Serial Ports on the back of the DOS computer use either a DB9P or a DB25P plug, and the maximum cable length for unshielded cables is about 30 meters. If you need to put a printer some distance from the computer it is being used with, it is necessary to use a Serial Interface rather than a Parallel Interface. The serial interface is not as prone to electrical interference from power wiring.

The I/O assignments used for the Serial Ports:

The Serial Port requires a small range of I/O addresses and an IRQ line. The original DOS assignments were like this.
COM Port	I/O Address	IRQ
COM 1	3F8 to 3FF	IRQ 4
COM 2	2F8 to 2FF	IRQ 3

With the introduction of DOS version 3.1 provision was made to have two more Serial Ports and the resources assigned to these were:
COM Port	I/O Address	IRQ
COM 3	3E8 to 3EF	IRQ 4
COM 4	2E8 to 2EF	IRQ 3

The problem with the extra two Serial Ports is that they do not have unique IRQ lines assigned to them and some hardware and/or software is not good at sharing such resources. Specialized Serial Interface Cards are available that provide four or eight Serial Ports and these are intended for use in UNIX systems and they may not have driver software for DOS systems. These cards are often used for "point of sale" computers in installations like Service Stations and Supermarkets.
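
The clash is easy to see once the assignments are written down as data. The sketch below uses the I/O base addresses and IRQ lines from the tables above; the function name `shared_irqs` and the dictionary layout are just choices made for this illustration.

```python
from collections import defaultdict

# name: (I/O base address, IRQ line), taken from the DOS assignments above
COM_PORTS = {
    "COM1": (0x3F8, 4),
    "COM2": (0x2F8, 3),
    "COM3": (0x3E8, 4),
    "COM4": (0x2E8, 3),
}

def shared_irqs(ports):
    """Return {irq: [port names]} for every IRQ line used by more than one port."""
    by_irq = defaultdict(list)
    for name, (_base, irq) in ports.items():
        by_irq[irq].append(name)
    return {irq: names for irq, names in by_irq.items() if len(names) > 1}

# Every port has its own block of eight I/O addresses...
bases = [base for base, _irq in COM_PORTS.values()]
print(len(set(bases)) == len(bases))

# ...but the IRQ lines are shared in pairs.
print(shared_irqs(COM_PORTS))
```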

Using internal modems

Modern PC hardware has 2 Serial Ports, and these can be configured as any of the four available COM Ports. If you wish to use an Internal Modem on either COM 1 or COM 2 you must turn off the corresponding COM Port on the System Board (or I/O card with older hardware), or configure it as one of the other COM ports. An Internal Modem has a Serial Interface built into the Modem card and must be configured so it does not clash with other hardware in the system. You can often configure an External Modem as COM 3 or COM 4 but watch out for IRQ 4 and IRQ 3 clashes. Some ports and software will allow two COM ports to share the same IRQ line. Note: Modern PC System Boards provide control over the "on board" Parallel and Serial Ports via the CMOS setup routines.

Using a printer with a Serial Interface on a PC

Most application software configures the Serial Port's UART for Data Rate, Word Size and Parity. If you wish to use a Serial Port at the DOS level, you must use the DOS external command called MODE to do this.

If you wished to use a printer with a serial interface instead of one with a parallel interface when running DOS applications, you would have to add these 2 lines to the AUTOEXEC.BAT file. This is not required when using an MS Windows Operating System.

MODE COM1: 9600,N,8,1,P
MODE PRN = COM1: (The alternative to this line is MODE LPT1: = COM1:)

The actual values in the first line above will depend on the parameters required by the printer. This information is obtained from the printer handbook and will be similar to the listing below.

Bit rates with the latest UART devices used in PC computers can be far higher than 9600, with speeds of 19.2K, 28.8K, 33.6K and 56K being used today.

The "P" tells the Serial Port Service Routine to keep waiting for the device on the other end, and not to time out after a predetermined time. This is necessary if a slow device like a printer is connected.
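
To make the meaning of each field in a settings string such as 9600,N,8,1,P concrete, here is a small parser. It is a sketch for reading the notation, not a reimplementation of the MODE command: the function name and the dictionary keys are invented, and it only accepts whole-number stop bits.

```python
def parse_mode_settings(text):
    """Split a settings string such as '9600,N,8,1,P' into named fields."""
    fields = [field.strip().upper() for field in text.split(",")]
    if len(fields) not in (4, 5):
        raise ValueError("expected: rate,parity,data bits,stop bits[,P]")
    rate, parity, data_bits, stop_bits = fields[:4]
    if parity not in ("N", "E", "O"):
        raise ValueError("parity must be N, E or O")
    return {
        "bit_rate": int(rate),
        "parity": parity,
        "data_bits": int(data_bits),
        "stop_bits": int(stop_bits),
        # A trailing P means: keep waiting for the device instead of timing out.
        "wait_for_device": len(fields) == 5 and fields[4] == "P",
    }

settings = parse_mode_settings("9600,N,8,1,P")

# One start bit is always added on the line, so 8 data bits with no parity
# and 1 stop bit cost 10 bit times per byte.
bits_per_byte = 1 + settings["data_bits"] + (settings["parity"] != "N") + settings["stop_bits"]
print(settings["bit_rate"] // bits_per_byte)  # 960
```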

Providing more than two Serial Ports

The problem
While each of the four Serial Ports defined here has its own unique I/O addresses, only two IRQs are assigned. PC hardware and software are not good at sharing IRQ lines, so these assignments may lead to devices on the same IRQ line interfering with one another in a way that hinders the proper operation of one or both of them. This problem should be overcome with the full introduction of Plug and Play technology, but until that happens in both the hardware and the operating systems, we can expect trouble when providing more than two Serial (Communication) Ports.
COM Port	I/O address (hex)	IRQ
1	3F8	IRQ4
2	2F8	IRQ3
3	3E8	IRQ4
4	2E8	IRQ3

Why would you want more than two Serial Ports ?
Serial Ports are used for a wide range of I/O functions, and computers used in CAD/CAM installations, for example, may have a Mouse, a Digitizer, a Plotter and a Modem fitted. The Mouse can make use of the PS/2 Mouse port fitted to most modern System Boards, but this uses one of the available IRQs and still leaves us with three Serial Ports required.

Another situation where more than two Serial Ports may be required is where multiple modems are needed or devices like Bar Code Readers are in use. This is common if a PC is used as a "point of sale" terminal, and with the cost of PCs being so low, they are often the most cost effective way of providing these facilities.

Overcoming the problem
The easiest way to overcome the lack of IRQ assignments is to move one port of each sharing pair (COM1 or COM3, and COM2 or COM4) to an alternative IRQ line. Modern PC hardware has the Serial Ports provided by the Chipset built into the System Board, and you can change the I/O addresses and IRQs assigned to these ports from the CMOS setup routines. Many older SPG and FDC/IDE-SPG cards had jumpers to select the I/O addresses and IRQ lines for each port and to turn each I/O function off. You could provide extra Serial Ports, selecting alternative I/O addresses and IRQs, using one of these cards. There is one problem: most of these cards were fitted with 16450 UARTs, which are too slow for modern Telephone Modems. You would have to use these other Serial Ports for devices like the Mouse and a Digitizer.

Serial Port cards are available that provide one or two extra Serial Ports at all possible I/O addresses and IRQs via jumpers or some form of Soft-setup, but these are quite expensive for what they are.

Possible available IRQ assignments

The IRQ lines are not usually required by the Parallel Ports and will usually be available for other uses. IRQ5 was used for the Hard Disk Controller in an XT type (8 bit bus) computer but is available in modern PC Computers. This means IRQ5 and IRQ7 may be available.

Common default assignments to look out for

Which IRQ lines may be available ?
From what has been said above, you may have IRQ9, IRQ5 and/or IRQ7 available for use with COM3 and COM4. Watch out for what IRQ9 is actually called. In a DOS computer with a 16 bit bus, the real IRQ2 is used to cascade a second interrupt controller device, and the IRQ9 input on this second device is wired to IRQ2's place on the ISA bus. Windows wants to call this hardware interrupt IRQ9, but some software is quite happy if it is called IRQ2.

In the past IRQ10, IRQ11, IRQ12 and IRQ15 have been available, but recent advances in PC technology have led to these being assigned to standard uses. IRQ10 is often used by Sound Cards, IRQ11 by Network Interface cards, and IRQ12 is used if the PS/2 Mouse Port fitted to most System Boards is in use.

With the introduction of the second IDE interface channel these days, usually used for interfacing to a CDROM Drive, IRQ15 is assigned to this channel and is no longer available.
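
Choosing IRQ lines for COM3 and COM4 then comes down to crossing off what is already taken. The sketch below encodes the typical assignments described in this section for a hypothetical machine; the actual usage on any given computer must be checked, and the function name `assign_extra_ports` is invented for this illustration.

```python
# Typical assignments described above, for a hypothetical machine.
IRQS_IN_USE = {
    3: "COM2",
    4: "COM1",
    10: "sound card",
    11: "network interface card",
    12: "PS/2 mouse port",
    15: "second IDE channel",
}

# Lines the text suggests may be free: the two Parallel Port IRQs and IRQ9.
CANDIDATE_IRQS = [5, 7, 9]

def assign_extra_ports(port_names, candidates, in_use):
    """Give each extra port its own IRQ line from the candidates still free."""
    free = [irq for irq in candidates if irq not in in_use]
    if len(free) < len(port_names):
        raise ValueError("not enough free IRQ lines")
    return dict(zip(port_names, free))

print(assign_extra_ports(["COM3", "COM4"], CANDIDATE_IRQS, IRQS_IN_USE))
```

If a sound card has already claimed IRQ5, adding `5: "sound card"` to the in-use table moves COM3 and COM4 to IRQ7 and IRQ9 instead.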

How to make use of these extra COM ports
Modern GUI Operating Systems have support for almost any combination of I/O address and IRQ built in and some DOS software packages have a facility to configure the COM ports for I/O address and IRQ line.

The x86 Parallel Port

The Parallel Port:

This port was first provided for a Printer (a hard copy device). DOS and many application programs expect the printer to be a Parallel interfaced device, connected to the first Printer Port, LPT 1. The actual I/O address of LPT1 depends on the hardware present in the computer.

The Parallel Port I/O address assignment

Three addresses are available to the Parallel Ports and at boot-up, the setup routines in the BIOS ROM look for Parallel Ports on the I/O bus, and assigns the LPT numbers, from LPT 1, in this order :-

Officially LPT1 uses I/O address 0378 to 037A, but when the BIOS setup routine is looking for Parallel Ports it assigns the first one it finds (in the order given above) as LPT1. The address 03BC to 03BE was first provided by a Parallel Port on IBM's Mono Display Adaptor Video Card, but today it is quite common to find this address available on Parallel Port hardware.
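
The effect of "first one found becomes LPT1" can be shown with a short model. Note the assumption: the search order is not reproduced in the text here, so the sketch uses the commonly documented sequence 3BC, 378, 278 (hex). Treat that order, and the function name `assign_lpt`, as assumptions made for this illustration.

```python
# Assumed search order (hex base addresses); not taken from the text above.
PROBE_ORDER = [0x3BC, 0x378, 0x278]

def assign_lpt(present):
    """Number the ports that are actually present, in search order, from LPT1."""
    found = [base for base in PROBE_ORDER if base in present]
    return {f"LPT{n}": base for n, base in enumerate(found, start=1)}

# A machine with ports at 378 and 278 only: 378 becomes LPT1.
print({name: hex(base) for name, base in assign_lpt({0x378, 0x278}).items()})

# Add a port at 3BC and the same 378 hardware is now called LPT2.
print({name: hex(base) for name, base in assign_lpt({0x3BC, 0x378, 0x278}).items()})
```

This is why software that talks to the port hardware directly needs the I/O address, not just the LPT number.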

The Parallel Ports are assigned an IRQ line as follows.

Port	IRQ
LPT 1	IRQ 7 
LPT 2	IRQ 7 or IRQ 5
In the eight bit PC computer (PC or PC/XT type) IRQ 7 was assigned to both LPT 1 and LPT 2 but in later generation hardware IRQ 5 is assigned to LPT 2.

The IRQ line is not usually used by software communicating with the LPT Ports, and so IRQ 7 and IRQ 5 are usually available for other I/O functions. Sound Cards as a rule use either IRQ5, 7 or 10 as the default IRQ.

Parallel Ports can be used for:

The Parallel Port Standard is based on the Centronics Parallel Interface Standard but it has been modified to be bidirectional. Some older Parallel Port hardware in some DOS type computers is not fully bidirectional and will not work with some devices such as Pocket Hard Drives and Tape Backup Drives. The standard Parallel Cable has a DB25P (plug) on the computer end (a socket is used on the computer) and a 36 pin Centronics plug on the printer end. The cable should be shielded and should be no longer than 3 meters. When ASIC chips were first used to provide the Parallel Port, some of these had trouble driving long cables (over 3 m) because they had LSI outputs rather than TTL outputs and they did not like high capacitance loading.

MORE DETAILS ON THE PC PARALLEL PORT

Over the years the Parallel Port on the back of a typical PC Computer, has undergone slow but steady improvement. We now have six types of Parallel Port that have been used over the years.

  1. Unidirectional (4 bit)
  2. Bidirectional (8 bit)
  3. Standard Parallel Port (SPP) also called Type 1
  4. DMA Type 3 (used only by IBM)
  5. Enhanced Parallel Port (EPP)
  6. Extended Capabilities Port (ECP)

By 1994 this development was getting out of hand, and so the IEEE set down standard modes of operation for the Parallel Port, in a document with the title IEEE 1284-1994, Standard Signaling Method for a Bi-directional Parallel Interface for Personal Computers. Before this time there were no set standards as to how the Parallel Port should behave when connected to devices such as Printers, Scanners, External Disk Drives etc. The IEEE defined five modes of operation. These modes take care of the various types of hardware that have developed over the years since the PC Computer was released.

  1. Compatibility or Centronics Mode
  2. Nibble Mode
  3. Byte Mode
  4. EPP Mode (Enhanced Parallel Port)
  5. ECP Mode (Extended Capabilities Mode)

This IEEE specification is aimed at standardizing the behavior between a PC Computer and an attached device. Although the specification deals mainly with Printers, devices like SCSI Adaptors, CDROM, High Capacity Disk Drive and Tape Backup Adaptors, Optical Scanners and simple LAN interfaces are also covered to some extent.

The Unidirectional (4 bit) Port

When the PC was first designed, the Parallel Port was only intended to send data to a printer. Eight Data lines sent eight bit data to the printer, and control lines available in the original Centronics Interface standard provided for flow control and error signals. All 5 of the lines running from the peripheral to the PC (the Status lines) are normally used for external status indications. Using four of these lines, a peripheral can send a byte of data (8 bits) to the PC as 2 nibbles (4 bits each) in two data transfer cycles.

The Unidirectional (4 bit) Port was capable of data transfer rates of 40 to 60 KB/s in the reverse direction and up to 140 KB/s in the forward direction.
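
The two-cycle nibble transfer described above can be sketched in C. This is only an illustration of the arithmetic, not driver code: `combine_nibbles` is a hypothetical helper, and the order of the nibbles (low half first) is an assumption that is not stated in these notes.

```c
#include <stdint.h>

/* Combine two 4-bit values, received in two transfer cycles, into one byte.
 * Assumption: the first nibble carries bits 0-3, the second bits 4-7. */
uint8_t combine_nibbles(uint8_t first, uint8_t second)
{
    return (uint8_t)((first & 0x0F) | ((second & 0x0F) << 4));
}
```

Because each byte costs two cycles plus software handshaking, it is not surprising that the reverse direction is much slower than the forward direction.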

The Bi-directional (8 bit) Port and Standard Parallel Port (SPP)

This was introduced in 1987 with the IBM PS/2 range of computers. Alternative names include Standard Parallel Port (SPP), PS/2 type, and Type 1. Data transfer rates as high as 300 KB/s can be achieved.

The Bi-directional Parallel Port opened up the way for eight bit communications between the computer and peripheral devices across the Parallel I/O Port. This was done by redefining some unused pins in the Parallel (Centronics) connector, and by defining a Status Bit, used to indicate which direction data was traveling across the interface.
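
As a rough sketch of how software might flip the direction of the data lines: on many PS/2 style ports the direction flag is commonly documented as bit 5 of the control register (base address + 2). That bit position and all of the names below are assumptions made for illustration, so check the hardware you are actually using.

```c
#include <stdint.h>

/* Assumed position of the direction flag in the control register:
 * bit 5 set = peripheral to PC, bit 5 clear = PC to peripheral. */
#define PP_CTRL_DIRECTION 0x20u

/* Return the control byte with the data lines set to input. */
uint8_t pp_ctrl_set_input(uint8_t ctrl)
{
    return (uint8_t)(ctrl | PP_CTRL_DIRECTION);
}

/* Return the control byte with the data lines set to output. */
uint8_t pp_ctrl_set_output(uint8_t ctrl)
{
    return (uint8_t)(ctrl & (uint8_t)~PP_CTRL_DIRECTION);
}

/* Non-zero if the control byte selects the input direction. */
int pp_ctrl_is_input(uint8_t ctrl)
{
    return (ctrl & PP_CTRL_DIRECTION) != 0;
}
```

These helpers only compute the new register value. On an x86 Linux machine the result would still have to be written to the port, for example with `outb()` after gaining access through `ioperm()`.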

Bidirectional (8 bit DMA) Type 3 Port

The use of a DMA Channel made this port much faster than the Type 1 port covered above. There was also a similar Type 2 port from IBM but this was not used for long and was superseded by the Type 3. IBM was the only company to use Type 2 and 3 Parallel I/O Ports.

The Enhanced Parallel Port (EPP)

The EPP was developed in 1992 by Intel, Xircom and Zenith and is sometimes referred to as the Fast Mode Parallel Port. EPP can operate at close to the ISA Bus speed, providing about ten times the data rate of the older Parallel Port modes. Transfer rates in the order of 500 KB/s to 2 MB/s are possible. This is achieved by allowing the hardware contained in the port to provide flow control (handshaking) rather than having the service routines do it.

The IEEE incorporated the EPP standard into its document 1284-1994, but because of some minor changes it made to the 1992 version of the standard, we now have two incompatible standards for EPP. There is the original EPP Standards Committee version 1.7, and the IEEE 1284 version. Because the differences were only minor, new peripherals can be designed to cope with the two variations, but older peripherals made to the original EPP 1.7 standard may not work with the newer IEEE 1284 ports.

The Extended Capabilities Mode (ECP)

The Extended Capabilities Mode was jointly designed by Hewlett Packard and Microsoft and announced in 1992. ECP was also included in the IEEE 1284 specification in 1994. Like EPP, ECP uses additional hardware to generate the flow control signals and runs at very much the same speed as an EPP Port. In addition ECP requires a DMA channel to move data about, and uses a FIFO buffer for sending and/or receiving data. The use of a DMA channel can lead to conflicts with other devices that also use DMA and it is often best to choose EPP mode rather than ECP. The rapid adoption of Plug and Play hardware, and Plug and Play aware operating systems like Windows 95, means the DMA channel should no longer be a problem in the near future.

Another feature of ECP is real time data compression. It uses Run Length Encoding (RLE) to achieve data compression ratios of up to 64:1. This is useful with devices such as Optical Scanners and Printers, where a good part of the data consists of long repetitive strings.
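
To make the idea of Run Length Encoding concrete, here is a minimal sketch in C. It is a simplified illustration and not the ECP wire format: `rle_encode` is a hypothetical function, the (count, value) pair layout is our own choice, and the cap of 128 bytes per run is an assumption picked so that the best case (128 identical bytes stored in 2 bytes) matches the 64:1 figure quoted above.

```c
#include <stddef.h>
#include <stdint.h>

#define RLE_MAX_RUN 128  /* assumed cap: 128 bytes in, 2 bytes out = 64:1 */

/* Encode src[0..n-1] as pairs of (run length - 1, value).
 * Returns the number of bytes written to dst, or 0 if dst is too small
 * (0 is also returned for empty input). dst never needs more than 2 * n bytes. */
size_t rle_encode(const uint8_t *src, size_t n, uint8_t *dst, size_t cap)
{
    size_t out = 0;
    size_t i = 0;

    while (i < n) {
        size_t run = 1;

        /* Extend the run while the next byte repeats and the cap allows. */
        while (i + run < n && src[i + run] == src[i] && run < RLE_MAX_RUN)
            run++;

        if (out + 2 > cap)
            return 0;

        dst[out++] = (uint8_t)(run - 1);  /* 0..127 means 1..128 copies */
        dst[out++] = src[i];
        i += run;
    }
    return out;
}
```

Note that data with no repetition doubles in size under this scheme, which is why RLE pays off mainly on scanner and printer data with long uniform stretches.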

The Extended Capabilities Port supports a method of channel addressing. This is not intended to be used to daisy chain devices but rather to address multiple devices within one device. One example is some of the latest Fax machines on the market. They can be connected to a computer via a Parallel Port and can operate as separate devices such as the Scanner, Modem/Fax and Printer, where each part can be addressed separately, even if the other devices cannot accept data due to full buffers.
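
Channel addressing and the run length counts used for compression are often described as sharing a single ECP command byte, with the top bit selecting which one is meant. The sketch below assumes that layout (bit 7 set = channel address, bit 7 clear = run length count, low 7 bits = value 0 to 127). The layout and the type and function names are assumptions for illustration and should be checked against the IEEE 1284 text before being relied on.

```c
#include <stdint.h>

typedef enum {
    ECP_CMD_RUN_LENGTH,       /* assumed: bit 7 clear */
    ECP_CMD_CHANNEL_ADDRESS   /* assumed: bit 7 set */
} ecp_cmd_kind;

typedef struct {
    ecp_cmd_kind kind;
    uint8_t value;            /* low 7 bits, 0-127 */
} ecp_cmd;

/* Split a command byte into its kind and its 7-bit value. */
ecp_cmd ecp_decode_command(uint8_t byte)
{
    ecp_cmd cmd;

    cmd.kind = (byte & 0x80) ? ECP_CMD_CHANNEL_ADDRESS : ECP_CMD_RUN_LENGTH;
    cmd.value = (uint8_t)(byte & 0x7F);
    return cmd;
}
```

Under this assumed layout a multi-function device could expose up to 128 logical channels, for example one each for its scanner, fax and printer parts.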

Hardware Details of the Parallel Interface Connector

The Parallel Port as implemented in the original PC Computer consisted of a DB25S connector with 17 signal lines and 8 ground lines. The signal lines can be divided into three groups: Control (4 lines), Status (5 lines) and Data (8 lines).

As originally designed, the Control lines were used as Interface Control and Flow Control (handshaking) signals from the PC to the printer. The Status lines were used for Flow Control signals and as Status Indicators for such things as paper empty, busy indication and interface or peripheral errors. The data lines were used to provide data from the PC to the printer, in that direction only. As we have already said, later implementations of the Parallel Port allowed for data to be driven from the peripheral to the PC.
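
The Status lines are usually read by software as one byte. The following C sketch decodes such a byte into named flags. The bit positions used here (bit 3 Error, bit 4 Select, bit 5 Paper Empty, bit 6 Acknowledge, bit 7 Busy) and the polarities are the commonly documented layout of the PC status register, but they are not given in these notes, so treat them as assumptions; the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool error;        /* printer reports a fault */
    bool selected;     /* printer is "on line" */
    bool paper_empty;  /* printer has run out of paper */
    bool ack;          /* printer is acknowledging a byte */
    bool busy;         /* printer cannot accept data */
} pp_status;

/* Decode a raw status register value (assumed bit layout, see above). */
pp_status pp_decode_status(uint8_t reg)
{
    pp_status s;

    s.error       = (reg & 0x08) == 0;  /* bit 3, assumed active low */
    s.selected    = (reg & 0x10) != 0;  /* bit 4 */
    s.paper_empty = (reg & 0x20) != 0;  /* bit 5 */
    s.ack         = (reg & 0x40) == 0;  /* bit 6, assumed active low */
    s.busy        = (reg & 0x80) == 0;  /* bit 7, assumed inverted by hardware */
    return s;
}
```

Keeping the polarity handling in one function means the rest of a program can test plain booleans instead of repeating bit masks.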

The original Parallel Interface Port used open collector TTL devices on each side of the interface and these can be damaged by ESD.

The Parallel Ports in modern PC hardware use VLSI devices that are not open collector devices, and these are also easy to damage by ESD. Their outputs often do not conform to the TTL standards, and they may have trouble driving older printers, long cables, and external signal-powered devices.

The PC printer cable

The printer cable has a DB25P connector on the "computer end" and a 36 pin Centronics connector on the "printer end". To limit the Radio Frequency Interference (RFI) generated, the cable should be a shielded cable with shielded connectors on both ends. The original official limit on cable length was 3 meters, but this depends on the type of Parallel Port hardware; some can drive far longer cables.

The Pin-outs for the Parallel Interface cable are as follows.

Line name       DB25S     36 pin Centronics           Notes
Strobe          1         1                           a 1 usec pulse used to clock data into the printer
Data 0          2         2
Data 1          3         3
Data 2          4         4
Data 3          5         5
Data 4          6         6
Data 5          7         7
Data 6          8         8
Data 7          9         9
Acknowledge     10        10                          acknowledge signal from printer to computer
Busy            11        11                          used by the printer to stop the flow of data
Paper Empty     12        12                          indicates the printer has run out of paper
Select Out      13        13                          indicates the printer is "on line"
Auto Feed       14        14                          not often implemented - wired to ground
Error           15        32                          indicates a fault in the printer (motor jammed etc)
Initialization  16        31                          clears the printer's buffers and resets defaults
Select input    17        36                          a signal on this line is the same as "select button"
Ground          20 to 25  18 to 25, 16, 19 to 30, 33  18 to 25 are paired with the Data wires pins 2 to 9 as shields

Note - the original specification included +5 volts on pin 18 and a "clock signal" from pin 15.
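
When wiring or testing a cable it can help to have the signal pin mapping from the table in code form. The C function below is a small sketch that covers only the 17 signal lines; `db25_to_centronics` is a hypothetical name, and ground pins are deliberately left out and reported as 0.

```c
/* Map a DB25 signal pin (1-17) to the matching 36 pin Centronics pin,
 * following the pin-out table above. Returns 0 for any other pin
 * (ground pins or invalid numbers). */
int db25_to_centronics(int db25_pin)
{
    if (db25_pin >= 1 && db25_pin <= 14)
        return db25_pin;      /* Strobe through Auto Feed use the same numbers */

    switch (db25_pin) {
    case 15: return 32;       /* Error */
    case 16: return 31;       /* Initialization */
    case 17: return 36;       /* Select input */
    default: return 0;        /* not a signal pin */
    }
}
```

For example, `db25_to_centronics(15)` gives 32, which is the Error line on the printer end.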

The IEEE 1284 standard specifies 3 different connectors for use with the Parallel Port. The first (1284 Type A) is the DB25 connector found on the back of most computers, and the second (1284 Type B) is the 36 pin Centronics connector found on most printers. The third, the IEEE 1284 Type C connector, is also a 36 conductor connector like the Centronics, but it is much smaller. IEEE 1284 Type C also defines two more pins for signals which each side can use to see whether the device at the other end has power applied.