Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

Inside a ferroelectric RAM chip

23 September 2024 at 18:44

Ferroelectric memory (FRAM) is an interesting storage technique that stores bits in a special "ferroelectric" material. Ferroelectric memory is nonvolatile like flash memory, able to hold its data for decades. But, unlike flash, ferroelectric memory can write data rapidly. Moreover, FRAM is much more durable than flash and can be be written trillions of times. With these advantages, you might wonder why FRAM isn't more popular. The problem is that FRAM is much more expensive than flash, so it is only used in niche applications.

Die of the Ramtron FM24C64 FRAM chip. (Click this image (or any other) for a larger version.)

Die of the Ramtron FM24C64 FRAM chip. (Click this image (or any other) for a larger version.)

This post takes a look inside an FRAM chip from 1999, designed by a company called Ramtron. The die photo above shows this 64-kilobit chip under a microscope; the four large dark stripes are the memory cells, containing tiny cubes of ferroelectric material. The horizontal greenish bands are the drivers to select a column of memory, while the vertical greenish band at the right holds the sense amplifiers that amplify the tiny signals from the memory cells. The eight whitish squares around the border of the die are the bond pads, which are connected to the chip's eight pins.1 The logic circuitry at the left and right of the die implements the serial (I2C) interface for communication with the chip.2

The history of ferroelectric memory dates back to the early 1950s.3 Many companies worked on FRAM from the 1950s to the 1970s, including Bell Labs, IBM, RCA, and Ford. The 1955 photo below shows a 256-bit ferroelectric memory built by Bell Labs. Unfortunately, ferroelectric memory had many problems,4 limiting it to specialized applications, and development was mostly abandoned by the 1970s.

A 256-bit ferroelectric memory made by Bell Labs. Photo from Scientific American, June, 1955.

A 256-bit ferroelectric memory made by Bell Labs. Photo from Scientific American, June, 1955.

Ferroelectric memory had a second chance, though. A major proponent of ferroelectric memory was George Rohrer, who started working on ferroelectric memory in 1968. He formed a memory company, Technovation, which was unsuccessful, and then cofounded Ramtron in 1984.5 Ramtron produced a tiny 256-bit memory chip in 1988, followed by much larger memories in the 1990s.

How FRAM works

Ferroelectric memory uses a special material with the property of ferroelectricity. In a normal capacitor, applying an electric field causes the positive and negative charges to separate in the dielectric material, making it polarized. However, ferroelectric materials are special because they will retain this polarization even when the electric field is removed. By polarizing a ferroelectric material positively or negatively, a bit of data can be stored. (The name "ferroelectric" is in analogy to "ferromagnetic", even though ferroelectric materials are not ferrous.)

This FRAM chip uses a ferroelectric material called lead zirconate titanate or PZT, containing lead, zirconium, titanium, and oxygen. The diagram below shows how an applied electric field causes the titanium or zirconium atom to physically move inside the crystal lattice, causing the ferroelectric effect. (Red atoms are lead, purple are oxygen, and yellow are zirconium or titanium.) Because the atoms physically change position, the polarization is stable for decades; in contrast, the capacitors in a DRAM chip lose their data in milliseconds unless refreshed. FRAM memory will eventually wear out, but it can be written trillions of times, much more than flash or EEPROM memory.

The ferroelectric effect in the PZT crystal. From Ramtron Catalog, cleaned up.

The ferroelectric effect in the PZT crystal. From Ramtron Catalog, cleaned up.

To store data, FRAM uses ferroelectric capacitors, capacitors with a ferroelectric material as the dielectric between the plates. Applying a voltage to the capacitor will create an electric field, polarizing the ferroelectric material. A positive voltage will store a 1, and a negative voltage will store a 0.

Reading a bit from memory is a bit tricky. A positive voltage is applied, forcing the material into the 1 state. If the material was already in the 1 state, minimal current will flow. But if the material was in the 0 state, more current will flow as the capacitor changes state. This allows the 0 and 1 states to be distinguished.

Note that reading the bit destroys the stored value. Thus, after a read, the 0 or 1 value must be written back to the capacitor to restore its previous state. (This is very similar to the magnetic core memory that was used in the 1960s.)6

The FRAM chip that I examined uses two capacitors per bit, storing opposite values. This approach makes it easier to distinguish a 1 from a 0: a sense amplifier compares the two tiny signals and generates a 1 or a 0 depending on which is larger. The downside of this approach is that using two capacitors per bit reduces the memory capacity. Later FRAMs increased the density by using one capacitor per bit, along with reference cells for comparison.7

A closer look at the die

The diagram below shows the main functional blocks of the chip.8 The memory itself is partitioned into four blocks. The word line decoders select the appropriate column for the address and the drivers generate the pulses on the word and plate lines. The signals from that column go to the sense amplifiers on the right, where the signals are converted to bits and written back to memory. On the left, the precharge circuitry charges the bit lines to a fixed voltage at the start of the memory cycle, while the decoders select the desired byte from the bit lines.

The die with the main functional blocks labeled.

The die with the main functional blocks labeled.

The diagram below shows a closeup of the memory. I removed the top metal layer and many of the memory cells to reveal the underlying structure. The structure is very three-dimensional compared to regular chips; the gray squares in the image are cubes of PZT, sitting on top of the plate lines. The brown rectangles labeled "top plate connection" are also three-dimensional; they are S-shaped brackets with the low end attached to the silicon and the high end contacting the top of the PZT cube. Thus, each PZT cube forms a capacitor with the plate line forming the bottom plate of the capacitor, the bracket forming the top plate connection, and the PZT cube sandwiched in between, providing the ferroelectric dielectric. (Some cubes have been knocked loose in this photo and are sitting at an angle; the cubes form a regular grid in the original chip.)

Structure of the memory. The image is focus-stacked for clarity.

Structure of the memory. The image is focus-stacked for clarity.

The physical design of the chip is complicated and quite different from a typical planar integrated circuit. Each capacitor requires a cube of PZT sandwiched between platinum electrodes, with the three-dimensional contact from the top of the capacitor to the silicon. Creating these structures requires numerous steps that aren't used in normal integrated circuit fabrication. (See the footnote9 for details.) Moreover, the metal ions in the PZT material can contaminate the silicon production facility unless great care is taken, such as using a separate facility to apply the ferroelectric layer and all subsequent steps.10 The additional fabrication steps and unusual materials significantly increase the cost of manufacturing FRAM.

Each top plate connection has an associated transistor, gated by a vertical word line.11 The transistors are connected to horizontal bit lines, metal lines that were removed for this photo. A memory cell, containing two capacitors, measures about 4.2 µm × 6.5 µm. The PZT cubes are spaced about 2.1 µm apart. The transistor gate length is roughly 700 nm. The 700 nm node was introduced in 1993, while the die contains a 1999 copyright date, so the chip appears to be a few years behind the cutting edge as far as node.

The memory is organized as 256 capacitors horizontally by 512 capacitors vertically, for a total of 64 kilobits (since each bit requires two capacitors). The memory is accessed as 8192 bytes. Curiously, the columns are numbered on the die, as shown below.

With the metal removed, the numbers are visible counting the columns.

With the metal removed, the numbers are visible counting the columns.

The photo below shows the sense amplifiers to the right of the memory, with some large transistors to boost the signal. Each sense amplifier receives two signals from the pair of capacitors holding a bit. The sense amplifier determines which signal is larger, deciding if the bit is a 0 or 1. Because the signals are very small, the sense amplifier must be very sensitive. The amplifier has two cross-connected transistors with each transistor trying to pull the other signal low. The signal that starts off larger will "win", creating a solid 0 or 1 signal. This value is rewritten to memory to restore the value, since reading the value erases the cells. In the photo, a few of the ferroelectric capacitors are visible at the far left. Part of the lower metal layer has come loose, causing the randomly strewn brown rectangles.

The sense amplifiers.

The sense amplifiers.

The photo below shows eight of the plate drivers, below the memory cells. This circuit generates the pulse on the selected plate line. The plate lines are the thick white lines at the top of the image; they are platinum so they appear brighter in the photo than the other metal lines. Most of the capacitors are still present on the plate lines, but some capacitors have come loose and are scattered on the rest of the circuitry. Each plate line is connected to a metal line (brown), which connects the plate line to the drive transistors in the middle and bottom of the image. These transistors pull the appropriate plate line high or low as necessary. The columns of small black circles are connections between the metal line and the silicon of the transistor underneath.

The plate driver circuitry.

The plate driver circuitry.

Finally, here's the part number and Ramtron logo on the die.

Closeup of the logo "FM24C64A Ramtron" on the die.

Closeup of the logo "FM24C64A Ramtron" on the die.

Conclusions

Ferroelectric RAM is an example of a technology with many advantages that never achieved the hoped-for success. Many companies worked on FRAM from the 1950s to the 1970s but gave up on it. Ramtron tried again and produced products but they were not profitable. Ramtron had hoped that the density and cost of FRAM would be competitive with DRAM, but unfortunately that didn't pan out. Ramtron was acquired by Cypress Semiconductor in 2012 and then Cypress was acquired by Infineon in 2019. Infineon still sells FRAM, but it is a niche product, for instance satellites that need radiation hardness. Currently, FRAM costs roughly $3/megabit, almost three orders of magnitude more expensive than flash memory, which is about $15/gigabit. Nonetheless, FRAM is a fascinating technology and the structures inside the chip are very interesting.

For more, follow me on Mastodon as @kenshirriff@oldbytes.space or RSS. (I've given up on Twitter.) Thanks to CuriousMarc for providing the chip, which was used in a digital readout (DRO) for his CNC machine.

Notes and references

  1. The photo below shows the chip's 8-pin package.

    The chip is packaged in an 8-pin DIP. "RIC" stands for Ramtron International Corporation.

    The chip is packaged in an 8-pin DIP. "RIC" stands for Ramtron International Corporation.

     

  2. The block diagram shows the structure of the chip, which is significantly different from a standard DRAM chip. The chip has logic to handle the I2C protocol, a serial protocol that uses a clock and a data line. (Note that the address lines A0-A2 are the address of the chip, not the memory address.) The WP (Write Protect) pin, protects one quarter of the chip from being modified. The chip allows an arbitrary number of bytes to be read or written sequentially in one operation. This is implemented by the counter and address latch.

    Block diagram of the FRAM chip. From the datasheet.

    Block diagram of the FRAM chip. From the datasheet.

     

  3. An early description of ferroelectric memory is in the October 1953 Proceedings of the IRE. This issue focused on computers and had an article on computer memory systems by J. P. Eckert of ENIAC fame. In 1953, computer memory systems were primitive: mercury delay lines, electrostatic CRTs (Williams tubes), or rotating drums. The article describes experimental memory technologies including ferroelectric memory, magnetic core memory, neon-capacitor memory, phosphor drums, temperature-sensitive pigments, corona discharge, or electrolytic diodes. Within a couple of years, magnetic core memory became successful, dominating storage until semiconductor memory took over in the 1970s, and most of the other technologies were forgotten. 

  4. A 1969 article in Electronics discussed ferroelectric memories. At the time, ferroelectric memories were used for a few specialized applications. However, ferroelectric memories had many issues: slow write speed, high voltages (75 to 150 volts), and expensive logic to decode addresses. The article stated: "These considerations make the future of ferroelectric memories in computers rather bleak." 

  5. Interestingly, the "Ram" in Ramtron comes from the initials of the cofounders: Rohrer, Araujo, and McMillan. Rohrer originally focused on potassium nitrate as the ferroelectric material, as described in his patent. (I find it surprising that potassium nitrate is ferroelectric since it seems like such a simple, non-exotic chemical.) An extensive history of Ramtron is here. A Popular Science article also provides information. 

  6. Like core memory, ferroelectric memory is based on a hysteresis loop. Because of the hysteresis loop, the material has two stable states, storing a 0 or 1. While core memory has a hysteresis loop for magnetization with respect to the magnetic field, ferroelectric memory The difference is that core memory has hysteresis of the magnetization with respect to the applied magnetic field, while ferroelectric memory has hysteresis of the polarization with respect to the applied electric field. 

  7. The reference cell approach is described in Ramtron patent 6028783A. The idea is to have a row of reference capacitors, but the reference capacitors are sized to generate a current midway between the 0 current and the 1 current. The reference capacitors provide the second input to the sense amplifiers, allowing the 0 and 1 bits to be distinguished. 

  8. Ramtron's 1987 patent describes the approximate structure of the memory. 

  9. The diagram below shows the complex process that Ramtron used to create an FRAM chip. (These steps are from a 2003 patent, so they may differ from the steps for the chip I examined.)

    Ramtron's process flow to create an FRAM die. From Patent 6613586.

    Ramtron's process flow to create an FRAM die. From Patent 6613586.

    Abbreviations: BPSG is borophosphosilicate glass. UTEOS is undoped tetraethylorthosilicate, a liquid used to deposit silicon dioxide on the surface. RTA is rapid thermal anneal. PTEOS is phosphorus-doped tetraethylorthosilicate, used to create a phosphorus-doped silicon dioxide layer. CMP is chemical mechanical planarization, polishing the die surface to be flat. TEC is the top electrode contact. ILD is interlevel dielectric, the insulating layer between conducting layers. 

  10. See the detailed article Ferroelectric Memories, Science, 1989, by Scott and Araujo (who is the "A" in "Ramtron"). 

  11. Early FRAM memories used an X-Y grid of wires without transistors. Although much simpler, this approach had the problem that current could flow through unwanted capacitors via "sneak" paths, causing noise in the signals and potentially corrupting data. High-density integrated circuits, however, made it practical to associate a transistor with each cell in modern FRAM chips. 

Inside an IBM/Motorola mainframe controller chip from 1981

16 July 2024 at 06:15

In this article, I look inside a chip in the IBM 3274 Control Unit.1 But before I discuss the chip, I need to give some background on mainframes. (I didn't completely analyze the chip, so don't expect a nice narrative or solid conclusions.)

Die photo of the Motorola/IBM SC81150 chip. Click this image (or any other) for a larger version.

Die photo of the Motorola/IBM SC81150 chip. Click this image (or any other) for a larger version.

IBM's vintage mainframes were extremely underpowered compared to modern computers; a System/370 mainframe ran well under 1 million instructions per second, while a modern laptop executes billions of instructions per second. But these mainframes could support rooms full of users, while my 2017 laptop can barely handle one person.2 Mainframes achieved their high capacity by offloading much of the data entry overhead so the mainframe could focus on the "important" work. The mainframe received data directly into memory in bulk over high-speed I/O channels, without needing to handle character-by-character editing. For instance, a typical data entry terminal (a "3270") let the user update fields on the screen without involving the computer. When the user had filled out the screen, pressing the "Enter" key sent the entire data record to the mainframe at once. Thus, the mainframe didn't need to process every keystroke; it only dealt with complete records. (This is also why many modern keyboards have an "Enter" key.)

A room with IBM 3179 Color Display Stations, 1984. Note that these are terminals, not PCs. From 3270 Information Display System Introduction.

A room with IBM 3179 Color Display Stations, 1984. Note that these are terminals, not PCs. From 3270 Information Display System Introduction.

But that was just the beginning of the hierarchy of offloaded processing in a mainframe system. Terminals weren't attached directly to the mainframe. You could wire 16 terminals to a terminal multiplexer (such as the 3299). This would in turn be connected to a 3274 Control Unit that merged the terminal data and handled the network protocols. The Control Unit was connected to the mainframe's channel processor which handled I/O by moving data between memory and peripherals without slowing down the CPU. All these layers allowed the mainframe to focus on the important data processing while the layers underneath dealt with the details.3

An overview of the IBM 3270 Information Display System attachment. The yellow highlights indicate the 3274 Control Unit. From 3270 Information Display System: Introduction.

An overview of the IBM 3270 Information Display System attachment. The yellow highlights indicate the 3274 Control Unit. From 3270 Information Display System: Introduction.

The 3274 Control Unit (highlighted above) is the source of the chip I examined. The purpose of the Control Unit "is to take care of all communication between the host system and your organization's display stations and printers". The diagram above shows how terminals were connected to a mainframe, with the 3274 Control Unit (indicated by arrows) in the middle. The 3274 was an all-purpose box, handling terminals, printers, modems, and encryption (if needed). It could communicate with the mainframe at up to 650,000 characters per second. The control unit below (above) is a boring beige box. The control panel is minimal since people normally didn't interact with the unit. On the back are coaxial connectors for the lines to the terminals, as well as connectors to interface with the computer and other peripherals.

An IBM 3274-41D Control Unit. From bitsavers.

An IBM 3274-41D Control Unit. From bitsavers.

The Keystone II board

In 1983, IBM announced new Control Unit models with twice the speed: these were the Model 41 and Model 61. These units were built around a board called Keystone II, shown below. The board is constructed with IBM's peculiar PCB style. The board is arranged as a grid of squares with the PCB traces too small to see unless you zoom in. Most of the decoupling capacitors are in IBM's thin, rectangular packages, although I see a few capacitors in more standard blue packages. IBM is almost a parallel universe with its unusual packaging for ICs and capacitors as well as the strange circuit board appearance.

The Keystone II board. The box is labeled Keystone II FCS [i.e. First Customer Shipment] July 23, 1982. Photo from bitsavers, originally from Bob Roberts.

The Keystone II board. The box is labeled Keystone II FCS [i.e. First Customer Shipment] July 23, 1982. Photo from bitsavers, originally from Bob Roberts.

Most of the chips on the board are IBM chips packaged in square aluminum cans, known as MST (Monolithic System Technology). The first line on each package is the IBM part number, which is usually undocumented. The empty socket can hold a ROS chip; ROS is Read-Only Store, known as ROM to people outside IBM. The Texas Instruments ICs in the upper right are easier to identify; the 74LS641 chips are octal bus transceivers, presumably connecting this board to the rest of the system. Similarly, the 561 5843 is a 74S240 octal bus driver while the 561 6647 chips are 74LS245 octal bus transceivers.

The memory chips on the left side of this board are interesting: each one consists of two "piggybacked" 16-kilobit DRAM chips. IBM's part number 8279251 corresponds to the Intel 4116 chip, originally made by Mostek. With 18 piggybacked chips, the board holds 64 kilobytes of parity-protected memory.

The photo below shows the Keystone II board mounted in the 3274 Control Unit. The board is in slot E towards the left and the purple Motorola IC is visible.

The Keystone II card in slot E of a 3274-41D Control Unit. Photo from bitsavers.

The Keystone II card in slot E of a 3274-41D Control Unit. Photo from bitsavers.

The Motorola/IBM chip

The board has a Motorola chip in a purple ceramic package; this is the chip that I examined. Popping off the golden lid reveals the silicon die underneath. The package has the part number "SC81150R", indicating a Motorola Special/Custom chip. This part number is also visible on the die, as shown below.

The corner of the die is marked with the SC81150 part number. Bond pads and bond wires are also visible.

The corner of the die is marked with the SC81150 part number. Bond pads and bond wires are also visible.

While the outside of the IC is labeled "Motorola", there are no signs of Motorola internally. Instead, the die is marked "IBM" with the eight-striped logo. My guess is that IBM designed the chip and Motorola manufactured it.

The IBM logo on the die.

The IBM logo on the die.

The diagram below shows the chip with some of the functional blocks identified. Around the outside are the bond pads and the bond wires that are connected to the chip's grid of pins. At the right is the 16×16 block of memory, along with its associated control, byte swap, and output circuitry. The yellowish-white lines are the metal layer on top of the chip that provides the chip's wiring. The thick metal lines distribute power and ground throughout the chip. Unlike modern chips, this chip only has a single metal layer, so power and ground distribution tends to get in the way of useful circuitry.

The die with some functional blocks identified.

The die with some functional blocks identified.

The chip is centered around a 16-bit bus (yellow line) that connects many part of the chip. To write to the bus, a circuit pulls bus lines low. The bus lines are kept high by default by 16 pull-up transistors. This approach was fairly common in the NMOS era. However, performance is limited by the relatively weak pull-up current, making bus lines slow to go high due to R-C delays. For higher performance, some chips would precharge the bus high during one clock cycle and then pull lines low during the next cycle.

The two groups of I/O pins at the bottom are connected to the input buffer on the left and the output buffer on the right. The input buffer includes XOR circuits to compute the parity of each byte. Curiously, only 6 bits of the inputs are connected to the main bus, although other circuits use all 8 bits. The buffer also has a circuit to test for a zero value, but only using 5 of the bits.

I've put red boxes around the numerous PLAs, which can be identified by their grids of transistors. This chip has an unusually large number of PLAs. Eric Schlaepfer hypothesizes that the chip was designed on a prototype circuit board using commercial PAL chips for flexibility, and then they transferred the prototype to silicon, preserving the PLA structure. I didn't see any obvious structure to the PLAs; they all seemed to have wires going all over.

The miscellaneous logic scattered around the chip includes many latches and bus drivers; the latch circuit is similar to the memory cells. I didn't fully reverse-engineer this circuitry but I didn't see anything that looked particularly interesting, such as an ALU or counter. The circuitry near the PLAs could be latches as part of state machines, but I didn't investigate further.

I was hoping to find a recognizable processor inside the package, maybe a Motorola 6809 or 68000 processor. Instead, I found a complicated chip that doesn't appear to be a processor. It has a 16×16 memory block along with about 20 PLAs (Programmable Logic Arrays), a curiously large number. PLAs are commonly used in processors for decoding instructions, since they can match bit patterns. I couldn't find a datapatch in the chip; I expected to see the ALU and registers organized in a large but regular 8-bit or 16-bit block of circuitry. The chip doesn't have any ROM4 so there's no microcode on the chip. For these reasons, I think the chip is not a processor or microcontroller, but a specialized data-handling chip, maybe using the PLAs to interpret bits of a protocol.

The chip is built with NMOS technology, the same as the 6502 and 8086 for instance, rather than CMOS technology that is used in modern chips. I measured the transistor features and the chip appears to be built with a 3.5 µm process (not nm!), which Motorola also used for the 68000 processor (1979).

The memory buffer

The chip has a 16×16 memory buffer, which could be a register file or a FIFO buffer. One interesting feature is that the buffer is triple-ported, so it can handle two reads and one write at the same time. The buffer is implemented as a grid of cells, each storing one bit. Each row corresponds to a 16-bit word, while each column corresponds to one bit in a word. Horizontal control lines (made of polysilicon) select which word gets written or read, while vertical bit lines of metal transmit each bit of the word as it is written or read.

The microscope photo below shows two memory cells. These cells are repeated to create the entire memory buffer. The white vertical lines are metal wiring. The short segments are connections within a cell. The thicker vertical lines are power and ground. The thinner lines are the read and write bit lines. The silicon die itself is underneath the metal. The pinkish regions are active silicon, doped to make it conductive. The speckled golden lines are regions are polysilicon wires between the silicon and the metal. It has two roles: most importantly, when polysilicon crosses active silicon, it forms the gate of a transistor. But polysilicon is also used as wiring, important since this chip only has one layer of metal. The large, dark circles are contacts, connections between the metal layer and the silicon. Smaller square regions are contacts between silicon and polysilicon.

Two memory cells, side by side, as they appear under the microscope.

Two memory cells, side by side, as they appear under the microscope.

It was too difficult to interpret the circuits when they were obscured by the metal layer so I dissolved the metal layer and oxide with hydrochloric acid and Armour Etch respectively. The photo below shows the die with the metal removed; the greenish areas are remnants in areas where the metal was thick, mostly power and ground supplies. The dark regions in this image are regions of doped silicon. These are the active areas of the chip, showing the blocks of circuitry. There are also some thin lines of polysilicon wiring. The memory buffer is the large block on the right, just below the center.

The chip with the metal layer removed. Click to zoom in on the image.

The chip with the metal layer removed. Click to zoom in on the image.

Like most implementations of static RAM, each storage cell of the buffer is implemented with cross-coupled inverters, with the output of one inverter feeding into the input of the other. To write a new value to the cell, the new value simply overpowers the inverter output, forcing the cell to the new state. To support this, one of the inverters is designed to be weak, generating a smaller signal than a regular inverter. Most circuits that I've examined create the inverter by using a weak transistor, one with a longer gate. This chip, however, uses a circuit that I haven't seen before: an additional transistor, configured to limit the current from the inverter.

The schematic below shows one cell. Each cell uses ten transistors, so it is a "10T" cell. To support multiple reads and writes, each row of cells has three horizontal control signals: one to write to the word, and two to read. Each bit position has one vertical bit line to provide the write data and two vertical bit lines for the data that is read. Pass transistors connect the bit lines to the selected cells to perform a read or a write, allowing the data to flow in or out of the cell. The symbol that looks like an op-amp is a two-transistor NMOS buffer to amplify the signal when reading the cell.

Schematic of one memory cell.

Schematic of one memory cell.

With the metal layer removed, it is easier to see the underlying silicon circuitry and reverse-engineer it. The diagram below shows the silicon and polysilicon for one storage cell, corresponding to the schematic above. (Imagine vertical metal lines for power, ground, and the three bitlines.)

One memory cell with the metal layer removed. I etched the die a few seconds too long so some of the polysilicon is very thin or missing.

One memory cell with the metal layer removed. I etched the die a few seconds too long so some of the polysilicon is very thin or missing.

The output from the memory unit contains a byte swapper. A 16-bit word is generated with the left half from the read 1 output and the second half from the read 2 output, but the bytes can be swapped. This was probably used to read an aligned 16-bit word if it was unaligned in memory.

Parity circuits

In the lower right part of the chip are two parity circuits, each computing the parity of an 8-bit input. The parity of an input is computed by XORing the bits together through a tree of 2-input XOR gates. First, four gates process pairs of input bits. Next, two XOR gates combine the outputs of the first gates. Finally, an XOR gate combines the two previous outputs to generate the final parity.

The arrangement of the 14 XOR gates to compute parity of the two 8-bit values A and B.

The arrangement of the 14 XOR gates to compute parity of the two 8-bit values A and B.

The schematic below shows how an XOR gate is built from a NOR gate and an AND-NOR gate. If both inputs are 0, the first NOR gate forces the output to 0. If both inputs are 1, the AND gate forces the output to 0. Thus, the circuit computes XOR. Each labeled block above implements the XOR circuit below.

Schematic of an XOR gate.

Schematic of an XOR gate.

Conclusion

My conclusion is that the processor for the Keystone II board is probably one of the other chips, one of the IBM metal-can MST packages, and this chip helps with data movement in some way. It would be possible to trace out the complete circuitry of the chip and determine exactly how it functions, but that is too time-consuming a project for this relatively obscure chip.

Follow me on Twitter @kenshirriff or RSS for more chip posts. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space. Thanks to Al Kossow for providing the chip and Dag Spicer for providing photos. Thanks to Eric Schlaepfer for discussion.

Notes and references

  1. The 3274 Control Unit was replaced by the 3174 Establishment Controller, introduced in 1986. An "Establishment Controller" managed a cluster of peripherals or PCs connected to a host mainframe, essentially a box that provided a "kitchen-sink" of functionality including terminal support, local disk storage, Ethernet or token-ring networking, ASCII terminal support, encryption/decryption, and modem support. These units ranged from PC-sized boxes to mini-fridge-sized boxes, depending on how much functionality was required. 

  2. I'm serious that my laptop can barely handle one person; my 2017 MacBook Air starts dropping characters if it has even a moderate load, and I have to start one-finger typing. You would think that a 1.8 GHz dual-core i5 processor could handle more than 2 characters per second. I don't know if there's something wrong with it, or if modern software just has too much overhead. Don't worry, I upgraded and do most of my work on a faster, more recent laptop. 

  3. The IBM hardware model had the CPU focusing on the big picture, while the hierarchy of boxes underneath processed data, performed storage, handled printing, and so forth. In a sense, this paralleled the structure of offices in that era, where executives had assistants and secretaries to do the tedious work for them: typing, filing, and so forth. Nowadays, the computer hierarchy and the office hierarchy are both considerably flatter. Maybe there's a connection? 

  4. A ROM and a PLA are similar in many ways. The general distinction is that a ROM activates one word (row) at a time, while a PLA can activate multiple rows at a time and combine the values, giving more flexibility. A ROM generally has a binary decoder to select the row. This decoder can be recognized by its binary structure: transistors alternating by 1's, by 2's, by 4's, and so forth. 

Inside an unusual 7400-series chip implemented with a gate array

26 March 2024 at 23:00

When I look inside a chip from the popular 7400 series, I know what to expect: a fairly simple die, implemented in a straightforward, cost-effective way. However, when I looked inside a military-grade chip built by Integrated Device Technology (IDT)4 I found a very unexpected layout: over 1500 transistors in an orderly matrix. Even stranger, most of the die is wasted: less than 20% of these transistors are used, forming scattered circuits connected by thin metal wires.

In this blog post, I look at this chip in detail, describe its gates, and explain how it implements the "1-of-4" decoder function. I also discuss why it sometimes makes sense to build chips with a gate array design such as this, despite the inefficiency.

A photo of the tiny silicon die in its package.  This chip is the IDT 54FCT139ALB dual 1-of-4 decoder.  Click this image (or any other) for a larger version.

A photo of the tiny silicon die in its package. This chip is the IDT 54FCT139ALB dual 1-of-4 decoder. Click this image (or any other) for a larger version.

In the photo below, you can see the silicon die in more detail, with the silicon appearing pink. The main circuitry is implemented in the nine rows that form the gate array, a grid of 1584 transistors. The tiny dark rectangles are transistors of two types, NMOS and PMOS, that work together to implement CMOS logic circuits. At this scale, the metal wiring is visible as faint gray lines and smudges, but most of the transistors are unconnected. Surrounding the gate array are 22 input/output (I/O) blocks each with a square bond pad. As with the transistors, many of these I/O blocks are unused. Fourteen of these bond pads have tiny metal bond wires (the thick black lines) that connect the silicon die to the chip's external pins. Finally, the pairs of bond wires at the center left and center right provide ground and power connections for the chip.

Closeup die photo.

Closeup die photo.

The photo below zooms in on three rows of circuitry in the chip. The large dark rectangles are pairs of transistors, with two lines of transistors in each row of circuitry. At the top and bottom of each row, the thick horizontal white lines are metal wiring that provides power and ground. In each row, one line of transistors holds PMOS transistors, next to the power wiring, while the other line holds NMOS transistors, next to the ground wiring. (The orientation flips in each successive row, so it isn't obvious which transistors are which unless you check the power connections at the end of the row.)

A closeup of the die.

A closeup of the die.

The transistors are wired into gates by the metal layers, the white lines. The gates are connected by horizontal and vertical wiring using the wiring channels between the rows. This wiring style is very similar to standard-cell logic. However, unlike standard-cell logic, the underlying transistor grid is fixed, resulting in wasted transistors. In the image above, most of the transistors in the middle row are used, while the top row is unused and the bottom row is mostly unused.

The diagram below shows the structure of one of the transistor blocks, which contains two tall, thin MOS transistors. The vertical metal contacts connect to the sources and drains of the transistors, with the two transistors sharing the middle contact. (On an integrated circuit, the source and drain of a transistor are identical, so it is arbitrary which side is the source and which is the drain.) The short horizontal metal contacts at the top connect to the gates of the two transistors; the gates are made of polysilicon, which is barely visible in the die photo. The gates partition the active silicon (green), forming the transistors. The gate width is approximately 1 µm.

A block of two transistors as they appear on the die, along with a diagram showing the structure. The bar indicates a length of 10 µm.

A block of two transistors as they appear on the die, along with a diagram showing the structure. The bar indicates a length of 10 µm.

NAND gate

In this section, I'll explain the construction of one of the NAND gates on the die. The NAND gate below uses four transistors, two NMOS transistors on the top and two PMOS transistors on the bottom. The white lines are the metal wiring, forming two layers. Most of the wiring (including power and ground) is in the lower (M1) layer. The slightly wider and darker vertical segments are the upper (M2) layer. The circles connect the metal layers when they join, or connect the metal layer to the underlying silicon or polysilicon. With two metal layers, it's a bit tricky to see how the wiring is connected. The A and B inputs each connect to two transistor gates. The transistor group at the top is connected to ground on the right, with the output on the left. The transistor group on the bottom is connected to Vcc on the left and right, with the output in the middle. This has the effect of putting the upper transistors in series and the lower transistors in parallel.

A NAND gate on the die.

A NAND gate on the die.

Below, I've drawn the schematic of the NAND gate. On the left, the layout of the schematic matches the die layout above. On the right, I've redrawn the schematic with a more traditional layout. To understand its operation, note that a PMOS transistor (top on the right schematic) turns on when the input is low, while an NMOS transistor (bottom on the right) turns on when the input is high. When both inputs are high, the two NMOS transistors turn on, connecting ground to the output, pulling it low. When either input is low, one of the PMOS transistors turns on, pulling the output high. Thus, the circuit implements the NAND function. The NMOS and PMOS transistors operate in a complementary fashion, giving CMOS (Complementary MOS) its name.

Schematic of a NAND gate.

Schematic of a NAND gate.

NOR gate

In this section, I'll explain the layout of one of the NOR gates on the die, shown below. This gate is twice as large as the previous NAND gate so it can provide twice the output current.1 The NOR gate uses eight transistors, four PMOS transistors in the upper half and four NMOS transistors in the lower half. (Note that Vcc and ground are flipped compared to the previous gate, as are the NMOS and PMOS transistors.) The two transistors in each block are wired in parallel to produce more current for the output. (A out is the same signal as A in, exiting the block at the top to connect to other circuitry.)

A NOR gate on the die.

A NOR gate on the die.

The schematic below shows the wiring of the eight transistors. The schematic layout corresponds to the physical layout to make it easier to map between the image and the schematic. The upper transistor groups are wired in series, while the lower transistor groups are wired in parallel.

Schematic corresponding to the gate above.

Schematic corresponding to the gate above.

The schematic below has been redrawn to make the functionality clearer, and the parallel transistors have been removed. If either input is high, one of the NMOS transistors on the bottom will turn on and pull the input low. If both inputs are low, the two PMOS transistors will turn on and pull the input high. This provides the desired NOR function.

Simplified NOR gate schematic.

Simplified NOR gate schematic.

Note that the NAND and NOR gates have similar but opposite schematics. In the NAND gate, the NMOS transistors are in series while the PMOS transistors are in parallel. In the NOR gate, the roles of the transistors are swapped.

The chip's circuit

The chip I examined is a "dual 1-of-4 decoder with enable".2 The decoding function takes a two-bit input and selects one of four output lines depending on the binary value. The enable line must be low to activate this operation; otherwise all four output lines are disabled. The chip contains two of these decoders, which is why it is called a dual decoder. In total, the chip contains 18 logic gates,3 so it is very simple, even by 1990s standards.

I reverse-engineered the chip and created the schematic below, showing one of the dual units. Each NAND gate matches one of the four input possibilities to drive one of the four outputs. The NOR gates support the ENABLE signal, blocking the outputs unless ENABLE is active (i.e. low).

Reverse-engineered schematic of half the chip.

Reverse-engineered schematic of half the chip.

The chip uses a general-purpose I/O block (below) for each pin, that can be used as an input or an output depending on how it is wired. Each block contains two large drive transistors: an NMOS transistor to pull the output low and a PMOS transistor to pull the output high. The I/O block has separate control lines for the two output transistors. (At the bottom of the image below, two thin metal wires drive the high-side and low-side transistors.) This permits tri-state logic: if neither transistor is energized, the output is left floating. The gate array drives the output transistors with high-current inverter, constructed from multiple transistors in parallel. (This is why the schematic shows more inverters than may seem necessary.)

One of the 22 I/O blocks on the die. Each I/O block is associated with a bond pad, where a bond wire can be connected to an external pin.

One of the 22 I/O blocks on the die. Each I/O block is associated with a bond pad, where a bond wire can be connected to an external pin.

When used as an input, the pad is wired to the surrounding circuitry slightly differently, connecting to input protection diodes (not shown on the schematic). Thus, the functionality of the I/O blocks can be changed by modifying the metal layers, without changing the underlying silicon.

Some 7400-series history

The earliest logic integrated circuits used resistors and transistors internally, so they were called RTL (Resistor Transistor Logic), but RTL had significant performance problems. RTL was rapidly replaced by Diode Transistor Logic (DTL) and then Transistor Transistor Logic (TTL). In 1964, Texas Instruments created a line of TTL integrated circuits for military applications called the SN5400 series. This was shortly followed by the commercial-grade SN7400 series.

The 7400 series of integrated circuits was inexpensive, fast, and easy to use. The line started with simple logic circuits such as four NAND gates on a chip, and moved into more complex chips such as counters, shift registers, and ALUs. The 7400 series became very popular in the 1970s and 1980s, used by electronics hobbyists and high-performance minicomputers alike. These chips became essential building blocks and "glue" logic for microcomputers, heavily used in the Apple II for instance.

The original 7400 series branched into dozens of families with different performance characteristics but the same functionality. The 74LS (low-power Schottky) family, for instance, became very popular as it both improved speed and reduced power consumption. In the mid-1970s, 7400-series chips were introduced that used CMOS circuitry instead of TTL for dramatically lower power consumption. This CMOS family, the 74C series, was followed by numerous other CMOS families.

That brings us to the chip I examined, a member of IDT's 74FCT (Fast CMOS TTL-compatible) line of chips, introduced in the mid-1980s. (Specifically, it is in the 54FCT family because it handles a wider temperature range.) These chips used advanced CMOS technology to provide high speed, low power consumption, and as a military option, radiation tolerance.

Conclusions

Why would you make a chip in this inefficient way, using a gate array that wastes most of the die area? The motivation is that most of the design cost can be shared across many different part types. Each step of integrated circuit processing requires an expensive mask for photolithography. With a gate array, all chip types use the same underlying silicon and transistors, with custom masks just for the two metal layers. In comparison, a fully custom chip might require eight custom masks, which costs much more. The tradeoff is that gate array chips are larger so the manufacturing cost is higher per device.5 Thus, a gate array design is better when selling chips in relatively small quantities, while a custom design is cheaper when mass-producing chips.6 IDT focused on the high-performance and military market rather than the commodity chip market, so gate arrays were a good fit.

One last thing. The packaging of this chip is very interesting since it is mounted on a multi-chip module. The module also contains two Atmel EEPROMs. Presumably the decoder chip decodes address bits to select one of the EEPROMs.

The multi-chip module containing the decoder chip along with an AT28HC64 EPROM on either side.

The multi-chip module containing the decoder chip along with an AT28HC64 EPROM on either side.

Thanks to Don S. for providing the chip. Follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff.

Notes and references

  1. Properly sizing the transistors in a gate is important for performance. Since the transistors in the gate array are all the same size, multiple transistors are used in parallel to get the desired current. The 1999 book Logical Effort describes a methodology for maximizing the performance of CMOS circuits by correctly sizing the transistors. 

  2. The part number is "IDT 54FCT139ALB". "54" indicates the chip operates under an enhanced temperature range of -55°C to +125°C. The "A" indicates the chip is 35% faster than the base series (but not as fast as "C"). "L" indicates the chip is packaged in a leadless chip carrier, the square package shown at the top of the article. Finally, "B" indicates the chip was tested according to military standards: MIL-STD-883, Class B. 

  3. The chip contains 18 logic gates according to the functional schematic in the datasheet (below). The implementation actually uses 52 logic gates by my count (2×26) because the implementation doesn't exactly match the schematic. In particular, the datasheet shows three-input NAND gates, but the chip uses a NAND gate and a NOR gate along with inverters. The chip also has additional inverters to drive the output transistors in each I/O block.

    Schematic of the chip from the datasheet.

    Schematic of the chip from the datasheet.

     

  4. Integrated Device Technology was a spinoff from Hewlett Packard that started in 1980. IDT built advanced CMOS chips including fast static RAM and microprocessors (bit-slice and MIPS). It became part of Renesas in 2018. A very detailed 1986 profile of IDT is here. IDT's logo is pretty cool, combining a chip wafer and calculus.

    The logo of Integrated Device Technology.

    The logo of Integrated Device Technology.

    Here's how the logo looks on the die:

    Closeup of the die showing the IDT logo.

    Closeup of the die showing the IDT logo.

    The die also has the initials of the designers, along with some mysterious symbols. One looks like the Chinese character "正". (Update: based on a Twitter comment, these symbols are probably tally marks, indicating the revision count for each mask.)

    Closeups of two parts of the die.

    Closeups of two parts of the die.
  5. Integrated circuit manufacturing is partitioned into the "front end of line", where the transistors are created on the silicon wafer, and the "back end of line", where the metal wiring is put on top to connect the transistors. With a gate array construction, the front end of line steps create generic gate array wafers. The back end of line steps then connect the transistors as desired for a particular component. The gate array wafers can be produced in large quantities and stored, and then customized for specific products in smaller quantities as needed. This reduces the time to supply a particular chip type since only the back end of line process needs to take place. 

  6. The IDT High-Speed CMOS Logic Design Guide briefly mentions the gate array design. The FCT family was built from two sizes of gate arrays, "4004" for smaller chips and "8000" for larger chips. Later, IDT shrunk the original "Z-step" gate arrays to smaller, higher-performance "Y-step" arrays. They then customized some of the devices to create the "W-step" devices. Looking at the markings on the die, we see that this chip uses the "4004Y" gate array.

    The die shows gate slice 4004Y and part 4139Y (indicating 54139 or 74139). The numbers are slightly obscured by a bond wire.

    The die shows gate slice 4004Y and part 4139Y (indicating 54139 or 74139). The numbers are slightly obscured by a bond wire.

     

Reverse engineering CMOS, illustrated with a vintage Soviet counter chip

28 January 2024 at 17:57
I recently came across an interesting die photo of a Soviet1 chip, probably designed in the 1970s. This article provides an introductory guide to reverse-engineering CMOS circuits, using this chip as an example. Although the chip looks like a tangle of lines at first, its large features and simple layout make it possible to understand its circuits. I'll first explain how to recognize the individual transistors. Groups of transistors are connected in standard patterns to form CMOS gates, multiplexers, flip-flops, and other circuits. Once these building blocks are understood, reverse-engineering the full chip becomes practical. The chip turned out to be a 4-bit CMOS counter, a copy of the Motorola MC14516B.

Die photo of the К561ИЕ11 chip on a wafer. Image courtesy of Martin Evtimov. Click this image (or any other) for a larger version.

Die photo of the К561ИЕ11 chip on a wafer. Image courtesy of Martin Evtimov. Click this image (or any other) for a larger version.

The photo above shows the tiny silicon die under a microscope. Regions of the silicon are doped with impurities to change the silicon's electrical properties. This doping also causes regions of the silicon to appear greenish or reddish, depending on how a region is doped. (These color changes will turn out to be useful for reverse engineering.) On top of the silicon, the whitish metal layer is visible, forming the chip's connections. This chip uses metal-gate transistors, an old technology, so the metal layer also forms the gates of the transistors. Around the outside of the chip, the 16 square bond pads connect the chip to the outside world. When installed in a package, the die has tiny bond wires between the pads and the lead frame, the metal structure that connects to the chip's pins.

According to the Russian datasheet,2 the chip has 319 "elements", presumably counting the semiconductor devices. The chip has a handful of diodes to protect the inputs, so the total transistor count is a bit over 300. This transistor count is nothing compared to a modern CMOS processor with tens of billions of transistors, of course, but most of the circuit principles are the same.

NMOS and PMOS transistors

CMOS is a low-power logic family now used in almost all processors.3 CMOS (complementary MOS) circuitry uses two types of transistors, NMOS and PMOS, working together. The diagram below shows how an NMOS transistor is constructed. The transistor can be considered a switch between the source and drain, controlled by the gate. The source and drain regions (red) consist of silicon doped with impurities to change its semiconductor properties, forming N+ silicon. The gate consists of an aluminum layer, separated from the silicon by a very thin insulating oxide layer.4 (These three layers—Metal, Oxide, Semiconductor—give the MOS transistor its name.) This oxide layer is an insulator, so there is essentially no current flow through the gate, one reason why CMOS is a low-power technology. However, the thin oxide layer is easily destroyed by static electricity, making MOS integrated circuits sensitive to electrostatic discharge.

Structure of an NMOS transistor.

Structure of an NMOS transistor.

A PMOS transistor (below) has the opposite configuration from an NMOS transistor: the source and drain are doped to form P+ regions, while the underlying bulk silicon is N-type silicon. The doping process is interesting, but I'll leave the details to a footnote.5

Structure of a PMOS transistor.

Structure of a PMOS transistor.

The NMOS and PMOS transistors are opposite in their construction and operation; this is the "Complementary" in CMOS. An NMOS transistor turns on when the gate is high, while a PMOS transistor turns on when the gate is low. An NMOS transistor is best at pulling its output low, while a PMOS transistor is best at pulling its output high. In a CMOS circuit, the transistors work as a team, pulling the output high or low as needed. The behavior of MOS transistors is complicated, so this description is simplified, just enough to understand digital circuits.

If you buy an MOS transistor from an electronics supplier, it comes as a package with three leads for the source, gate, and drain. The source and drain are connected differently inside the package and are not interchangeable in a circuit. In an integrated circuit, however, the transistor is symmetrical and the source and drain are the same. For that reason, I won't distinguish between the source and the drain in the following discussion. I will use the symmetrical symbols below for NMOS and PMOS transistors; the inversion bubble on the PMOS gate symbolizes that a low signal activates the PMOS transistor.

Symbols for NMOS and PMOS transistors.

Symbols for NMOS and PMOS transistors.

One complication is that NMOS transistors are built on P-type silicon, while PMOS transistors are built on N-type silicon. Since the silicon die itself is N silicon, the NMOS transistors need to be surrounded by a tub or well of P silicon.6 The cross-section diagram below shows how the NMOS transistor on the right is embedded in the well of P-type silicon. Constructing two transistor types with opposite behaviors makes manufacturing more complex, one reason why CMOS took years to catch on. CMOS was invented in 1963 at Fairchild Semiconductor, but RCA was the main proponent of CMOS, commercializing it in the late 1960s. Although RCA produced a CMOS microprocessor in 1974, mainstream microprocessors didn't switch to CMOS until the mid-1980s with chips such as the Motorola 68020 (1984) and the Intel 386 (1986).

Cross-section of CMOS transistors.

Cross-section of CMOS transistors.

For proper operation, the silicon that surrounds transistors needs to be connected to the appropriate voltage through "tap" contacts.7 For PMOS transistors, the substrate is connected to power through the taps, while for NMOS transistors the well region is connected to ground through the taps. When reverse-engineering, the taps can provide important clues, indicating which regions are NMOS and which are PMOS. As will be seen below, these voltages are also important for understanding the circuitry of this chip.

The die photo below shows two transistors as they appear on the die. The appearance of transistors varies between different integrated circuits, so a first step of reverse engineering is determining how they look in a particular chip. In this IC, a transistor gate can be distinguished by a large rectangular region over the silicon. (In other metal-gate transistors, the gate often has a "bubble" appearance.) The interactions between the metal wiring and the silicon can be distinguished by subtle differences. For the most part, the metal wiring passes over the silicon, isolated by thick insulating oxide. A contact between metal and silicon is recognizable by a smaller oval region that is slightly darker; wires are connected to the transistor sources and drains below. MOS transistors often don't have discrete boundaries; as will be seen later, the source of one transistor can overlap with the drain of another.

Two transistors on the die.

Two transistors on the die.

Distinguishing PMOS and NMOS transistors can be difficult. On this chip, P-type silicon appears greenish, and N-type silicon appears reddish. Thus, PMOS transistors appear as a green region surrounded by red, while NMOS is the opposite. Moreover, PMOS transistors are generally larger than NMOS transistors because they are weaker. Another way to distinguish them is by their connection in circuits. As will be seen below, PMOS transistors in logic gates are connected to power while NMOS transistors are connected to ground.

Metal-gate transistors are a very old technology, mostly replaced by silicon-gate transistors in the 1970s. Silicon-gate circuitry uses an additional layer of polysilicon wiring. Moreover, modern ICs usually have more than one layer of metal. The metal-gate IC in this post is easier to understand than a modern IC, since there are fewer layers to analyze. The CMOS principles are the same in modern ICs, but the layout will appear different.

Implementing an inverter in CMOS

The simplest CMOS gate is an inverter, shown below. Although basic, it illustrates most of the principles of CMOS circuitry. The inverter is constructed from a PMOS transistor on top to pull the output high and an NMOS transistor below to pull the output low. The input is connected to the gates of both transistors.

A CMOS inverter is constructed from a PMOS transistor (top) and an NMOS transistor (bottom).

A CMOS inverter is constructed from a PMOS transistor (top) and an NMOS transistor (bottom).

Recall that an NMOS transistor is turned on by a high signal on the gate, while a PMOS transistor is the opposite, turned on by a low signal. Thus, when the input is high, the NMOS transistor (bottom) turns on, pulling the output low. When the input is low, the PMOS transistor (top) turns on, pulling the output high. Notice how the transistors act in opposite (i.e. complementary) fashion.

How the inverter functions.

How the inverter functions.

An inverter on the die is shown below. The PMOS and NMOS transistors are indicated by red boxes and the transistors are connected according to the schematics above. The input is connected to the gates of the two transistors, which can be distinguished as larger metal rectangles. On the right, two contacts connect the transistor drains to the output. The power and ground connections are a bit different from most chips since the metal lines appear to not go anywhere. The short metal line labeled "power" connects the PMOS transistor's source to the substrate, the reddish silicon that surrounds the transistor. As described earlier, the substrate is connected to the chip's power. Thus, the transistor receives its power through the substrate silicon. This approach isn't optimal, due to the relatively high resistance of silicon, but it simplifies the wiring. Similarly, the ground metal connects the NMOS transistor's source to the well that surrounds the transistor, P-type silicon that appears green. Since the well is grounded, the transistor has its ground connection.

An inverter on the die.

An inverter on the die.

Some inverters look different from the layout above. Many of the chip's inverters are constructed as two inverters in parallel to provide twice the output current. This gives the inverter more "fan-out", the ability to drive the inputs of a larger number of gates.8 The diagram below shows a doubled inverter, which is essentially the previous inverter mirrored and copied, with two PMOS transistors at the top and two NMOS transistors at the bottom. Note that there is no explicit boundary between the paired transistors; their drains share the same silicon. Consequently, each output contact is shared between two transistors, rather than being duplicated.

An inverter consisting of two inverters in parallel.

An inverter consisting of two inverters in parallel.

Another style of inverter drives the chip's output pins. The output pins require high current to drive external circuitry. The chip uses much larger transistors to provide this current. Nonetheless, the output driver uses the same inverter circuit described earlier, with a PMOS transistor to put the output high and an NMOS transistor to pull the output low. The photo below shows one of these output inverters on the die. To fit the larger transistors into the available space, the transistors have a serpentine layout, with the gate winding between the source and the drain. The inverter's output is connected to a bond pad. When the die is mounted in a package, tiny bond wires connect the pads to the external pins.

An output driver is an inverter, built with much larger transistors.

An output driver is an inverter, built with much larger transistors.

NOR and NAND gates

Other logic gates are constructed using the same concepts as the inverter, but with additional transistors. In a NOR gate, the PMOS transistors on top are in series, so the output will be pulled high if all inputs are 0. The NMOS transistors on the bottom are in parallel, so the output will be pulled low if any input is 1. Thus, the circuit implements the NOR function. Again, note the complementary action: the PMOS transistors pull the output high, while the NMOS transistors pull the output low. Moreover, the PMOS transistors are in series, while the NMOS transistors are in parallel. The circuit below is a 3-input NOR gate; different numbers of inputs are supported similarly. (With just one input, the circuit becomes an inverter, as you might expect.)

A 3-input NOR gate implemented in CMOS.

A 3-input NOR gate implemented in CMOS.

For any gate implementation, the input must be either pulled high by the PMOS side, or pulled low by the NMOS side. If both happen simultaneously for some input, power and ground would be shorted, possibly destroying the chip. If neither happens, the output would be floating, which is bad in a CMOS circuit.9 In the NOR gate above, you can see that for any input the output is always pulled either high or low as required. Reverse engineering tip: if the output is not always pulled high or low, you probably made a mistake in either the PMOS circuit or the NMOS circuit.10

The diagram below shows how a 3-input NOR gate appears on the die.11 The transistor gates are the thick vertical metal rectangles; PMOS transistors are on top and NMOS below. The three PMOS transistors are in series between power on the left and the output connection on the right. As with the inverter, the power and ground connections are wired to the bulk silicon, not to the chip's power and ground lines.

A 3-input NOR gate as it is implemented on the die. The "extra" PMOS transistor on the left is part of a different gate.

A 3-input NOR gate as it is implemented on the die. The "extra" PMOS transistor on the left is part of a different gate.

The layout of the NMOS transistors is more complicated because it is difficult to wire the transistors in parallel with just one layer of metal. The output wire connects between the first and second transistors as well as to the third transistor. An unusual feature is the connection of the second and third NMOS transistors to ground is done by a horizontal line of doped silicon (reddish "silicon path" indicated by the dotted line). This silicon extends from the ground metal to the region between the two transistors. Finally, note that the PMOS transistors are much larger than the NMOS transistors. This is both because PMOS transistors are inherently less efficient and because transistors in series need to be lower resistance to avoid degrading the output signal. Reverse-engineering tip: It's often easier to recognize the transistors in series and then use that information to determine which transistors must be in parallel.

A NAND gate is implemented by swapping the roles of the series and parallel transistors. That is, the PMOS transistors are in parallel, while the NMOS transistors are in series. For example, the schematic below shows a 4-input NAND gate. If all inputs are 1, the NMOS transistors will pull the output low. If any input is a 0, the corresponding PMOS transistor will pull the output high. Thus, the circuit implements the NAND function.

A 4-input NAND gate implemented in CMOS.

A 4-input NAND gate implemented in CMOS.

The diagram below shows a four-input NAND gate on the die. In the bottom half, four NMOS transistors are in series, while in the top half, four PMOS transistors are in parallel. (Note that the series and parallel transistors are switched compared to the NOR gate.) As in the NOR gate, the power and ground are provided by metal connections to the bulk silicon (two connections for the power). The parallel PMOS circuit uses a "silicon path" (green) to connect each transistor to the output without intersecting the metal. In the middle, this silicon has a vertical metal line on top; this reduces the resistance of the silicon path. The NMOS transistors are larger than the PMOS transistors in this case because the NMOS transistors are in series.

A four-input NAND gate as it appears on the die.

A four-input NAND gate as it appears on the die.

Complex gates

More complex gates such as AND-NOR (AND-OR-INVERT) can also be constructed in CMOS; these gates are commonly used because they are no harder to build than NAND or NOR gates. The schematic below shows an AND-NOR gate. To understand its construction, look at the paths to ground through the NMOS transistors. The first path is through A, B, and C. If these inputs are all high, the output is low, implementing the AND-INVERT side of the gate. The second path is through D, which will pull the output low by itself, implementing the OR-INVERT side of the gate. You can verify that the PMOS transistors pull the output high in the necessary circumstances. Observe that the D transistor is in series on the PMOS side and in parallel on the NMOS side, again showing the complementary nature of these circuits.

An AND-NOR gate.

An AND-NOR gate.

The diagram below shows this AND-NOR gate on the die, with the four inputs A, B, C, and D, corresponding to the schematic above. This gate has a few tricky layout features. The biggest surprise is that there is half of another gate (a 3-input NOR gate) in the middle of this gate. Presumably, the designers found this arrangement efficient since the other gate also uses inputs A, B, and C. The output of the other gate (D) is an input to the gate we're examining. Ignoring the other gate, the AND-NOR gate has the NMOS transistors in the first column, on top of a reddish band, and the PMOS transistors in the third column, on top of a greenish band. Hopefully you can recognize the transistor gates, the large rectangles connected to A, B, C, and D. Matching the schematic above, there are three NMOS transistors in series on the left, connected to A, B, and C, as well as the D transistor providing a second path between ground and the output. On the PMOS side, the A, B, and C transistors are in parallel, and then connected through the D transistor to the output. The green "silicon path" on the right provides the parallel connection from transistors A and B to transistors C and D. Most of this path is covered by two long metal regions, reducing the resistance. But in order to cross under wires B and C, the metal has a break where the green silicon provides the connection.

An AND-NOR gate on the die.

An AND-NOR gate on the die.

As with the other gates, the power is obtained by a connection to the bulk silicon, bridging the red and green regions. If you look closely, there is a green band ("silicon path") down from the power connection and joining the main green region between the B and C transistors, providing power to those transistors through the silicon. The NMOS transistors, on the other hand, have ground connections at the top and bottom. For this circuit, ground is supplied through solid metal wires at the top and the bottom, rather than a connection to the bulk silicon.

A few principles help when reverse-engineering logic gates. First, because of the complementary nature of CMOS, the output must either be pulled high by the PMOS transistors or pulled low by the NMOS transistors. Thus, one group or the other must be activated for each possible input. This implies that the same inputs must go to both the NMOS and PMOS transistors. Moreover, the structures of the NMOS and PMOS circuits are complementary: where the NMOS transistors are parallel, the PMOS transistors must be in series, and vice versa. In the case of the AND-NOR circuit above, these principles are helpful. For instance, you might not spot the "silicon paths", but since the PMOS half must be complementary to the NMOS half, you know that those connections must exist.

Even complex gates can be reverse-engineered by breaking the NMOS transistors into series and parallel groups, corresponding to AND and OR terms. Note that MOS transistors are inherently inverting, so a single gate will always end with inversion. Thus, you can build an AND-OR-AND-NOR gate for instance, but you can't build an AND gate as a single circuit.

Transmission gate

Another key circuit is the transmission gate. This acts as a switch, either passing a signal through or blocking it. The schematic below shows how a transmission gate is constructed from two transistors, an NMOS transistor and a PMOS transistor. If the enable line is high (i.e. low to the PMOS transistor) both transistors turn on, passing the input signal to the output. The NMOS transistor primarily passes a low signal, while the PMOS transistor passes a high signal, so they work together. If the enable line is low, both transistors turn off, blocking the input signal. The schematic symbol for a transmission gate is shown on the right. Note that the transmission gate is bidirectional; it doesn't have a specific input and output. Examining the surrounding circuitry usually reveals which side is the input and which side is the output.

A transmission gate is constructed from two transistors. The transistors and their gates are indicated. The schematic symbol is on the right.

A transmission gate is constructed from two transistors. The transistors and their gates are indicated. The schematic symbol is on the right.

The photo below shows how a transmission gate appears on the die. It consists of a PMOS transistor at the top and an NMOS transistor at the bottom. Both the enable signal and the complemented enable signal are used, one for the NMOS transistor's gate and one for the PMOS transistor.

A transmission gate on the die, consisting of two transistors.

A transmission gate on the die, consisting of two transistors.

The inverter and transmission gate are both two-transistor circuits, but they can be easily distinguished for reverse engineering. One difference is that an inverter is connected to power and ground, while the transmission gate is unpowered. Moreover, the inverter has one input, while the transmission gate has three inputs (counting the control lines). In the inverter, both transistor gates have the same input, so one transistor turns on at a time. In the transmission gate, however, the gates have opposite inputs, so the transistors turn on or off together.

One useful circuit that can be built from transmission gates is the multiplexer, a circuit that selects one of two (or more) inputs. The multiplexer below selects either input inA or inB and connects it to the output, depending if the selection line selA is high or low respectively. The multiplexer can be built from two transmission gates as shown. Note that the select lines are flipped on the second transmission gate, so one transmission gate will be activated at a time. Multiplexers with more inputs can be built by using more transmission gates with additional select lines.

Schematic symbol for a multiplexer and its implementation with two transmission gates.

Schematic symbol for a multiplexer and its implementation with two transmission gates.

The die photo below shows a block of transmission gates consisting of six PMOS transistors and six NMOS transistors. The labels on the metal lines will make more sense as the reverse engineering progresses. Note that the metal layer provides much of the wiring for the circuit, but not all of it. Much of the wiring is implicit, in the sense that neighboring transistors are connected because the source of one transistor overlaps the drain of another.

A block of transistors implementing multiple transmission gates.

A block of transistors implementing multiple transmission gates.

While this may look like an incomprehensible block of zig-zagging lines, tracing out the transistors will reveal the circuitry (below). The wiring in the schematic matches the physical layout on the die, so the schematic is a bit of a mess. With a single layer of metal for wiring, the layout becomes a bit convoluted to avoid crossing wires. (The only wire crossing in this image is in the upper left for wire X; the signal uses a short stretch of silicon to pass under the metal.)

Schematic of the previous block of transistors.

Schematic of the previous block of transistors.

Looking at the PMOS and NMOS transistors as pairs reveals that the circuit above is a chain of transmission gates (shown below). It's not immediately obvious which wires are inputs and which wires are outputs, but it's a good guess that pairs of transmission gates using the opposite control lines form a multiplexer. That is, inputs A and C are multiplexed to output B, inputs C and E are multiplexed to output D, and so forth. As will be seen, these transmission gates form multiplexers that are part of a flip-flop.

The transistors form six transmission gates.

The transistors form six transmission gates.

Latches and flip-flops

Flip-flops and latches are important circuits, able to hold one bit and controlled by a clock signal. Terminology is inconsistent, but I'll use flip-flop to refer to an edge-triggered device and latch to refer to a level-triggered device. That is, a flip-flop will grab its input at the moment the clock signal goes high (i.e. it uses the clock edge), store it, and provide it as the output, called Q for historical reasons. A latch, on the other hand, will take its input, store it, and output it as long as the clock is high (i.e. it uses the clock level). The latch is considered "transparent", since the input immediately appears on the output if the clock is high.

The distinction between latches and flip-flops may seem pedantic, but it is important. Flip-flops will predictably update once per clock cycle, while latches will keep updating as long as the clock is high. By connecting the output of a flip-flop through an inverter back to the input, you can create a toggle flip-flop, which will flip its state once per clock cycle, dividing the clock by two. (This example will be important shortly.) If you try the same thing with a transparent latch, it will oscillate: as soon as the output flips, it will feed back to the latch input and flip again.

The schematic below shows how a latch can be implemented with transmission gates. When the clock is high, the first transmission gate passes the input through to the inverters and the output. When the clock is low, the second transmission gate creates a feedback loop for the inverters, so they hold their value, providing the latch action. Below, the same circuit is drawn with a multiplexer, which may be easier to understand: either the input or the feedback is selected for the inverters.

A latch implemented from transmission gates. Below, the same circuit is shown with a multiplexer.

A latch implemented from transmission gates. Below, the same circuit is shown with a multiplexer.

An edge-triggered flip-flop can be created by combining two latches in a primary/secondary arrangement. When the clock is low, the input will pass into the primary latch. When the clock switches high, two things happen. The primary latch will hold the current value of the input. Meanwhile, the secondary latch will start passing its input (the value from the primary latch) to its output, and thus the flip-flop output. The effect is that the flip-flop's output will be the value at the moment the clock goes high, and the flip-flop is insensitive to changes at other times. (The primary latch's value can keep changing while the clock is low, but this doesn't affect the flip-flop's output.)

Two latches, combined to form a flip-flop.

Two latches, combined to form a flip-flop.

The flip-flops in the counter chip are based on the above design, but they have two additional features. First, the flip-flop can be loaded with a value under the control of a Preset Enable (PE) signal. Second, the flip-flop can either hold its current value or toggle its value, under the control of a Toggle (T) signal. Implementing these features requires two more multiplexers in the primary latch as shown below. The first multiplexer selects either the inverted output or uninverted output to be fed back into the flip flop, providing the selectable toggle action. The second multiplexer is the latch's standard clocked multiplexer. The third multiplexer allows a "preset" value to be loaded directly into the flip-flop, bypassing the clock. (The preset value is inverted, since there are three inverters between the preset and the output.) The secondary latch is the same as before, except it provides the inverted and non-inverted outputs as feedback, allowing the flip-flop to either hold or toggle its value. This circuit illustrates how more complex flip-flops can be created from the building blocks that we've seen.

Schematic of the toggle flip-flop.

Schematic of the toggle flip-flop.

The gray letters in the schematic above match the earlier multiplexer diagram, showing how the three multiplexers were implemented on the die. The other multiplexer and the inverters are implemented in another block of circuitry. I won't explain that circuitry in detail since it doesn't illustrate any new principles.

Routing in silicon: cross-unders

With just one metal layer for wiring, routing of signals on the chip is difficult and requires careful planning. Even so, there are some cases where one signal must cross another. This is accomplished by using silicon for a "cross-under", allowing a signal to pass underneath metal wiring. These cross-unders are avoided unless necessary because silicon has much higher resistance than metal. Moreover, the cross-under requires additional space on the die.

Three cross-unders on the die.

Three cross-unders on the die.

The images above show three cross-unders. In each one, signals are primarily routed in the metal layer, but a signal passes under the metal using a doped silicon region (which appears green). The first cross-under simply lets one signal cross under the second. The second image shows a signal branching as well as crossing under two signals. The third image shows a cross-under distributing a horizontal signal to the upper and lower halves of the chip, while crossing under multiple horizontal signals. Note the small oval contact between the green silicon region and the horizontal metal line, connecting them. It is easy to miss the small contact and think that the vertical signal is simply crossing under the horizontal signal, rather than branching.

About the chip

The focus of this article is the CMOS reverse engineering process rather than this specific chip, but I'll give a bit of information about the chip. The die has the Cyrillic characters ИЕ11 at the top indicating that the chip is a К561ИЕ11 or К564ИЕ11.12 The Soviet Union came up with a standardized numbering system for integrated circuits in 1968. This system is much more helpful than the American system of semi-random part numbers. In this part number, the 5 indicates a monolithic integrated circuit, while 61 or 64 is the series, specifically commercial-grade or military-grade clones of 4000 series CMOS logic. The character И indicates a digital circuit, while ИЕ is a counter. Thus, the part number systematically indicates that the integrated circuit is a CMOS counter.

The 561ИЕ11 turns out to be a copy of the Motorola MC14516 binary up/down counter.13 Conveniently, the Motorola datasheet provides a schematic (below). I won't explain the schematic in detail, but a quick overview may be helpful. The chip is a four-bit counter that can count up or down, and the heart of the chip is the four toggle flip-flops (red). To count up, a flip-flop is toggled if there is a carry from the lower bits, while counting down toggles a flip-flop if there is a borrow from the lower bits. (Much like base-10 long addition or subtraction.) The AND/NOR gates at the bottom (blue) look complex, but they are just generating the toggle signal T: toggle if the lower bits are all-1's and you're counting up, or if the lower bits are all-0's and you're counting down. The flip-flops can also be loaded in parallel from the P inputs. Additional logic allows the chips to be cascaded to form arbitrarily large counters; the carry-out pin of one chip is connected to the carry-in of the next.

Logic diagram of the MC14516 up/down counter chip, from the datasheet.

Logic diagram of the MC14516 up/down counter chip, from the datasheet.

I've labeled the die photo below with the pin functions and the functional blocks. Each quadrant of the chip handles one bit of the counter in a roughly symmetrical way. This quadrant layout accounts for the pin arrangement which otherwise appears semi-random with bits 3 and 0 on one side and bits 2 and 1 on the other, with inputs and output pins jumbled together. The toggle and carry logic is squeezed into the top and middle of the chip. You may recognize the large inverters next to each output pin. When reverse-engineering, look for large transistors next to pads to determine which pins are outputs.

The die with pins and functional blocks labeled.

The die with pins and functional blocks labeled.

Conclusions

This article has discussed the basic circuits that can be found in a CMOS chip. Although the counter chip is old and simple, later chips use the same principles. An important change in later chips is the introduction of silicon-gate transistors, which use polysilicon for the transistor gates and for an additional wiring layer. The circuits are the same, but you need to be able to recognize the polysilicon layer. Many chips have more than one metal layer, which makes it very hard to figure out the wiring connections. Finally, when the feature size approaches the wavelength of light, optical microscopes break down. Thus, these reverse-engineering techniques are only practical up to a point. Nonetheless, many interesting CMOS chips can be studied and reverse-engineered.

For more, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon as @kenshirriff@oldbytes.space. Thanks to Martin Evtimov for providing the die photos.

Notes and references

  1. I'm not sure of the date and manufacturing location of the chip. I think the design is old, from the Soviet Union. (Motorola introduced the MC14516 around 1972 but I don't know when it was copied.) The wafer is said to be scrap from a Ukrainian manufacturer so it may have been manufactured more recently. The die has a symbol that might be a manufacturing logo, but nobody on Twitter could identify it.

    A symbol that appears on the die.

    A symbol that appears on the die.

     

  2. For more about this chip, the Russian databook can be downloaded here; see Volume 5 page 501. 

  3. Early CMOS microprocessors include the 8-bit RCA 1802 COSMAC (1974) and the 12-bit Intersil 6100 (1974). The 1802 is said to be the first CMOS microprocessor. Mainstream microprocessors didn't switch to CMOS until the mid-1980s. 

  4. The chip in this article has metal-gate transistors, with aluminum forming the transistor gate. These transistors were not as advanced as the silicon-gate transistors that were developed in the late 1960s. Silicon gate technology was much better in several ways. First, silicon-gate transistors were smaller, faster, more reliable, and used lower voltages. Second, silicon-gate chips have a layer of polysilicon wiring in addition to the metal wiring; this made chip layouts about twice as dense. 

  5. To produce N-type silicon, the silicon is doped with small amounts of an element such as phosphorus or arsenic. In the periodic table, these elements are one column to the right of silicon so they have one "extra" electron. The free electrons move through the silicon, carrying charge. Because electrons are negative, this type of silicon is called N-type. Conversely, to produce P-type silicon, the silicon is doped with small quantities of an element such as boron. Since boron is one column to the left of silicon in the periodic table, it has one fewer free electrons. A strange thing about semiconductor physics is that the missing electrons (called holes) can move around the silicon much like electrons, but carrying positive charge. Since the charge carriers are positive, this type of silicon is called P-type. For various reasons, electrons carry charge better than holes, so NMOS transistors work better than PMOS transistors. As a result, PMOS transistors need to be about twice the size of comparable NMOS transistors. This quirk is useful for reverse engineering, since it can help distinguish NMOS and PMOS transistors.

    The amount of doping required can be absurdly small, 20 atoms of boron for every billion atoms of silicon in some cases. A typical doping level for N-type silicon is 1015 atoms of phosphorus or arsenic per cubic centimeter, which sounds like a lot until you realize that pure silicon consists of 5×1022 atoms per cubic centimeter. A heavily doped P+ region might have 1020 dopant atoms per cubic centimeter, one atom of boron per 500 atoms of silicon. (Doping levels are described here.) 

  6. This chip is built on a substrate of N-type silicon, with wells of P-type silicon for the NMOS transistors. Chips can be built the other way around, starting with P-type silicon and putting wells of N-type silicon for the PMOS transistors. Another approach is the "twin-well" CMOS process, constructing wells for both NMOS and PMOS transistors. 

  7. The bulk silicon voltage makes the boundary between a transistor and the bulk silicon act as a reverse-biased diode, so current can't flow across the boundary. Specifically, for a PMOS transistor, the N-silicon substrate is connected to the positive supply. For an NMOS transistor, the P-silicon well is connected to ground. A P-N junction acts as a diode, with current flowing from P to N. But the substrate voltages put P at ground and N at +5, blocking any current flow. The result is that the bulk silicon can be considered an insulator, with current restricted to the N+ and P+ doped regions. If this back bias gets reversed, for example, due to power supply fluctuations, current can flow through the substrate. This can result in "latch-up", a situation where the N and P regions act as parasitic NPN and PNP transistors that latch into the "on" state. This shorts power and ground and can destroy the chip. The point is that the substrate voltages are very important for proper operation of the chip. 

  8. Many inverters in this chip duplicate the transistors to increase the current output. The same effect could be achieved with single transistors with twice the gate width. (That is, twice the height in the diagrams.) Because these transistors are arranged in uniform rows, doubling the transistor height would mess up the layout, so using more transistors instead of changing the size makes sense. 

  9. Some chips use dynamic logic, in which case it is okay to leave the gate floating, neither pulled high nor low. Since the gate resistance is extremely high, the capacitance of a gate will hold its value (0 or 1) for a short time. After a few milliseconds, the charge will leak away, so dynamic logic must constantly refresh its signals before they decay.

    In general, the reason you don't want an intermediate voltage as the input to a CMOS circuit is that the voltage might end up turning the PMOS transistor partially on while also turning the NMOS transistor partially on. The result is high current flow from power to ground through the transistors. 

  10. One of the complicated logic gates on the die didn't match the implementation I expected. In particular, for some inputs, the output is neither pulled high nor low. Tracing the source of these inputs reveals what is going on: the gate takes both a signal and its complement as inputs. Thus, some of the "theoretical" inputs are not possible; these can't be both high or both low. The logic gate is optimized to ignore these cases, making the implementation simpler. 

  11. This schematic explains the physical layout of the 3-input NOR gate on the die, in case the wiring isn't clear. Note that the PMOS transistors are wired in series and the NMOS transistors are in parallel, even though both types are physically arranged in rows.

    The 3-input NOR gate on the die. This schematic matches the physical layout.

    The 3-input NOR gate on the die. This schematic matches the physical layout.

     

  12. The commercial-grade chips and military-grade chips presumably use the same die, but are distinguished by the level of testing. So we can't categorize the die as 561-series or 564-series. 

  13. Motorola introduced the MC14500 series in 1971 to fill holes in the CD4000 series. For more about this series, see A Strong Commitment to Complementary MOS

Interesting double-poly latches inside AMD's vintage LANCE Ethernet chip

31 December 2023 at 18:18

I've studied a lot of chips from the 1970s and 1980s, so I usually know what to expect. But an Ethernet chip from 1982 had something new: a strange layer of yellow wiring on the die. After some study, I learned that the yellow wiring is a second layer of resistive polysilicon, used in the chip's static storage cells and latches.

A closeup of the die of the LANCE chip. The metal has been removed to show the layers underneath.

A closeup of the die of the LANCE chip. The metal has been removed to show the layers underneath.

The die photo above shows a closeup of a latch circuit, with the diagonal yellow stripe in the middle. For this photo, I removed the chip's metal layer so you can see the underlying circuitry. The bottom layer, silicon, appears gray-purple under the microscope, with the active silicon regions slightly darker and bordered in black. On top of the silicon, the pink regions are polysilicon, a special type of silicon. Polysilicon has a critical role in the chip: when it crosses active silicon, polysilicon forms the gate of a transistor. The circles are contacts between the metal layer and the underlying silicon or polysilicon. So far, the components of the chip match most NMOS chips of that time. But what about the bright yellow line crossing the circuit diagonally? That was new to me. This second layer of polysilicon provides resistance. It crosses over the other layers, connected to the silicon at the ends with a complex ring structure.

Why would you want high-resistance wiring in your digital chip? To understand this, let's first look at how a bit can be stored. An efficient way to store a bit is to connect two inverters in a loop, as shown below. Each inverter sends the opposite value to the other inverter, so the circuit will be stable in two states, holding one bit: a 1 or a 0.

Two cross-coupled inverters can store either a 0 or a 1 bit.

Two cross-coupled inverters can store either a 0 or a 1 bit.

But how do you store a new value into the inverter loop? There are a few techniques. One is to use pass transistors to break the loop, allowing a new value to be stored. In the schematic below, if the hold signal is activated, the transistor turns on, completing the loop. But if hold is dropped and load is activated, a new value can be loaded from the input into the inverter loop.

A latch, controlled by pass transistors.

A latch, controlled by pass transistors.

An alternative is to use a weak inverter that produces a low-current output. In this case, the input signal can simply overpower the value produced by the inverter, forcing the loop into a new state. The advantage of this circuit is that it eliminates the "hold" transistor. However, a weak inverter turns out to be larger than a regular inverter, negating much of the space saving.1 (The Intel 386 processor uses this type of latch.)

A latch using a weak inverter.

A latch using a weak inverter.

A third alternative, used in the Ethernet chip, is to use a resistor for the feedback, limiting the current.2 As in the previous circuit, the input can overpower the low feedback current. However, this circuit is more compact since it doesn't require a larger inverter. The resistor doesn't require additional space since it can overlap the rest of the circuitry, as shown in the photo at the top of the article. The disadvantage is that manufacturing the die requires additional processing steps to create the resistive polysilicon layer.

A latch using a resistor for feedback.

A latch using a resistor for feedback.

In the Ethernet chip, this type of latch is used in many circuits. For example, shift registers are built by connecting latches in sequence, controlled by the clock signals. Latches are also used to create binary counters, with the latch value toggled when the lower bits produce a carry.

The SRAM cell

It would be overkill to create a separate polysilicon layer just for a few latches. It turns out that the chip was constructed with AMD's "64K dynamic RAM process". Dynamic RAM uses tiny capacitors to store data. In the late 1970s, dynamic RAM chips started using a "double-poly" process with one layer of polysilicon to form the capacitors and a second layer of polysilicon for transistor gates and wiring (details).

The double-poly process was also useful for shrinking the size of static RAM.3 The Ethernet chip contains several blocks of storage buffers for various purposes. These blocks are implemented as static RAM, including a 22×16 block, a 48×9 block, and a 16×7 block. The photo below shows a closeup of some storage cells, showing how they are arranged in a regular grid. The yellow lines of resistive polysilicon are visible in each cell.

A block of 28 storage cells in the chip. Some of the second polysilicon layer is damaged.

A block of 28 storage cells in the chip. Some of the second polysilicon layer is damaged.

A static RAM storage cell is roughly similar to the latch cell, with two inverters in a loop to store each bit. However, the storage is arranged in a grid: each row corresponds to a particular word, and each column corresponds to the bits in a word. To select a word, a word select line is activated, turning on the pass transistors in that row. Reading and writing the cell is done through a pair of bitlines; each bit has a bitline and a complemented bitline. To read a word, the bits in the word are accessed through the bitlines. To write a word, the new value and its complement are applied to the bitlines, forcing the inverters into the desired state. (The bitlines must be driven with a high-current signal that can overcome the signal from the inverters.)

Schematic of one storage cell.

Schematic of one storage cell.

The diagram below shows the physical layout of one memory cell, consisting of two resistors and four transistors. The black lines indicate the vertical metal wiring that was removed. The schematic on the right corresponds to the physical arrangement of the circuit. Each inverter is constructed from a transistor and a pull-up resistor, and the inverters are connected into a loop. (The role of these resistors is completely different from the feedback resistors in the latch.) The two transistors at the bottom are the pass transistors that provide access to the cell for reads or writes.

One memory cell static memory cell as it appears on the die, along with its schematic.

One memory cell static memory cell as it appears on the die, along with its schematic.

The layout of this storage cell is highly optimized to minimize its area. Note that the yellow resistors take almost no additional area, as they overlap other parts of the cell. If constructed without resistors, each inverter would require an additional transistor, making the cell larger.

To summarize, although the double-poly process was introduced for DRAM capacitors, it can also be used for SRAM cell pull-up resistors. Reducing the size of the SRAM cells was probably the motivation to use this process for the Ethernet chip, with the latch feedback resistors a secondary benefit.

The Am7990 LANCE Ethernet chip

I'll wrap up with some background on the AMD Ethernet chip. Ethernet was invented in 1973 at Xerox PARC and became a standard in 1980. Ethernet was originally implemented with a board full of chips, mostly TTL. By the early 1980s, companies such as Intel, Seeq, and AMD introduced chips to put most of the circuitry onto VLSI chips. These chips reduced the complexity of Ethernet interface hardware, causing the price to drop from $2000 to $1000.

The chip that I'm examining is AMD's Am7990 LANCE (Local Area Network Controller for Ethernet). This chip implemented much of the functionality for Ethernet and "Cheapernet" (now known as 10BASE2 Ethernet). The chip handles serial/parallel conversion, computing the 32-bit CRC checksum, handling collisions and backoff, and recognizing desired addresses. The chip also provides DMA access for interfacing with a microcomputer.

The chip doesn't handle everything, though. It was designed to work with an Am7992 Serial Interface Adapter chip that encodes and decodes the bitstream using Manchester encoding. The third chip was the Am7996 transceiver that handled the low-level signaling and interfacing with the coaxial network cable, as well as detecting collisions if two nodes transmitted at the same time.

The LANCE chip is fairly complicated. The die photo below shows the main functional blocks of the chip. The chip is controlled by the large block of microcode ROM in the lower right. The large dark rectangles are storage, implemented with the static RAM cells described above. The chip has 48 pins, connected by tiny bond wires to the square pads around the edges of the die.

Main functional blocks of the LANCE chip.

Main functional blocks of the LANCE chip.

Thanks to Robert Garner for providing the AMD LANCE chip and information, thanks to a bunch of people on Twitter for discussion, and thanks to Bob Patel for providing the functional block labeling and other information. For more, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space.

Notes and references

  1. It may seem contradictory for a weak inverter to be larger than a regular inverter, since you'd expect that the bigger the transistor, the stronger the signal. It turns out, however, that creating a weak signal requires a larger transistor, due to how MOS transistors are constructed. The current from a transistor is proportional to the gate's width divided by the length. Thus, to create a more powerful transistor, you increase the width. But to create a weak transistor, you can't decrease the width because the minimum width is limited by the manufacturing process. Thus, you need to increase the gate's length. The result is that both stronger and weaker transistors are larger than "normal" transistors. 

  2. You might worry that the feedback resistor will wastefully dissipate power. However, the feedback current is essentially zero because NMOS transistor gates are essentially insulators. Thus, the resistor only needs to pass enough current to charge or discharge the gate. 

  3. An AMD patent describes the double-poly process as well as the static RAM cell; I'm not sure this is the process used in the Ethernet chip, but I expect the process is similar. The diagram below shows the RAM cell with its two resistors. The patent describes how the resistors and second layer of wiring are formed by a silicide/polysilicon ("inverted polycide") sandwich. (The silicide is a low-resistance compound of tantalum and silicon or molybdenum and silicon.) Specifically, the second layer consists of a buffer layer of polysilicon, a thicker silicide layer, and another layer of polysilicon forming the low-resistance "sandwich". Where resistance is desired, the bottom two layers of "sandwich" are removed during fabrication to leave just a layer of polysilicon. This polysilicon is then doped through implantation to give it the desired resistance.

    The static RAM cell from patent 4569122, "Method of forming a low resistance quasi-buried contact".

    The static RAM cell from patent 4569122, "Method of forming a low resistance quasi-buried contact".

    The patent also describes using the second layer of polysilicon to provide a connection between silicon and the main polysilicon layer. Chips normally use a "buried contact" to connect silicon and polysilicon, but the patent describes how putting the second layer of polysilicon on top reduces the alignment requirements for a low-resistance contact. I think this explains the yellow ring of polysilicon around all the silicon/polysilicon contacts in the chip. (These rings are visible in the die photo at the top of the article.) Patent 4581815 refines this process further.

     

The transparent chip inside a vintage Hewlett-Packard floppy drive

20 December 2023 at 06:33

While repairing an eight-inch HP floppy drive, we found that the problem was a broken interface chip. Since the chip was bad, I decapped it and took photos. This chip is very unusual: instead of a silicon substrate, the chip is formed on a base of sapphire, with silicon and metal wiring on top. As a result, the chip is transparent as you can see from the gold "X" visible through the die in the photo below.

The PHI die as seen through an inspection microscope. Click this image (or any other) for a larger version.

The PHI die as seen through an inspection microscope. Click this image (or any other) for a larger version.

The chip is a custom HP chip from 1977 that provides an interface between HP's interface bus (HP-IB) and the Z80 processor in the floppy drive controller. HP designed this interface bus as a low-cost bus to connect test equipment, computers, and peripherals. The chip, named PHI (Processor-to-HP-IB Interface), was used in multiple HP products. It handles the bus protocol and buffered data between the interface bus and a device's microprocessor.1 In this article, I'll take a look inside this "silicon-on-sapphire" chip, examine its metal-gate CMOS circuitry, and explain how it works.

Silicon-on-sapphire

Most integrated circuits are formed on a silicon wafer. Silicon-on-sapphire, on the other hand, starts with a sapphire substrate. A thin layer of silicon is built up on the sapphire substrate to form the circuitry. The silicon is N-type, and is converted to P-type where needed by ion implantation. A metal wiring layer is created on top, forming the wiring as well as the metal-gate transistors. The diagram below shows a cross-section of the circuitry.

Cross-section from HP Journal, April 1977.

Cross-section from HP Journal, April 1977.

The important thing about silicon-on-sapphire is that silicon regions are separated from each other. Since the sapphire substrate is an insulator, transistors are completely isolated, unlike a regular integrated circuit. This reduces the capacitance between transistors, improving performance. The insulation also prevents stray currents, protecting against latch-up and radiation.

An HP MC2 die, illuminated from behind with fiber optics. From Hewlett-Packard Journal, April 1977.

An HP MC2 die, illuminated from behind with fiber optics. From Hewlett-Packard Journal, April 1977.

Silicon-on-sapphire integrated circuits date back to research in 1963 at Autonetics, an innovative but now-forgotten avionics company that produced guidance computers for the Minuteman ICBMs, among other things. RCA developed silicon-on-sapphire integrated circuits in the 1960s and 1970s such as the CDP1821 silicon-on-sapphire 1K RAM. HP used silicon-on-sapphire for multiple chips starting in 1977, such as the MC2 Micro-CPU Chip. HP also used SOS for the three-chip CPU in the HP 3000 Amigo (1978), but the system was a commercial failure. The popularity of silicon-on-sapphire peaked in the early 1980s and HP moved to bulk silicon integrated circuits for calculators such as the HP-41C. Silicon-on-sapphire is still used in various products, such as LEDs and RF applications, but is now mostly a niche technology.

Inside the PHI chip

HP used an unusual package for the PHI chip. The chip is mounted on a ceramic substrate, protected by a ceramic cap. The package has 48 gold fingers that press into a socket. The chip is held into the socket by two metal spring clips.

Package of the PHI chip, showing the underside. The package is flipped over when mounted in a socket.

Package of the PHI chip, showing the underside. The package is flipped over when mounted in a socket.

Decapping the chip was straightforward, but more dramatic than I expected. The chip's cap is attached with adhesive, which can be softened by heating. Hot air wasn't sufficient, so we used a hot plate. Eric tested the adhesive by poking it with an X-Acto knife, causing the cap to suddenly fly off with a loud "pop", sending the blade flying through the air. I was happy to be wearing safety glasses.

Decapping the chip with a hotplate and hot air.

Decapping the chip with a hotplate and hot air.

After decapping the chip, I created the high-resolution die photo below. The metal layer is clearly visible as white lines, while the silicon is grayish and the sapphire appears purple. Around the edge of the die, bond wires connect the chip's 48 external connections to the die. Slightly upper left of center, a large regular rectangular block of circuitry provides 160 bits of storage: this is two 8-word FIFO buffers, passing 10-bit words between the interface bus and a connected microprocessor. The thick metal traces around the edges provide +12 volts, +5 volts, and ground to the chip.

Die photo of the PHI chip, created by stitching together microscope photos. Click for a much larger image.

Die photo of the PHI chip, created by stitching together microscope photos. Click for a much larger image.

Logic gates

Circuitry on this chip has an unusual appearance due to the silicon-on-sapphire implementation as well as the use of metal-gate transistors, but fundamentally the circuits are standard CMOS. The photo below shows a block that implements an inverter and a NAND gate. The sapphire substrate appears as dark purple. On top of this, the thick gray lines are the silicon. The white metal on top connects the transistors. The metal can also form the gates of transistors when it crosses silicon (indicated by letters). Inconveniently, metal that contacts silicon, metal that crosses over silicon, and metal that forms a transistor all appear very similar in this chip. This makes it more difficult to determine the wiring.

This diagram shows an inverter and a NAND gate on the die.

This diagram shows an inverter and a NAND gate on the die.

The schematic below shows how the gates are implemented, matching the photo above. The metal lines at the top and bottom provide the power and ground rails respectively. The inverter is formed from NMOS transistor A and PMOS transistor B; the output goes to transistors D and F. The NAND gate is formed by NMOS transistors E and F in conjunction with PMOS transistors C and D. The components of the NAND gate are joined at the square of metal, and then the output leaves through silicon on the right. Note that signals can only cross when one signal is in the silicon layer and one is in the metal layer. With only two wiring layers, signals in the PHI chip must often meander to avoid crossings, wasting a lot of space. (This wiring is much more constrained than typical chips of the 1970s that also had a polysilicon layer, providing three wiring layers in total.)

This schematic shows how the inverter and a NAND gate are implemented.

This schematic shows how the inverter and a NAND gate are implemented.

The FIFOs

The PHI chip has two first-in-first-out buffers (FIFOs) that occupy a substantial part of the die. Each FIFO holds 8 words of 10 bits, with one FIFO for data being read from the bus and the other for data written to the bus. These buffers help match the bus speed to the microprocessor speed, ensuring that data transmission is as fast as possible.

Each bit of the FIFO is essentially a static RAM cell, as shown below. Inverters A and B form a loop to hold a bit. Pass transistor C provides feedback so the inverter loop remains stable. To write a word, 10 bits are fed through vertical bit-in lines. A horizontal word write signal is activated to select the word to update. This disables transistor C and turns on transistor D, allowing the new bit to flow into the inverter loop. To read a word, a horizontal word read line is activated, turning on pass transistor F. This allows the bit in the cell to flow onto the vertical bit-out line, buffered by inverter E. The two FIFOs have separate lines so they can be read and written independently.

One cell of the FIFO.

One cell of the FIFO.

The diagram below shows nine FIFO cells as they appear on the die. The red box indicates one cell, with components labeled to match the schematic. Cells are mirrored vertically and horizontally to increase the layout density.

Nine FIFO cells as they appear on the die.

Nine FIFO cells as they appear on the die.

Control logic (not shown) to the left and right of the FIFOs manages the FIFOs. This logic generates the appropriate read and write signals so data is written to one end of the FIFO and read from the other end.

The address decoder

Another interesting circuit is the decoder that selects a particular register based on the address lines. The PHI chip has eight registers, selected by three address lines. The decoder takes the address lines and generates 16 control lines (more or less), one to read from each register, and one to write to each register.

A die photo of the address decoder.

A die photo of the address decoder.

The decoder has a regular matrix structure for efficient implementation. Row lines are in pairs, with a line for each address bit input and its complement. Each column corresponds to one output, with the transistors arranged so the column will be activated when given the appropriate inputs. At the top and bottom are inverters. These latch the incoming address bits, generate the complements, and buffer the outputs.

Schematic of the decoder.

Schematic of the decoder.

The schematic above shows how the decoder operates. (I've simplified it to two inputs and two outputs.) At the top, the address line goes through a latch formed from two inverters and a pass transistor. The address line and its complement form two row lines; the other row lines are similar. Each column has a transistor on one row line and a diode on the other, selecting the address for that column. For instance, supposed a0 is 1 and an is 0. This matches the first column since the transistor lines are low and the diode lines are high. The PMOS transistors in the column will all turn on, pulling the input to the inverter high. However, if any of the inputs are "wrong", the corresponding transistor will turn off, blocking the +12 volts. Moreover, the output will be pulled low through the corresponding diode. Thus, each column will be pulled high only if all the inputs match, and otherwise it will be pulled low. Each column output controls one of the chip's registers, allowing that register to be accessed.

The HP-IB bus and the PHI chip

The Hewlett-Packard Interface Bus (HP-IB) was designed in the early 1970s as a low-cost bus for connecting diverse devices including instrument systems (such as a digital voltmeter or frequency counter), storage, and computers. This bus became an IEEE standard in 1975, known as the IEEE-488 bus.2 The bus is 8-bits parallel, with handshaking between devices so slow devices can control the speed.

In 1977, HP Developed a chip, known as PHI (Processor to HP-IB Interface) to implement the bus protocol and provide a microprocessor interface. This chip not only simplified construction of a bus controller but also ensured that devices implemented the protocol consistently. The block diagram below shows the components of the PHI chip. It's not an especially complex chip, but it isn't trivial either. I estimate that it has several thousand transistors.

Block diagram from HP Journal, July 1989.

Block diagram from HP Journal, July 1989.

The die photo below shows some of the functional blocks of the PHI chip. The microprocessor connected to the top pins, while the interface bus connected to the lower pins.

The PHI die with some functional blocks labeled,

The PHI die with some functional blocks labeled,

Conclusions

Top of the PHI chip, with the 1AA6-6004 part number. I'm not sure what the oval stamp at the top is, maybe a turtle?

Top of the PHI chip, with the 1AA6-6004 part number. I'm not sure what the oval stamp at the top is, maybe a turtle?

The PHI chip is interesting as an example of a "technology of the future" that didn't quite pan out. HP put a lot of effort into silicon-on-sapphire chips, expecting that this would become an important technology: dense, fast, and low power. However, regular silicon chips turned out to be the winning technology and silicon-on-sapphire was relegated to niche markets.

Comparing HP's silicon-on-sapphire chips to regular silicon chips at the time shows some advantages and disadvantages. HP's MC2 16-bit processor (1977) used silicon-on-sapphire technology and had 10,000 transistors and ran at 8 megahertz, using 350 mW. In comparison, the Intel 8086 (1978) was also a 16-bit processor, but implemented on regular silicon and using NMOS instead of CMOS. The 8086 had 29,000 transistors, ran at 5 megahertz (at first) and used up to 2.5 watts. The sizes of the chips were almost identical: 34 mm2 for the HP processor and 33 mm2 for the Intel processor. This illustrates that CMOS uses much less power than NMOS, one of the reasons that CMOS is now the dominant technology. For the other factors, silicon-on-sapphire had a bit of a speed advantage but wasn't as dense. Silicon-on-sapphire's main problem was its low yield and high cost. Crystal incompatibilities between silicon and sapphire made manufacturing difficult; HP achieved a yield of 9%, meaning 91% of the dies failed.

The time period of the PHI chip is also interesting since interface buses were transitioning from straightforward buses to high-performance buses with complex protocols. Early buses could be implemented with simple integrated circuits, but as protocols became more complex, custom interface chips became necessary. (The MOS 6522 Versatile Interface Adapter chip (1977) is another example, used in many home computers of the 1980s.) But these interfaces were still simple enough that the interface chips didn't require microcontrollers, using simple state machines instead.

The HP logo on the die of the PHI chip.

The HP logo on the die of the PHI chip.

For more, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space. Thanks to CuriousMarc for providing the chip and to TubeTimeUS for help with decapping.

Notes and references

  1. More information: The article What is Silicon-on-Sapphire discusses the history and construction. Details on the HP-IB bus are here. The HP 12009A HP-IB Interface Reference Manual has information on the PHI chip and the protocol. See also the PHI article from HP Journal, July 1989. EvilMonkeyDesignz also shows a decapped PHI chip. 

  2. People with Commodore PET computers may recognize the IEEE-488 bus since peripherals such as floppy disk drives were connected using the IEEE-488 bus. The cables were generally expensive and harder to obtain than interface cables used by other computers. The devices were also slow compared to other computers, although I think this was due to the hardware, not the bus. 

Reverse engineering the Intel 386 processor's register cell

9 November 2023 at 16:52

The groundbreaking Intel 386 processor (1985) was the first 32-bit processor in the x86 line. It has numerous internal registers: general-purpose registers, index registers, segment selectors, and more specialized registers. In this blog post, I look at the silicon die of the 386 and explain how some of these registers are implemented at the transistor level. The registers that I examined are implemented as static RAM, with each bit stored in a common 8-transistor circuit, known as "8T". Studying this circuit shows the interesting layout techniques that Intel used to squeeze two storage cells together to minimize the space they require.

The diagram below shows the internal structure of the 386. I have marked the relevant registers with three red boxes. Two sets of registers are in the segment descriptor cache, presumably holding cache entries, and one set is at the bottom of the data path. Some of the registers at the bottom are 32 bits wide, while others are half as wide and hold 16 bits. (More registers with different circuits, but I won't discuss them in this post.)

The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici.

The 386 with the main functional blocks labeled. Click this image (or any other) for a larger version. I created this image using a die photo from Antoine Bercovici.

The 6T and 8T static RAM cells

First, I'll explain how a 6T or 8T static cell holds a bit. The basic idea behind a static RAM cell is to connect two inverters into a loop. This circuit will be stable, with one inverter on and one inverter off, and each inverter supporting the other. Depending on which inverter is on, the circuit stores a 0 or a 1.

Two inverters in a loop can store a 0 or a 1.

Two inverters in a loop can store a 0 or a 1.

To write a new value into the circuit, two signals are fed in, forcing the inverters to the desired new values. One inverter receives the new bit value, while the other inverter receives the complemented bit value. This may seem like a brute-force way to update the bit, but it works. The trick is that the inverters in the cell are small and weak, while the input signals are higher current, able to overpower the inverters.1 The write data lines (called bitlines) are connected to the inverters by pass transistors.2 When the pass transistors are on, the signals on the write lines can pass through to the inverters. But when the pass transistors are off, the inverters are isolated from the write lines. Thus, the write control signal enables writing a new value to the inverters. (This signal is called a wordline since it controls access to a word of storage.) Since each inverter consists of two transistors7, the circuit below consists of six transistors, forming the 6T storage cell.

Adding pass transistor so the cell can be written.

Adding pass transistor so the cell can be written.

The 6T cell uses the same bitlines for reading and writing. Adding two transistors creates the 8T circuit, which has the advantage that you can read one register and write to another register at the same time. (I.e. the register file is two-ported.) In the 8T cell below, two additional transistors (G and H) are used for reading. Transistor G buffers the cell's value; it turns on if the inverter output is high, pulling the read output bitline low.3 Transistor H is a pass transistor that blocks this signal until a read is performed on this register; it is controlled by a read wordline.

Schematic of a storage cell. Each transistor is labeled with a letter.

Schematic of a storage cell. Each transistor is labeled with a letter.

To form registers (or memory), a grid is constructed from these cells. Each row corresponds to a register, while each column corresponds to a bit position. The horizontal lines are the wordlines, selecting which word to access, while the vertical lines are the bitlines, passing bits in or out of the registers. For a write, the vertical bitlines provide the 32 bits (along with their complements). For a read, the vertical bitlines receive the 32 bits from the register. A wordline is activated to read or write the selected register.

Static memory cells (8T) organized into a grid.

Static memory cells (8T) organized into a grid.

Silicon circuits in the 386

Before showing the layout of the circuit on the die, I should give a bit of background on the technology used to construct the 386. The 386 was built with CMOS technology, with NMOS and PMOS transistors working together, an advance over the earlier x86 chips that were built with NMOS transistors. Intel called this CMOS technology CHMOS-III (complementary high-performance metal-oxide-silicon), with 1.5 µm features. While Intel's earlier chips had a single metal layer, CHMOS-III provided two metal layers, making signal routing much easier.

Because CMOS uses both NMOS and PMOS transistors, fabrication is more complicated. In an MOS integrated circuit, a transistor is formed where a polysilicon wire crosses active silicon, creating the transistor's gate. A PMOS transistor is constructed directly on the silicon substrate (which is N-doped). However, an NMOS transistor is the opposite, requiring a P-doped substrate. This is created by forming a P well, a region of P-doped silicon that holds NMOS transistors. Each P well must be connected to ground; this is accomplished by connecting ground to specially-doped regions of the P well, called "well taps"`.

The diagram below shows a cross-section through two transistors, showing the layers of the chip. There are four important layers: silicon (which has some regions doped to form active silicon), polysilicon for wiring and transistors, and the two metal layers. At the bottom is the silicon, with P or N doping; note the P-well for the NMOS transistor on the left. Next is the polysilicon layer. At the top are the two layers of metal, named M1 and M2. Conceptually, the chip is constructed from flat layers, but the layers have a three-dimensional structure influenced by the layers below. The layers are separated by silicon dioxide ("ox") or silicon oxynitride4; the oxynitride under M2 caused me considerable difficulty.

A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology.

A cross-section of circuitry formed with the CHMOS-III process. From A double layer metal CHMOS III technology.

The image below shows how circuitry appears on the die;5 I removed the metal layers to show the silicon and polysilicon that form transistors. (As will be described below, this image shows two static cells, holding two bits.) The pinkish and dark regions are active silicon, doped to take part in the circuits, while the "background" silicon can be ignored. The green lines are polysilicon lines on top of the silicon. Transistors are the most important feature here: a transistor gate is formed when polysilicon crosses active silicon, with the source and drain on either side. The upper part of the image has PMOS transistors, while the lower part of the image has the P well that holds NMOS transistors. (The well itself is not visible.) In total, the image shows four PMOS transistors and 12 NMOS transistors. At the bottom, the well taps connect the P well to ground. Although the metal has been removed, the contacts between the lower metal layer (M1) and the silicon or polysilicon are visible as faint circles.

A (heavily edited) closeup of the die.

A (heavily edited) closeup of the die.

Register layout in the 386

Next, I'll explain the layout of these cells in the 386. To increase the circuit density, two cells are put side-by-side, with a mirrored layout. In this way, each row holds two interleaved registers.6 The schematic below shows the arrangement of the paired cells, matching the die image above. Transistors A and B form the first inverter,7 while transistors C and D form the second inverter. Pass transistors E and F allow the bitlines to write the cell. For reading, transistor G amplifies the signal while pass transistor H connects the selected bit to the output.

Schematic of two static cells in the 386. The schematic approximately matches the physical layout.

Schematic of two static cells in the 386. The schematic approximately matches the physical layout.

The left and right sides are approximately mirror images, with separate read and write control lines for each half. Because the control lines for the left and right sides are in different positions, the two sides have some layout differences, in particular, the bulging loop on the right. Mirroring the cells increases the density since the bitlines can be shared by the cells.

The diagram below shows the various components on the die, labeled to match the schematic above. I've drawn the lower M1 metal wiring in blue, but omitted the M2 wiring (horizontal control lines, power, and ground). "Read crossover" indicates the connection from the read output on the left to the bitline on the right. Black circles indicate vias between M1 and M2, green circles indicate contacts between silicon and M1, and reddish circles indicate contacts between polysilicon and M1.

The layout of two static cells. The M1 metal layer is drawn in blue; the horizontal M2 lines are not shown.

The layout of two static cells. The M1 metal layer is drawn in blue; the horizontal M2 lines are not shown.

One more complication is that alternating registers (i.e. rows) are reflected vertically, as shown below. This allows one horizontal power line to feed two rows, and similarly for a horizontal ground line. This cuts the number of power/ground lines in half, making the layout more efficient.

Multiple storage cells.

Multiple storage cells.

Having two layers of metal makes the circuitry considerably more difficult to reverse engineer. The photo below (left) shows one of the static RAM cells as it appears under the microscope. Although the structure of the metal layers is visible in the photograph, there is a lot of ambiguity. It is difficult to distinguish the two layers of metal. Moreover, the metal completely hides the polysilicon layer, not to mention the underlying silicon. The large black circles are vias between the two metal layers. The smaller faint circles are contacts between a metal layer and the underlying silicon or polysilicon.

One cell as it appears on the die, with a diagram of the upper (M2) and lower (M1) metal layers.

One cell as it appears on the die, with a diagram of the upper (M2) and lower (M1) metal layers.

With some effort, I determined the metal layers, which I show on the right: M2 (upper) and M1 (lower). By comparing the left and right images, you can see how the structure of the metal layers is somewhat visible. I use black circles to indicate vias between the layers, green circles indicate contacts between M1 and silicon, and pink circles indicate contacts between M1 and polysilicon. Note that both metal layers are packed as tightly as possible. The layout of this circuit was highly optimized to minimize the area. It is interesting to note that decreasing the size of the transistors wouldn't help with this circuit, since the size is limited by the metal density. This illustrates that a fabrication process must balance the size of the metal features, polysilicon features, and silicon features since over-optimizing one won't help the overall chip density.

The photo below shows the bottom of the register file. The "notch" makes the registers at the very bottom half-width: 4 half-width rows corresponding to eight 16-bit registers. Since there are six 16-bit segment registers in the 386, I suspect these are the segment registers and two mystery registers.

The bottom of the register file.

The bottom of the register file.

I haven't been able to determine which registers in the 386 correspond to the other registers on the die. In the segment descriptor circuitry, there are two rows of register cells with ten more rows below, corresponding to 24 32-bit registers. These are presumably segment descriptors. At the bottom of the datapath, there are 10 32-bit registers with the T8 circuit. The 386's programmer-visible registers consist of eight general-purpose 32-bit registers (EAX, etc.). The 386 has various control registers, test registers, and segmentation registers8 that are not well known. The 8086 has a few registers for internal use that aren't visible to the programmer, so the 386 presumably has even more invisible registers. At this point, I can't narrow down the functionality.

Conclusions

It's interesting to examine how registers are implemented in a real processor. There are plenty of descriptions of the 8T static cell circuit, but it turns out that the physical implementation is more complicated than the theoretical description. Intel put a lot of effort into optimizing this circuit, resulting in a dense block of circuitry. By mirroring cells horizontally and vertically, the density could be increased further.

Reverse engineering one small circuit of the 386 turned out to be pretty tricky, so I don't plan to do a complete reverse engineering. The main difficulty is the two layers of metal are hard to untangle. Moreover, I lost most of the polysilicon when removing the metal. Finally, it is hard to draw diagrams with four layers without the diagram turning into a mess, but hopefully the diagrams made sense.

I plan to write more about the 386, so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space.

Notes and references

  1. Typically the write driver circuit generates a strong low on one of the bitlines, flipping the corresponding inverter to a high output. As soon as one inverter flips, it will force the other inverter into the right state. To support this, the pullup transistors in the inverters are weaker than normal. 

  2. The pass transistor passes its signal through or blocks it. In CMOS, this is usually implemented with a transmission gate with an NMOS and a PMOS transistor in parallel. The cell uses only the NMOS transistor, which makes it worse at passing a high signal, but substantially reduces the size, a reasonable tradeoff for a storage cell. 

  3. The bitline is typically precharged to a high level for a read, and then the cell pulls the line low for a 0. This is more compact than including circuitry in each cell to pull the line high. 

  4. One problem is that the 386 uses a layer of insulating silicon oxynitride as well as the usual silicon dioxide. I was able to remove the oxynitride with boiling phosphoric acid, but this removed most of the polysilicon as well. I'm still experimenting with the timing; 20 minutes of boiling was too long. 

  5. The image is an edited composite of multiple cells since the polysilicon was highly damaged when removing the metal layers. Unfortunately, I haven't found a process for the 386 to remove one layer of metal at a time. As a result, reverse-engineering the 386 is much more difficult than earlier processors such as the 8086; I have to look for faint traces of polysilicon and puzzle over what connections the circuit requires. 

  6. You might wonder why they put two cells side-by-side instead of simply cramming the cells together more tightly. The reason for putting two cells in each row is presumably to match the size of each bit with the rest of the circuitry in the datapath. If the register circuitry is half the width of the ALU circuitry, a bunch of space will be wasted by the wiring to line up each register bit with the corresponding ALU bit. 

  7. A CMOS inverter is constructed from an NMOS transistor (which pulls the output low on a 1 input) and a PMOS transistor (which pulls the output high on a 0 input), as shown below.

    A CMOS inverter.

    A CMOS inverter.

     

  8. The 386 has multiple registers that are documented but not well known. Chapter 4 of the 386 Programmers Reference Manual discusses various registers that are only relevant to operating systems programmers. These include the Global Descriptor Table Register (GDTR), Local Descriptor Table Register (LDTR), Interrupt Descriptor Table Register (IDTR), and Task Register (TR). There are four Control Registers CR0-CR3; CR0 controls coprocessor usage, paging, and a few other things. The six Debug Registers for hardware breakpoints are named DR0-DR3, DR6, and DR7 (which suggests undocumented DR4 and DR5 registers). The two Test Registers for TLB testing are named TR6 and TR7 (which suggests undocumented TR0-TR5 registers). I expect that these registers are located near the relevant functional units, rather than part of the processing datapath. 

Reverse-engineering Ethernet backoff on the Intel 82586 network chip's die

31 October 2023 at 15:39
Introduced in 1973, Ethernet is the predominant way of wiring computers together. Chips were soon introduced to handle the low-level aspects of Ethernet: converting data packets into bits, implementing checksums, and handling network collisions. In 1982, Intel announced the i82586 Ethernet LAN coprocessor chip, which went much further by offloading most of the data movement from the main processor to an on-chip coprocessor. Modern Ethernet networks handle a gigabit of data per second or more, but at the time, the Intel chip's support for 10 Mb/s Ethernet put it on the cutting edge. (Ethernet was surprisingly expensive, about $2000 at the time, but expected to drop under $1000 with the Intel chip.) In this blog post, I focus on a specific part of the coprocessor chip: how it handles network collisions and implements exponential backoff.

The die photo below shows the i82586 chip. This photo shows the metal layer on top of the chip, which hides the underlying polysilicon wiring and silicon transistors. Around the edge of the chip, square bond pads provide the link to the chip's 48 external pins. I have labeled the function blocks based on my reverse engineering and published descriptions. The left side of the chip is called the "receive unit" and handles the low-level networking, with circuitry for the network transmitter and receiver. The left side also contains low-level control and status registers. The right side is called the "command unit" and interfaces to memory and the main processor. The right side contains a simple processor controlled by a microinstruction ROM.1 Data is transmitted between the two halves of the chip by 16-byte FIFOs (first in, first out queues).

The die of the Intel 82586 with the main functional blocks labeled. Click this image (or any other) for a larger version.

The die of the Intel 82586 with the main functional blocks labeled. Click this image (or any other) for a larger version.

The 82586 chip is more complex than the typical Ethernet chip at the time. It was designed to improve system performance by moving most of the Ethernet processing from the main processor to the coprocessor, allowing the main processor and the coprocessor to operate in parallel. The coprocessor provides four DMA channels to move data between memory and the network without the main processor's involvement. The main processor and the coprocessor communicate through complex data structures2 in shared memory: the main processor puts control blocks in memory to tell the I/O coprocessor what to do, specifying the locations of transmit and receive buffers in memory. In response, the I/O coprocessor puts status blocks in memory. The processor onboard the 82586 chip allows the chip to handle these complex data structures in software. Meanwhile, the transmission/receive circuitry on the left side of the chip uses dedicated circuitry to handle the low-level, high-speed aspects of Ethernet.

Ethernet and collisions

A key problem with a shared network is how to prevent multiple computers from trying to send data on the network at the same time. Instead of a centralized control mechanism, Ethernet allows computers to transmit whenever they want.3 If two computers transmit at the same time, the "collision" is detected and the computers try again, hoping to avoid a collision the next time. Although this may sound inefficient, it turns out to work out remarkably well.4 To avoid a second collision, each computer waits a random amount of time before retransmitting the packet. If a collision happens again (which is likely on a busy network), an exponential backoff algorithm is used, with each computer waiting longer and longer after each collision. This automatically balances the retransmission delay to minimize collisions and maximize throughput.

I traced out a bunch of circuitry to determine how the exponential backoff logic is implemented. To summarize, exponential backoff is implemented with a 10-bit counter to provide a pseudorandom number, a 10-bit mask register to get an exponentially sized delay, and a delay counter to count down the delay. I'll discuss how these are implemented, starting with the 10-bit counter.

The 10-bit counter

A 10-bit counter may seem trivial, but it still takes up a substantial area of the chip. The straightforward way of implementing a counter is to hook up 10 latches as a "ripple counter". The counter is controlled by a clock signal that indicates that the counter should increment. The clock toggles the lowest bit of the counter. If this bit flips from 1 to 0, the next higher bit is toggled. The process is repeated from bit to bit, toggling a bit if there is a carry. The problem with this approach is that the carry "ripples" through the counter. Each bit is delayed by the lower bit, so the bits don't all flip at the same time. This limits the speed of the counter as the top bit isn't settled until the carry has propagated through the nine lower bits.

The counter in the chip uses a different approach with additional circuitry to improve performance. Each bit has logic to check if all the lower bits are ones. If so, the clock signal toggles the bit. All the bits toggle at the same time, rapidly incrementing the counter in response to the clock signals. The drawback of this approach is that it requires much more logic.

The diagram below shows how the carry logic is implemented. The circuitry is optimized to balance speed and complexity. In particular, bits are examined in groups of three, allowing some of the logic to be shared across multiple bits. For instance, instead of using a 9-input gate to examine the nine lower bits, separate gates test bits 0-2 and 3-5.

The circuitry to generate the toggle signals for each bit of the counter.

The circuitry to generate the toggle signals for each bit of the counter.

The implementation of the latches is also interesting. Each latch is implemented with dynamic logic, using the circuit's capacitance to store each bit. The input is connected to the output with two inverters. When the clock is high, the transistor turns on, connecting the inverters in a loop that holds the value. When the clock is low, the transistor turns off. However, the 0 or 1 value will still remain on the input to the first inverter, held by the charge on the transistor's gate. At this time, an input can be fed into the latch, overriding the old value.

The basic dynamic latch circuit.

The basic dynamic latch circuit.

The latch has some additional circuitry to make it useful. To toggle the latch, the output is inverted before feeding it back to the input. The toggle control signal selects the inverted output through another pass transistor. The toggle signal is only activated when the clock is low, ensuring that the circuit doesn't repeatedly toggle, oscillating out of control.

One bit of the counter.

One bit of the counter.

The image below shows how the counter circuit is implemented on the die. I have removed the metal layer to show the underlying transistors; the circles are contacts where the metal was connected to the underlying silicon. The pinkish regions are doped silicon. The pink-gray lines are polysilicon wiring. When polysilicon crosses doped silicon, it creates a transistor. The blue color swirls are not significant; they are bits of oxide remaining on the die.

The counter circuitry on the die.

The counter circuitry on the die.

The 10-bit mask register

The mask register has a particular number of low bits set, providing a mask of length 0 to 10. For instance, with 4 bits set, the mask register is 0000001111. The mask register can be updated in two ways. First, it can be set to length 1-8 with a three-bit length input.5 Second, the mask can be lengthened by one bit, for example going from 0000001111 to 0000011111 (length 4 to 5).

The mask register is implemented with dynamic latches similar to the counter, but the inputs to the latches are different. To load the mask to a particular length, each bit has logic to determine if the bit should be set based on the three-bit input. For example, bit 3 is cleared if the specified length is 0 to 3, and set otherwise. The lengthening feature is implemented by shifting the mask value to the left by one bit and inserting a 1 into the lowest bit.

The schematic below shows one bit of the mask register. At the center is a two-inverter latch as seen before. When the clock is high, it holds its value. When the clock is low, the latch can be loaded with a new value. The "shift" line causes the bit from the previous stage to be shifted in. The "load" line loads the mask bit generated from the input length. The "reset" line clears the mask. At the right is the NAND gate that applies the mask to the count and inverts the result. As will be seen below, these NAND gates are unusually large.

One stage of the mask register.

One stage of the mask register.

The logic to set a mask bit based on the length input is shown below.6 The three-bit "sel" input selects the mask length from 1 to 8 bits; note that the mask0 bit is always set while bits 8 and 9 are cleared.7 Each set of gates energizes the corresponding mask line for the appropriate inputs.

The control logic to enable mask bits based on length.

The control logic to enable mask bits based on length.

The diagram below shows the mask register on the die. I removed the metal layer to show the underlying silicon and polysilicon, so the transistors are visible. On the left are the NAND gates that combine each bit of the counter with the mask. Note that large snake-like transistors; these larger transistors provide enough current to drive the signal over the long bus to the delay counter register at the bottom of the chip. Bit 0 of the mask is always set, so it doesn't have a latch. Bits 8 and 9 of the mask are only set by shifting, not by selecting a mask length, so they don't have mask logic.8

The mask register on the die.

The mask register on the die.

The delay counter register

To generate the pseudorandom exponential backoff, the counter register and the mask register are NANDed together. This generates a number of the desired binary length, which is stored in the delay counter. Note that the NAND operation inverts the result, making it negative. Thus, as the delay counter counts up, it counts toward zero, reaching zero after the desired number of clock ticks.

The implementation of the delay counter is similar to the first counter, so I won't include a schematic. However, the delay counter is attached to the register bus, allowing its value to be read by the chip's CPU. Control lines allow the delay counter's value to pass onto the register bus.

The diagram below shows the locations of the counter, mask, and delay register on the die. In this era, something as simple as a 10-bit register occupied a significant part of the die. Also note the distance between the counter and mask and the delay register at the bottom of the chip. The NAND gates for the counter and mask required large transistors to drive the signal across this large distance.

The die, with counter, mask, and delay register.

The die, with counter, mask, and delay register.

Conclusions

The Intel Ethernet chip provides an interesting example of how a real-world circuit is implemented on a chip. Exponential backoff is a key part of the Ethernet standard. This chip implements backoff with a simple but optimized circuit.9

A high-resolution image of the die with the metal removed. (Click for a larger version.) Some of the oxide layer remains, causing colored regions due to thin-film interference.

A high-resolution image of the die with the metal removed. (Click for a larger version.) Some of the oxide layer remains, causing colored regions due to thin-film interference.

For more chip reverse engineering, follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon occasionally as @kenshirriff@oldbytes.space. Acknowledgments: Thanks to Robert Garner for providing the chip and questions.

Notes and references

  1. I think the on-chip processor is a very simple processor that doesn't match other Intel architectures. It is described as executing microcode. I don't think this is microcode in the usual sense of machine instructions being broken down into microcode. Instead, I think the processor's instructions are primitive, single-clock instructions that are more like micro-instructions than machine instructions. 

  2. The diagram below shows the data structures in shared memory for communication between the main processor and the coprocessor. The Command List specifies the commands that the coprocessor should perform. The Receive Frame area provides memory blocks for incoming network packets.

    A diagram of the 82586 shared memory structures, from the 82586 datasheet.

    A diagram of the 82586 shared memory structures, from the 82586 datasheet.

    I think Intel was inspired by mainframe-style I/O channels, which moved I/O processing to separate processors communicating through memory. Another sign of Intel's attempts to move mainframe technology to microprocessors was the ill-fated iAPX 432 processor, which Intel called a "micro-mainframe." (I discuss the iAPX 432 as part of this blog post.)

     

  3. An alternative approach to networking is token-ring, where the computers in the network pass a token from machine to machine. Only the machine with the token can send a packet on the network, ensuring collision-free transmission. I looked inside an IBM token-ring chip in this post

  4. Ethernet's technique is called CSMA/CD (Carrier Sense Multiple Access with Collision Detection). The idea of Carrier Sense is that the "carrier" signal on the network indicates that the network is in use. Each computer on the network listens for the lack of carrier before transmitting, which avoids most collisions. However, there is still a small chance of collision. (In particular, the speed of light means that there is a delay on a long network between when one computer starts transmitting and when a second computer can detect this transmission. Thus, both computers can think the network is free while the other computer is transmitting. This factor also imposes a maximum length on an Ethernet network segment: if the network is too long, a computer can finish transmitting a packet before the collision occurs, and it won't detect the collision.) Modern Ethernet has moved from the shared network to a star topology that avoids collisions. 

  5. The length of the mask is one more than the three-bit length input. E.g. An input of 7 sets eight mask bits. 

  6. The mask generation logic is a bit tricky to understand. You can try various bit combinations to see how it works. The logic is easier to understand if you apply De Morgan's law to change the NOR gates to AND gates, which also removes the negation on the inputs. 

  7. The control line appears to enable or disable mask selection but its behavior is inexplicably negated on bit 1. 

  8. The circuitry below the counter appears to be a state machine that is unrelated to the exponential backoff. From reverse engineering, my hypothesis is that the counter is reused by the state machine: it both generates pseudorandom numbers for exponential backoff and times events when a packet is being received. In particular, it has circuitry to detect when the counter reaches 9, 20, and 48, and takes actions at these values.

    The state itself is held in numerous latches. The new state is computed by a PLA (Programmable Logic Array) below and to the right of the counter along with numerous individual gates. 

  9. One drawback of this exponential backoff circuit is that the pseudorandom numbers are completely synchronous. If two network nodes happen to be in the exact same counter state when they collide, they will go through the same exponential backoff delays, causing a collision every time. While this may seem unlikely, it apparently happened occasionally during use. The LANCE Ethernet chip from AMD used a different approach. Instead of running the pseudorandom counter from the highly accurate quartz clock signal, the counter used an on-chip ring oscillator that was deliberately designed to be inaccurate. This prevented two nodes from locking into inadvertent synchronization. 

A close look at the 8086 processor's bus hold circuitry

5 August 2023 at 16:39

The Intel 8086 microprocessor (1978) revolutionized computing by founding the x86 architecture that continues to this day. One of the lesser-known features of the 8086 is the "hold" functionality, which allows an external device to temporarily take control of the system's bus. This feature was most important for supporting the 8087 math coprocessor chip, which was an option on the IBM PC; the 8087 used the bus hold so it could interact with the system without conflicting with the 8086 processor.

This blog post explains in detail how the bus hold feature is implemented in the processor's logic. (Be warned that this post is a detailed look at a somewhat obscure feature.) I've also found some apparently undocumented characteristics of the 8086's hold acknowledge circuitry, designed to make signal transition faster on the shared control lines.

The die photo below shows the main functional blocks of the 8086 processor. In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured. The 8086 is partitioned into a Bus Interface Unit (upper) that handles bus traffic, and an Execution Unit (lower) that executes instructions. The two units operate mostly independently, which will turn out to be important. The Bus Interface Unit handles read and write operations as requested by the Execution Unit. The Bus Interface Unit also prefetches instructions that the Execution Unit uses when it needs them. The hold control circuitry is highlighted in the upper right; it takes a nontrivial amount of space on the chip. The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins. I've labeled the MN/MX, HOLD, and HLDA pads; these are the relevant signals for this post.

The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.

The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.

How bus hold works

In an 8086 system, the processor communicates with memory and I/O devices over a bus consisting of address and data lines along with various control signals. For high-speed data transfer, it is useful for an I/O device to send data directly to memory, bypassing the processor; this is called DMA (Direct Memory Access). Moreover, a co-processor such as the 8087 floating point unit may need to read data from memory. The bus hold feature supports these operations: it is a mechanism for the 8086 to give up control of the bus, letting another device use the bus to communicate with memory. Specifically, an external device requests a bus hold and the 8086 stops putting electrical signals on the bus and acknowledges the bus hold. The other device can now use the bus. When the other device is done, it signals the 8086, which then resumes its regular bus activity.

Most things in the 8086 are more complicated than you might expect, and the bus hold feature is no exception, largely due to the 8086's minimum and maximum modes. The 8086 can be designed into a system in one of two ways—minimum mode and maximum mode—that redefine the meanings of the 8086's external pins. Minimum mode is designed for simple systems and gives the control pins straightforward meanings such as indicating a read versus a write. Minimum mode provides bus signals that were similar to the earlier 8080 microprocessor, making migration to the 8086 easier. On the other hand, maximum mode is designed for sophisticated, multiprocessor systems and encodes the control signals to provide richer system information.

In more detail, minimum mode is selected if the MN/MX pin is wired high, while maximum mode is selected if the MN/MX pin is wired low. Nine of the chip's pins have different meanings depending on the mode, but only two pins are relevant to this discussion. In minimum mode, pin 31 has the function HOLD, while pin 30 has the function HLDA (Hold Acknowlege). In maximum mode, pin 31 has the function RQ/GT0', while pin 30 has the function RQ/GT1'.

I'll start by explaining how a hold operation works in minimum mode. When an external device wants to use the bus, it pulls the HOLD pin high. At the end of the current bus cycle, the 8086 acknowledges the hold request by pulling HLDA high. The 8086 also puts its bus output pins into "tri-state" mode, in effect disconnecting them electrically from the bus. When the external device is done, it pulls HOLD low and the 8086 regains control of the bus. Don't worry about the details of the timing below; the key point is that a device pulls HOLD high and the 8086 responds by pulling HLDA high.

This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14.

This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14.

The 8086's maximum mode is more complex, allowing two other devices to share the bus by using a priority-based scheme. Maximum mode uses two bidirectional signals, RQ/GT0 and RQ/GT1.2 When a device wants to use the bus, it issues a pulse on one of the signal lines, pulling it low. The 8086 responds by pulsing the same line. When the device is done with the bus, it issues a third pulse to inform the 8086. The RQ/GT0 line has higher priority than RQ/GT1, so if two devices request the bus at the same time, the RQ/GT0 device wins and the RQ/GT1 device needs to wait.1 Keep in mind that the RQ/GT lines are bidirectional: the 8086 and the external device both use the same line for signaling.

This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16.

This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16.

The bus hold does not completely stop the 8086. The hold operation stops the Bus Interface Unit, but the Execution Unit will continue executing instructions until it needs to perform a read or write, or it empties the prefetch queue. Specifically, the hold signal blocks the Bus Interface Unit from starting a memory cycle and blocks an instruction prefetch from starting.

Bus sharing and the 8087 coprocessor

Probably the most common use of the bus hold feature was to support the Intel 8087 math coprocessor. The 8087 coprocessor greatly improved the performance of floating-point operations, making them up to 100 times faster. As well as floating-point arithmetic, the 8087 supported trigonometric operations, logarithms and powers. The 8087's architecture became part of later Intel processors, and the 8087's instructions are still a part of today's x86 computers.3

The 8087 had its own registers and didn't have access to the 8086's registers. Instead, the 8087 could transfer values to and from the system's main memory. Specifically, the 8087 used the RQ/GT mechanism (maximum mode) to take control of the bus if the 8087 needed to transfer operands to or from memory.4 The 8087 could be installed as an option on the original IBM PC, which is why the IBM PC used maximum mode.

The enable flip-flop

The circuit is built from six flip-flops. The flip-flops are a bit different from typical D flip-flops, so I'll discuss the flip-flop behavior before explaining the circuit.

A flip-flop can store a single bit, 0 or 1. Flip flops are very important in the 8086 because they hold information (state) in a stable way, and they synchronize the circuitry with the processor's clock. A common type of flip-flop is the D flip-flop, which takes a data input (D) and stores that value. In an edge-triggered flip-flop, this storage happens on the edge when the clock changes state from low to high.5 (Except at this transition, the input can change without affecting the output.) The output is called Q, while the inverted output is called Q-bar.

The symbol for the D flip-flop with enable.

The symbol for the D flip-flop with enable.

Many of the 8086's flip-flops, including the ones in the hold circuit, have an "enable" input. When the enable input is high, the flip-flop records the D input, but when the enable input is low, the flip-flop keeps its previous value. Thus, the enable input allows the flip-flop to hold its value for more than one clock cycle. The enable input is very important to the functioning of the hold circuit, as it is used to control when the circuit moves to the next state.

How bus hold is implemented (minimum mode)

I'll start by explaining how the hold circuitry works in minimum mode. To review, in minimum mode the external device requests a hold through the HOLD input, keeping the input high for the duration of the request. The 8086 responds by pulling the hold acknowledge HLDA output high for the duration of the hold.

In minimum mode, only three of the six flip-flops are relevant. The diagram below is highly simplified to show the essential behavior. (The full schematic is in the footnotes.6) At the left is the HOLD signal, the request from the external device.

Simplified diagram of the circuitry for minimum mode.

Simplified diagram of the circuitry for minimum mode.

When a HOLD request comes in, the first flip-flop is activated, and remains activated for the duration of the request. The second flip-flop waits if any condition is blocking the hold request: a LOCK instruction, an unaligned memory access, or so forth. When the HOLD can proceed, the second flip-flop is enabled and it latches the request. The second flip-flop controls the internal hold signal, causing the 8086 to stop further bus activity. The third flip-flop is then activated when the current bus cycle (if any) completes; when it latches the request, the hold is "official". The third flip-flop drives the external HLDA (Hold Acknowledge) pin, indicating that the bus is free. This signal also clears the bus-enabled latch (elsewhere in the 8086), putting the appropriate pins into floating tri-state mode. The key point is that the flip-flops control the timing of the internal hold and the external HLDA, moving to the next step as appropriate.

When the external device signals an end to the hold by pulling the HOLD pin low, the process reverses. The three flip-flops return to their idle state in sequence. The second flip-flop clears the internal hold signal, restarting bus activity. The third flip-flop clears the HLDA pin.7

How bus hold is implemented (maximum mode)

The implementation of maximum mode is tricky because it uses the same circuitry as minimum mode, but the behavior is different in several ways. First, minimum mode and maximum mode operate on opposite polarity: a hold is requested by pulling HOLD high in minimum mode versus pulling a request line low in maximum mode. Moreover, in minimum mode, a request on the HOLD pin triggers a response on the opposite pin (HLDA), while in maximum mode, a request and response are on the same pin. Finally, using the same pin for the request and grant signals requires the pin to act as both an input and an output, with tricky electrical properties.

In maximum mode, the top three flip-flops handle the request and grant on line 0, while the bottom three flip-flops handle line 1. At a high level, these flip-flops behave roughly the same as in the minimum mode case, with the first flip-flop tracking the hold request, the second flip-flop activated when the hold is "approved", and the third flip-flop activated when the bus cycle completes. An RQ 0 input will generate a GT 0 output, while a RQ 1 input will generate a GT 1 output. The diagram below is highly simplified, but illustrates the overall behavior. Keep in mind that RQ 0, GT 0, and HOLD use the same physical pin, as do RQ 1, GT 1, and HLDA.

Simplified diagram of the circuitry for maximum mode.

Simplified diagram of the circuitry for maximum mode.

In more detail, the first flip-flop converts the pulse request input into a steady signal. This is accomplished by configuring the first flip-flop is configured with to toggle on when the request pulse is received and toggle off when the end-of-request pulse is received.10 The toggle action is implemented by feeding the output pulse back to the input, inverted (A); since the flip-flop is enabled by the RQ input, the flip-flop holds its value until an input pulse. One tricky part is that the acknowledge pulse must not toggle the flip-flop. This is accomplished by using the output signal to block the toggle. (To keep the diagram simple, I've just noted the "block" action rather than showing the logic.)

As before, the second flip-flop is blocked until the hold is "authorized" to proceed. However, the circuitry is more complicated since it must prioritize the two request lines and ensure that only one hold is granted at a time. If RQ0's first flip-flop is active, it blocks the enable of RQ1's second flip-flop (B). Conversely, if RQ1's second flip-flop is active, it blocks the enable of RQ0's second flip-flop (C). Note the asymmetry, blocking on RQ0's first flip-flop and RQ1's second flip-flop. This enforces the priority of RQ0 over RQ1, since an RQ0 request blocks RQ1 but only an RQ1 "approval" blocks RQ0.

When the second flip-flop is activated in either path, it triggers the internal hold signal (D).8 As before, the hold request is latched into the third flip-flop when any existing memory cycle completes. When the hold request is granted, a pulse is generated (E) on the corresponding GT pin.9

The same circuitry is used for minimum mode and maximum mode, although the above diagrams show differences between the two modes. How does this work? Essentially, logic gates are used to change the behavior between minimum mode and maximum mode as required. For the most part, the circuitry works the same, so only a moderate amount of logic is required to make the same circuitry work for both. On the schematic, the signal MN is active during minimum mode, while MX is active during maximum mode, and these signals control the behavior.

The "hold ok" circuit

As usually happens with the 8086, there are a bunch of special cases when different features interact. One special case is if a bus hold request comes in while the 8086 is acknowledging an interrupt. In this case, the interrupt takes priority and the bus hold is not processed until the interrupt acknowledgment is completed. A second special case is if the bus hold occurs while the 8086 is halted. In this case, the 8086 issues a second HALT indicator at the end of the bus hold. Yet another special case is the 8086's LOCK prefix, which locks the use of the bus for the following instruction, so a bus hold request is not honored until the locked instruction has completed. Finally, the 8086 performs an unaligned word access to memory by breaking it into two 8-bit bus cycles; these two cycles can't be interrupted by a bus hold.

In more detail, the "hold ok" circuit determines at each cycle if a hold could proceed. There are several conditions under which the hold can proceed:

  • The bus cycle is `T2`, except if an unaligned bus operation is taking place (i.e. a word split into two byte operations), or
  • A memory cycle is not active and a microcode memory operation is not taking place, or
  • A memory cycle is not active and a hold is currently active.

The first case occurs during bus (memory) activity, where a hold request before cycle T2 will be handled at the end of that cycle. The second case allows a hold if the bus is inactive. But if microcode is performing a memory operation, the hold will be delayed, even if the request is just starting. The third case is opposite from the other two: it enables the flip flop so a hold request can be dropped. (This ensures that the hold request can still be dropped in the corner case where a hold starts and then the microcode makes a memory request, which will be blocked by the hold.)

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.

An instruction with the LOCK prefix causes the bus to be locked against other devices for the duration of the instruction. Thus, a hold cannot be granted while the instruction is running. This is implemented through a separate path. This logic is between the output of the first (request) flip-flop and the second (accepted) flip-flop, tied into the LOCK signal. Conceptually, it seems that the LOCK signal should block hold-ok and thus block the second (accepted) flip-flop from being enabled. But instead, the LOCK signal blocks the data path, unless the request has already been granted. I think the motivation is to allow dropping of a hold request to proceed uninterrupted. In other words, LOCK prevents a hold from being accepted, but it doesn't prevent a hold from being dropped, and it was easier to implement this in the data path.

The pin drive circuitry

The circuitry for the HOLD/RQ0/GT0 and HLDA/RQ1/GT1 pins is somewhat complicated, since they are used for both input and output. In minimum mode, the HOLD pin is an input, while the HLDA pin is an output. In maximum mode, both pins act as an input, with a low-going pulse from an external device to start or stop the hold. But the 8086 also issues pulses to grant the hold. Pull-up resistors inside the 8086 to ensure that the pins remain high (idle) when unused. Finally, an undocumented active pull-up system restores a pin to a high state after it is pulled low, providing faster response than the resistor.

The schematic below shows the heart of the tri-state output circuit. Each pin is connected to two large output MOSFETs, one to drive the pin high and one to drive the pin low. The transistors have separate control lines; if both control lines are low, both transistors are off and the pin floats in the "tri-state" condition. This permits the pin to be used as an input, driven by an external device. The pull-up resistor keeps the pin in a high state.

The tri-state output circuit for each hold pin.

The tri-state output circuit for each hold pin.

The diagram below shows how this circuitry looks on the die. In this image, the metal and polysilicon layers have been removed with acid to show the underlying doped silicon regions. The thin white stripes are transistor gates where polysilicon wiring crossed the silicon. The black circles are vias that connected the silicon to the metal on top. The empty regions at the right are where the metal pads for HOLD and HLDA were. Next to the pads are the large transistors to pull the outputs high or low. Because the outputs require much higher current than internal signals, these transistors are much larger than logic transistors. They are composed of several transistors placed in parallel, resulting in the parallel stripes. The small pullup resistors are also visible. For efficiency, these resistors are actually depletion-mode transistors, specially doped to act as constant-current sources.

The HOLD/HLDA pin circuitry on the die.

The HOLD/HLDA pin circuitry on the die.

At the left, some of the circuitry is visible. The large output transistors are driven by "superbuffers" that provide more current than regular NMOS buffers. (A superbuffer uses separate transistors to pull the signal high and low, rather than using a pull-up to pull the signal high as in a standard NMOS gate.) The small transistors are the pass transistors that gate output signals according to the clock. The thick rectangles are crossovers, allowing the vertical metal wiring (no longer visible) to cross over a horizontal signal in the signal layer. The 8086 has only a single metal layer, so the layout requires a crossover if signals will otherwise intersect. Because silicon's resistance is much higher than metal's resistance, the crossover is relatively wide to reduce the resistance.

The problem with a pull-up resistor is that it is relatively slow when connected to a line with high capacitance. You essentially end up with a resistor-capacitor delay circuit, as the resistor slowly charges the line and brings it up to a high level. To get around this, the 8086 has an active drive circuit to pulse the RQ/GT lines high to pull them back from a low level. This circuit pulls the line high one cycle after the 8086 drops it low for a grant acknowledge. This circuit also pulls the line high after the external device pulls it low.11 (The schematic for this circuit is in the footnotes.) The curious thing is that I couldn't find this behavior documented in the datasheets. The datasheets describe the internal pull-up resistor, but don't mention that the 8086 actively pulls the lines high.12

Conclusions

The hold circuitry was a key feature of the 8086, largely because it was necessary for the 8087 math coprocessor chip. The hold circuitry seems like it should be straightforward, but there are many corner cases in this circuitry: it interacts with unaligned memory accesses, the LOCK prefix, and minimum vs. maximum modes. As a result, it is fairly complicated.

Personally, I find the hold circuitry somewhat unsatisfying to study, with few fundamental concepts but a lot of special-case logic. The circuitry seems overly complicated for what it does. Much of the complexity is probably due to the wildly different behavior of the pins between minimum and maximum mode. Intel should have simply used a larger package (like the Motorola 68000) rather than re-using pins to support different modes, as well as using the same pin for a request and a grant. It's impressive, however, the same circuitry was made to work for both minimum and maximum modes, despite using completely different signals to request and grant holds. This circuitry must have been a nightmare for Intel's test engineers, trying to ensure that the circuitry performed properly when there were so many corner cases and potential race conditions.

I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. The timing of priority between RQ0 and RQ1 is left vague in the documentation. In practice, even if RQ1 is requested first, a later RQ0 can still preempt it until the point that RQ1 is internally granted (i.e. the second flip-flop is activated). This happens before the hold is externally acknowledged, so it's not obvious to the user at what point priority no longer applies. 

  2. The RQ/GT0 and RQ/GT1 signals are active-low. These signals should have an overbar to indicate this, but it makes the page formatting ugly :-) 

  3. Modern x86 processors still support the 8087 (x87) instruction set. Starting with the 80486DX, the floating point unit was included on the CPU die, rather than as an external coprocessor. The x87 instruction set used a stack-based model, which made it harder to parallelize. To mitigate this, Intel introduced SSE in 1999, a different set of floating point instructions that worked on an independent register set. The x87 instructions are now considered mostly obsolete and are deprecated in 64-bit Windows

  4. The 8087 provides another RQ/GT input line for an external device. Thus, two external devices can still be used in a system with an 8087. That is, although the 8087 uses up one of the 8086's two RQ/GT lines, the 8087 provides another one, so there are still two lines available. The 8087 has logic to combine its bus requests and external bus requests into a single RQ/GT line to the 8086. 

  5. Confusingly, some of the flip-flops in the hold circuit transistion when the clock goes high, while others use the inverted clock signal and transition when the clock goes low. Moreover, the flip-flops are inconsistent about how they treat the data. In each group of three flip-flops, the first flip-flop is active-high, while the remaining flip-flops are active-low. For the most part, I'll ignore this in the discussion. You can look at the schematic if you want to know the details. 

  6. The schematics below shows my reverse-engineered schematic for the hold circuitry. I have partitioned the schematic into the hold logic and the output driver circuitry. This split matches the physical partitioning of the circuitry on the die.

    In the first schematic, the upper part handles HOLD and request0, while the lower part handles request1. There is some circuitry in the middle to handle the common enabling and to generate the internal hold signal. I won't explain the circuitry in much detail, but there are a few things I want to point out. First, even though the hold circuit seems like it should be simple, there are a lot of gates connected in complicated ways. Second, although there are many inverters, NAND, and NOR gates, there are also complex gates such as AND-NOR, OR-NAND, AND-OR-NAND, and so forth. These are implemented as single gates. Due to how gates are constructed from NMOS transistors, it is just as easy to build a hierarchical gate as a single gate. (The last step must be inversion, though.) The XOR gates are more complex; they are constructed from a NOR gate and an AND-NOR gate.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

    The schematic below shows the output circuits for the two pins. These circuits are similar, but have a few differences because only the bottom one is used as an output (HLDA) in minimum mode. Each circuit has two inputs: what the current value of the pin is, and what the desired value of the pin is.

    Schematic of the pin output circuits.

    Schematic of the pin output circuits.

     

  7. Interestingly, the external pins aren't taken out of tri-state mode immediately when the HLDA signal is dropped. Instead, the 8086's bus drivers are re-enabled when a bus cycle starts, which is slightly later. The bus circuitry has a separate flip-flop to manage the enable/disable state, and the start of a bus cycle is what re-enables the bus. This is another example of behavior that the documentation leaves ambiguous. 

  8. There's one more complication for the hold-out signal. If a hold is granted on one line, a request comes in on the other line, and then the hold is released on the first line, the desired behavior is for the bus to remain in the hold state as the hold switches to the second line. However, because of the way a hold on line 1 blocks a hold on line 0, the GT1 second flip-flop will drop a cycle before the GT0 second flip-flop is activated. This would cause hold-out to drop for a cycle and the 8086 could start unwanted bus activity. To prevent this case, the hold-out line is also activated if there is an RQ0 request and RQ1 is granted. This condition seems a bit random but it covers the "gap". I have to wonder if Intel planned the circuit this way, or they added the extra test as a bug fix. (The asymmetry of the priority circuit causes this problem to occur only going from a hold on line 1 to line 0, not the other direction.) 

  9. The pulse-generating circuit is a bit tricky. A pulse signal is generated if the request has been accepted, has not been granted, and will be granted on the next clock (i.e. no memory request is active so the flip-flop is enabled). (Note that the pulse will be one cycle wide, since granting the request on the next cycle will cause the conditions to be no longer satisfied.) This provides the pulse one clock cycle before the flip-flop makes it "official". Moreover, the signals come from the inverted Q outputs from the flip-flops, which are updated half a clock cycle earlier. The result is that the pulse is generated 1.5 clock cycles before the flip-flop output. Presumably the point of this is to respond to hold requests faster, but it seems overly complicated. 

  10. The request pulse is required to be one clock cycle wide. The feedback loop shows why: if the request is longer than one clock cycle, the first flip-flop will repeatedly toggle on and off, resulting in unexpected behavior. 

  11. The details of the active pull-up circuitry don't make sense to me. First it XORs the state of the pin with the desired state of the pin and uses this signal to control a multiplexer, which generates the pull-up action based on other gates. The result of all this ends up being simply NAND, implemented with excessive gates. Another issue is that minimum mode blocks the active pull-up, which makes sense. But there are additional logic gates so minimum mode can affect the value going into the multiplexer, which gets blocked in minimum mode, so that logic seems wasted. There are also two separate circuits to block pull-up during reset. My suspicion is that the original logic accumulated bug fixes and redundant logic wasn't removed. But it's possible that the implementation is doing something clever that I'm just missing. 

  12. My analysis of the RQ/GT lines being pulled high is based on simulation. It would be interesting to verify this behavior on a physical 8086 chip. By measuring the current out of the pin, the pull-up pulses should be visible. 

The complex history of the Intel i960 RISC processor

1 July 2023 at 16:32

The Intel i960 was a remarkable 32-bit processor of the 1990s with a confusing set of versions. Although it is now mostly forgotten (outside the many people who used it as an embedded processor), it has a complex history. It had a shot at being Intel's flagship processor until x86 overshadowed it. Later, it was the world's best-selling RISC processor. One variant was a 33-bit processor with a decidedly non-RISC object-oriented instruction set; it became a military standard and was used in the F-22 fighter plane. Another version powered Intel's short-lived Unix servers. In this blog post, I'll take a look at the history of the i960, explain its different variants, and examine silicon dies. This chip has a lot of mythology and confusion (especially on Wikipedia), so I'll try to clear things up.

Roots: the iAPX 432

"Intel 432": Cover detail from Introduction to the iAPX 432 Architecture.

"Intel 432": Cover detail from Introduction to the iAPX 432 Architecture.

The ancestry of the i960 starts in 1975, when Intel set out to design a "micro-mainframe", a revolutionary processor that would bring the power of mainframe computers to microprocessors. This project, eventually called the iAPX 432, was a huge leap in features and complexity. Intel had just released the popular 8080 processor in 1974, an 8-bit processor that kicked off the hobbyist computer era with computers such as the Altair and IMSAI. However, 8-bit microprocessors were toys compared to 16-bit minicomputers like the PDP-11, let alone mainframes like the 32-bit IBM System/370. Most companies were gradually taking minicomputer and mainframe features and putting them into microprocessors, but Intel wanted to leapfrog to a mainframe-class 32-bit processor. The processor would make programmers much more productive by bridging the "semantic gap" between high-level languages and simple processors, implementing many features directly into the processor.

The 432 processor included memory management, process management, and interprocess communication. These features were traditionally part of the operating system, but Intel built them in the processor, calling this the "Silicon Operating System". The processor was also one of the first to implement the new IEEE 754 floating-point standard, still in use by most processors. The 432 also had support for fault tolerance and multi-processor systems. One of the most unusual features of the 432 was that instructions weren't byte aligned. Instead, instructions were between 6 and 321 bits long, and you could jump into the middle of a byte. Another unusual feature was that the 432 was a stack-based machine, pushing and popping values on an in-memory stack, rather than using general-purpose registers.

The 432 provided hardware support for object-oriented programming, built around an unforgeable object pointer called an Access Descriptor. Almost every structure in a 432 program and in the system itself is a separate object. The processor provided fine-grain security and access control by checking every object access to ensure that the user had permission and was not exceeding the bounds of the object. This made buffer overruns and related classes of bugs impossible, unlike modern processors.

This photo from the Intel 1981 annual report shows Intel's 432-based development computer and three of the engineers.

This photo from the Intel 1981 annual report shows Intel's 432-based development computer and three of the engineers.

The new, object-oriented Ada language was the primary programming language for the 432. The US Department of Defense developed the Ada language in the late 1970s and early 1980s to provide a common language for embedded systems, using the latest ideas from object-oriented programming. Proponents expected Ada to become the dominant computer language for the 1980s and beyond. In 1979, Intel realized that Ada was a good target for the iAPX 432, since they had similar object and task models. Intel decided to "establish itself as an early center of Ada technology by using the language as the primary development and application language for the new iAPX 432 architecture." The iAPX 432's operating system (iMAX 432) and other software were written in Ada, using one of the first Ada compilers.

Unfortunately, iAPX 432 project was way too ambitious for its time. After a couple of years of slow progress, Intel realized that they needed a stopgap processor to counter competitors such as Zilog and Motorola. Intel quickly designed a 16-bit processor that they could sell until the 432 was ready. This processor was the Intel 8086 (1978), which lives on in the x86 architecture used by most computers today. Critically, the importance of the 8086 was not recognized at the time. In 1981, IBM selected Intel's 8088 processor (a version of the 8086 with an 8-bit bus) for the IBM PC. In time, the success of the IBM PC and compatible systems led to Intel's dominance of the microprocessor market, but in 1981 Intel viewed the IBM PC as just another design win. As Intel VP Bill Davidow later said, "We knew it was an important win. We didn't realize it was the only win."

Caption: IBM chose Intel's high performance 8088 microprocessor as the central processing unit for the IBM Personal Computer, introduced in 1981. Seven Intel peripheral components are also integrated into the IBM Personal Computer.  From Intel's 1981 annual report.

Caption: IBM chose Intel's high performance 8088 microprocessor as the central processing unit for the IBM Personal Computer, introduced in 1981. Seven Intel peripheral components are also integrated into the IBM Personal Computer. From Intel's 1981 annual report.

Intel finally released the iAPX 432 in 1981. Intel's 1981 annual report shows the importance of the 432 to Intel. A section titled "The Micromainframe™ Arrives" enthusiastically described the iAPX 432 and how it would "open the door to applications not previously feasible". To Intel's surprise, the iAPX 432 ended up as "one of the great disaster stories of modern computing" as the New York Times put it. The processor was so complicated that it was split across two very large chips:1 one to decode instructions and a second to execute them Delivered years behind schedule, the micro-mainframe's performance was dismal, much worse than competitors and even the stopgap 8086.2 Sales were minimal and the 432 quietly dropped out of sight.

My die photos of the two chips that make up the iAPX 432 General Data Processor. Click for a larger version.

My die photos of the two chips that make up the iAPX 432 General Data Processor. Click for a larger version.

Intel picks a 32-bit architecture (or two, or three)

In 1982, Intel still didn't realize the importance of the x86 architecture. The follow-on 186 and 286 processors were released but without much success at first.3 Intel was working on the 386, a 32-bit successor to the 286, but their main customer IBM was very unenthusiastic.4 Support for the 386 was so weak that the 386 team worried that the project might be dead.5 Meanwhile, the 432 team continued their work. Intel also had a third processor design in the works, a 32-bit VAX-like processor codenamed P4.6

Intel recognized that developing three unrelated 32-bit processors was impractical and formed a task force to develop a Single High-End Architecture (SHEA). The task force didn't achieve a single architecture, but they decided to merge the 432 and the P4 into a processor codenamed the P7, which would become the i960. They also decided to continue the 386 project. (Ironically, in 1986, Intel started yet another 32-bit processor, the unrelated i860, bringing the number of 32-bit architectures back to three.)

At the time, the 386 team felt that they were treated as the "stepchild" while the P7 project was the focus of Intel's attention. This would change as the sales of x86-based personal computers climbed and money poured into Intel. The 386 team would soon transform from stepchild to king.5

The first release of the i960 processor

Meanwhile, the 1980 paper The case for the Reduced Instruction Set Computer proposed a revolutionary new approach for computer architecture: building Reduced Instruction Set Computers (RISC) instead of Complex Instruction Set Computers (CISC). The paper argued that the trend toward increasing complexity was doing more harm than good. Instead, since "every transistor is precious" on a VLSI chip, the instruction set should be simplified, only adding features that quantitatively improved performance.

The RISC approach became very popular in the 1980s. Processors that followed the RISC philosophy generally converged on an approach with 32-bit easy-to-decode instructions, a load-store architecture (separating computation instructions from instructions that accessed memory), straightforward instructions that executed in one clock cycle, and implementing instructions directly rather than through microcode.

The P7 project combined the RISC philosophy and the ideas from the 432 to create Intel's first RISC chip, originally called the 809607 and later the i960. The chip, announced in 1988, was significant enough for coverage in the New York Times. Analysts said that the chip was marketed as an embedded controller to avoid stealing sales from the 80386. However, Intel's claimed motivation was the size of the embedded market; Intel chip designer Steve McGeady said at the time, "I'd rather put an 80960 in every antiskid braking system than in every Sun workstation.” Nonetheless, Intel also used the i960 as a workstation processor, as will be described in the next section.

The block diagram below shows the microarchitecture of the original i960 processors. The microarchitecture of the i960 followed most (but not all) of the common RISC design: a large register set, mostly one-cycle instructions, a load/store architecture, simple instruction formats, and a pipelined architecture. The Local Register Cache contains four sets of the 16 local registers. These "register windows" allow the registers to be switched during function calls without the delay of saving registers to the stack. The micro-instruction ROM and sequencer hold microcode for complex instructions; microcode is highly unusual for a RISC processor. The chip's Floating Point Unit8 and Memory Management Unit are advanced features for the time.

The microarchitecture of the i960 XA. FPU is Floating Point Unit. IEU is Instruction Execution Unit. MMU is Memory Management Unit. From the 80960 datasheet.

The microarchitecture of the i960 XA. FPU is Floating Point Unit. IEU is Instruction Execution Unit. MMU is Memory Management Unit. From the 80960 datasheet.

It's interesting to compare the i960 to the 432: the programmer-visible architectures are completely different, while the instruction sets are almost identical.9 Architecturally, the 432 is a stack-based machine with no registers, while the i960 is a load-store machine with many registers. Moreover, the 432 had complex variable-length instructions, while the i960 uses simple fixed-length load-store instructions. At the low level, the instructions are different due to the extreme architectural differences between the processors, but otherwise, the instructions are remarkably similar, modulo some name changes.

The key to understanding the i960 family is that there are four architectures, ranging from a straightforward RISC processor to a 33-bit processor implementing the 432's complex instruction set and object model.10 Each architecture adds additional functionality to the previous one:

  • The Core architecture consists of a "RISC-like" core.
  • The Numerics architecture extends Core with floating-point.
  • The Protected architecture extends Numerics with paged memory management, Supervisor/User protection, string instructions, process scheduling, interprocess communication for OS, and symmetric multiprocessing.
  • The Extended architecture extends Protected with object addressing/protection and interprocess communication for applications. This architecture used an extra tag bit, so registers, the bus, and memory were 33 bits wide instead of 32.

These four versions were sold as the KA (Core), KB (Numerics), MC (Protected), and XA (Extended). The KA chip cost $174 and the KB version cost $333 while MC was aimed at the military market and cost a whopping $2400. The most advanced chip (XA) was, at first, kept proprietary for use by BiiN (discussed below), but was later sold to the military. The military versions weren't secret, but it is very hard to find documentation on them.11

The strangest thing about these four architectures is that the chips were identical, using the same die. In other words, the simple Core chip included all the circuitry for floating point, memory management, and objects; these features just weren't used.12 The die photo below shows the die, with the main functional units labeled. Around the edge of the die are the bond pads that connect the die to the external pins. Note that the right half of the chip has almost no bond pads. As a result, the packaged IC had many unused pins.13

The i960 KA/KB/MC/XA with the main functional blocks labeled. Click this image (or any other) for a larger version. Die image courtesy of Antoine Bercovici.  Floorplan from The 80960 microprocessor architecture.

The i960 KA/KB/MC/XA with the main functional blocks labeled. Click this image (or any other) for a larger version. Die image courtesy of Antoine Bercovici. Floorplan from The 80960 microprocessor architecture.

One advanced feature of the i960 is register scoreboarding, visible in the upper-left corner of the die. The idea is that loading a register from memory is slow, so to improve performance, the processor executes the following instructions while the load completes, rather than waiting. Of course, an instruction can't be executed if it uses a register that is being loaded, since the value isn't there. The solution is a "scoreboard" that tracks which registers are valid and which are still being loaded, and blocks an instruction if the register isn't ready. The i960 could handle up to three outstanding reads, providing a significant performance gain.

The most complex i960 architecture is the Extended architecture, which provides the object-oriented system. This architecture is designed around an unforgeable pointer called an Access Descriptor that provides protected access to an object. What makes the pointer unforgeable is that it is 33 bits long with an extra bit that indicates an Access Descriptor. You can't set this bit with a regular 32-bit instruction. Instead, an Access Descriptor can only be created with a special privileged instruction, "Create AD".14

An Access Descriptor is a pointer to an object table. From BiiN Object Computing.

An Access Descriptor is a pointer to an object table. From BiiN Object Computing.

The diagram above shows how objects work. The 33-bit Access Descriptor (AD) has its tag bit set to 1, indicating that it is a valid Access Descriptor. The Rights field controls what actions can be performed by this object reference. The AD's Object Index references the Object Table that holds information about each object. In particular, the Base Address and Size define the object's location in memory and ensure that an access cannot exceed the bounds of the object. The Type Definition defines the various operations that can be performed on the object. Since this is all implemented by the processor at the instruction level, it provides strict security.

Gemini and BiiN

The i960 was heavily influenced by a partnership called Gemini and then BiiN. In 1983, near the start of the i960 project, Intel formed a partnership with Siemens to build high-performance fault-tolerant servers. In this partnership, Intel would provide the hardware while Siemens developed the software. This partnership allowed Intel to move beyond the chip market to the potentially-lucrative systems market, while adding powerful systems to Siemens' product line. The Gemini team contained many of the people from the 432 project and wanted to continue the 432's architecture. Gemini worked closely with the developers of the i960 to ensure the new processor would meet their needs; both teams worked in the same building at Intel's Jones Farm site in Oregon.

The BiiN 60 system. From BiiN 60 Technical Overview.

The BiiN 60 system. From BiiN 60 Technical Overview.

In 1988, shortly after the announcement of the i960 chips, the Intel/Siemens partnership was spun off into a company called BiiN.15 BiiN announced two high-performance, fault-tolerant, multiprocessor systems. These systems used the i960 XA processor16 and took full advantage of the object-oriented model and other features provided by its Extended architecture. The BiiN 20 was designed for departmental computing and cost $43,000 to $80,000. It supported 50 users (connected by terminals) on one 5.5-MIPS i960 processor. The larger BiiN 60 handled up to 1000 terminals and cost $345,000 to $815,000. The Unix-compatible BiiN operating system (BiiN/OS) and utilities were written in 2 million lines of Ada code.

BiiN described many potential markets for these systems: government, factory automation, financial services, on-line transaction processing, manufacturing, and health care. Unfortunately, as ExtremeTech put it, "the market for fault-tolerant Unix workstations was approximately nil." BiiN was shut down in 1989, just 15 months after its creation as profitability kept becoming more distant. BiiN earned the nickname "Billions invested in Nothing"; the actual investment was 1700 person-years and $430 million.

The superscalar i960 CA

One year after the first i960, Intel released the groundbreaking i960 CA. This chip was the world's first superscalar microprocessor, able to execute more than one instruction per clock cycle. The chip had three execution units that could operate in parallel: an integer execution unit, a multiply/divide unit, and an address generation unit that could also do integer arithmetic.17 To keep the execution units busy, the i960 CA's instruction sequencer examined four instructions at once and determined which ones could be issued in parallel without conflict. It could issue two instructions and a branch each clock cycle, using branch prediction to speculatively execute branches out of order.

The i960 CA die, with functional blocks labeled. Photo courtesy of Antoine Bercovici. Functional blocks from the datasheet.

The i960 CA die, with functional blocks labeled. Photo courtesy of Antoine Bercovici. Functional blocks from the datasheet.

Following the CA, several other superscalar variants were produced: the CF had more cache, the military MM implemented the Protected architecture (memory management and a floating point unit), and the military MX implemented the Extended architecture (object-oriented).

The image below shows the 960 MX die with the main functional blocks labeled. (I think the MM and MX used the same die but I'm not sure.18) Like the i960 CA, this chip has multiple functional units that can be operated in parallel for its superscalar execution. Note the wide buses between various blocks, allowing high internal bandwidth. The die was too large for the optical projection of the mask, with the result that the corners of the circuitry needed to be rounded off.

The i960MX die with the main functional blocks labeled. This is a die photo I took, with labels based on my reverse engineering.

The i960MX die with the main functional blocks labeled. This is a die photo I took, with labels based on my reverse engineering.

The block diagram of the i960 MX shows the complexity of the chip and how it is designed for parallelism. The register file is the heart of the chip. It is multi-ported so up to 6 registers can be accessed at the same time. Note the multiple, 256-bit wide buses between the register file and the various functional units. The chip has two buses: a high-bandwidth Backside Bus between the chip and its external cache and private memory; and a New Local Bus, which runs at half the speed and connects the chip to main memory and I/O. For highest performance, the chip's software would access its private memory over the high-speed bus, while using the slower bus for I/O and shared memory accesses.

A functional block diagram of the i960 MX. From Intel Military and Special Projects Handbook, 1993.

A functional block diagram of the i960 MX. From Intel Military and Special Projects Handbook, 1993.

Military use and the JIAWG standard

The i960 had a special role in the US military. In 1987 the military mandated the use of Ada as the single, common computer programming language for Defense computer resources in most cases.19 In 1989, the military created the JIAWG standard, which selected two 32-bit instruction set architectures for military avionics. These architectures were the i960's Extended architecture (implemented by the i960 XA) and the MIPS architecture (based on a RISC project at Stanford).20 The superscalar i960 MX processor described earlier soon became a popular JIAWG-compliant processor, since it had higher performance than the XA.

Hughes designed a modular avionics processor that used the i960 XA and later the MX. A dense module called the HAC-32 contained two i960 MX processors, 2 MB of RAM, and an I/O controller in a 2"×4" multi-chip module, slightly bigger than a credit card. This module had bare dies bonded to the substrate, maximizing the density. In the photo below, the two largest dies are the i960 MX while the numerous gray rectangles are memory chips. This module was used in F-22's Common Integrated Processor, the RAH-66 Comanche helicopter (which was canceled), the F/A-18's Stores Management Processor (the computer that controls attached weapons), and the AN/ALR-67 radar computer.

The Hughes HAC-32. From Avionics Systems Design.

The Hughes HAC-32. From Avionics Systems Design.

The military market is difficult due to the long timelines of military projects, unpredictable volumes, and the risk of cancellations. In the case of the F-22 fighter plane, the project started in 1985 when the Air Force sent out proposals for a new Advanced Tactical Fighter. Lockheed built a YF-22 prototype, first flying it in 1990. The Air Force selected the YF-22 over the competing YF-23 in 1991 and the project moved to full-scale development. During this time, at least three generations of processors became obsolete. In particular, the i960MX was out of production by the time the F-22 first flew in 1997. At one point, the military had to pay Intel $22 million to restart the i960 production line. In 2001, the Air Force started a switch to the PowerPC processor, and finally the plane entered military service in 2005. The F-22 illustrates how the fast-paced obsolescence of processors is a big problem for decades-long military projects.

The Common Integrated Processor for the F-22, presumably with i960 MX chips inside. It is the equivalent of two Cray supercomputers and was the world's most advanced, high-speed computer system for a fighter aircraft. Source: NARA/Hughes Aircraft Co./T.W. Goosman.

The Common Integrated Processor for the F-22, presumably with i960 MX chips inside. It is the equivalent of two Cray supercomputers and was the world's most advanced, high-speed computer system for a fighter aircraft. Source: NARA/Hughes Aircraft Co./T.W. Goosman.

Intel charged thousands of dollars for each i960 MX and each F-22 contained a cluster of 35 i960 MX processors, so the military market was potentially lucrative. The Air Force originally planned to buy 750 planes, but cut this down to just 187, which must have been a blow to Intel. As for the Comanche helicopter, the Army planned to buy 1200 of them, but the program was canceled entirely after building two prototypes. The point is that the military market is risky and low volume even in the best circumstances.21 In 1998, Intel decided to leave the military business entirely, joining AMD and Motorola.

Foreign militaries also made use of the i960. In 2008 a businessman was sentenced to 35 months in prison for illegally exporting hundreds of i960 chips into India for use in the radar for the Tejas Light Combat Aircraft.

i960: the later years

By 1990, the i960 was selling well, but the landscape at Intel had changed. The 386 processor was enormously successful, due to the Compaq Deskpro 386 and other systems, leading to Intel's first billion-dollar quarter. The 8086 had started as a stopgap processor to fill a temporary marketing need, but now the x86 was Intel's moneymaking engine. As part of a reorganization, the i960 project was transferred to Chandler, Arizona. Much of the i960 team in Oregon moved to the newly-formed Pentium Pro team, while others ended up on the 486 DX2 processor. This wasn't the end of the i960, but the intensity had reduced.

To reduce system cost, Intel produced versions of the i960 that had a 16-bit bus, although the processor was 32 bits internally. (This is the same approach that Intel used with the 8088 processor, a version of the 8086 processor with an 8-bit bus instead of 16.) The i960 SB had the "Numerics" architecture, that is, with a floating-point unit. Looking at the die below, we can see that the SB design is rather "lazy", simply the previous die (KA/KB/MC/XA) with a thin layer of circuitry around the border to implement the 16-bit bus. Even though the SB didn't support memory management or objects, Intel didn't remove that circuitry. The process was reportedly moved from 1.5 microns to 1 micron, shrinking the die to 270 mils square.

Comparison of the original i960 die and the i960 SB. Photos courtesy of Antoine Bercovici.

Comparison of the original i960 die and the i960 SB. Photos courtesy of Antoine Bercovici.

The next chip, the i960 SA, was the 16-bit-bus "Core" architecture, without floating point. The SA was based on the SB but Intel finally removed unused functionality from the die, making the die about 24% smaller. The diagram below shows how the address translation, translation lookaside buffer, and floating point unit were removed, along with much of the microcode (yellow). The instruction cache tags (purple), registers (orange), and execution unit (green) were moved to fit into the available space. The left half of the chip remained unchanged. The driver circuitry around the edges of the chip was also tightened up, saving a bit of space.

This diagram compares the SB and SA chips. Photos courtesy of Antoine Bercovici.

This diagram compares the SB and SA chips. Photos courtesy of Antoine Bercovici.

Intel introduced the high-performance Hx family around 1994. This family was superscalar like the CA/CF, but the Hx chips also had a faster clock, had much more cache, and included additional functionality such as timers and a guarded memory unit. The Jx family was introduced as the midrange, cost-effective line, faster and better than the original chips but not superscalar like the Hx. Intel attempted to move the i960 into the I/O controller market with the Rx family and the VH.23 This was part of Intel's Intelligent Input/Output specification (I2O), which was a failure overall.

For a while, the i960 was a big success in the marketplace and was used in many products. Laser printers and graphical terminals were key applications, both taking advantage of the i960's high speed to move pixels. The i960 was the world's best-selling RISC chip in 1994. However, without focused development, the performance of the i960 fell behind the competition, and its market share rapidly dropped.

Market share of embedded RISC processors. From ExtremeTech.

Market share of embedded RISC processors. From ExtremeTech.

By the late 1990s, the i960 was described with terms such as "aging", "venerable", and "medieval". In 1999, Microprocessor Report described the situation: "The i960 survived on cast-off semiconductor processes two to three generations old; the i960CA is still built in a 1.0-micron process (perhaps by little old ladies with X-Acto knives)."22

One of the strongest competitors was DEC's powerful StrongARM processor design, a descendant of the ARM chip. Even Intel's top-of-the-line i960HT fared pitifully against the StrongARM, with worse cost, performance, and power consumption. In 1997, DEC sued Intel, claiming that the Pentium infringed ten of DEC's patents. As part of the complex but mutually-beneficial 1997 settlement, Intel obtained rights to the StrongARM chip. As Intel turned its embedded focus from i960 to StrongARM, one writer wrote, "Things are looking somewhat bleak for Intel Corp's ten-year-old i960 processor." The i960 limped on for another decade until Intel officially ended production in 2007.

RISC or CISC?

The i960 challenges the definitions of RISC and CISC processors.24 It is generally considered a RISC processor, but its architect says "RISC techniques were used for high performance, CISC techniques for ease of use."25 John Mashey of MIPS described it as on the RISC/CISC border26 while Steve Furber (co-creator of ARM) wrote that it "includes many RISC ideas, but it is not a simple chip" with "many complex instructions which make recourse to microcode" and a design that "is more reminiscent of a complex, mainframe architecture than a simple, pipelined RISC." And they were talking about the i960 KB with the simple Numerics architecture, not the complicated Extended architecture!

Even the basic Core architecture has many non-RISC-like features. It has microcoded instructions that take multiple cycles (such as integer multiplication), numerous addressing modes27, and unnecessary instructions (e.g. AND NOT as well as NOT AND). It also has a large variety of datatypes, even more than the 432: integer (8, 16, 32, or 64 bit), ordinal (8, 16, 32, or 64 bit), decimal digits, bit fields, triple-word (96 bits), and quad-word (128 bits). The Numerics architecture adds floating-point reals (32, 64, or 80 bit) while the Protected architecture adds byte strings with decidedly CISC-like instructions to act on them.28

When you get to the Extended architecture with objects, process management, and interprocess communication instructions, the large instruction set seems obviously CISC.29 (The instruction set is essentially the same as 432 and the 432 is an extremely CISC processor.) You could argue that the i960 Core architecture is RISC and the Extended architecture is CISC, but the problem is that they are identical chips.

Of course, it doesn't really matter if the i960 is considered RISC, CISC, or CISC instructions running on a RISC core. But the i960 shows that RISC and CISC aren't as straightforward as they might seem.

Summary

The i960 chips can be confusing since there are four architectures, along with scalar vs. superscalar, and multiple families over time. I've made the table below to summarize the i960 family and the approximate dates. The upper entries are the scalar families while the lower entries are superscalar. The columns indicate the four architectural variants; although the i960 started with four variants, eventually Intel focused on only the Core. Note that each "x" family represents multiple chips.

CoreNumericsProtectedExtended
KAKBMCXAOriginal (1988)
SASB  Entry level, 16-bit data bus (1991)
Jx   Midrange (1993-1998)
Rx,VH   I/O interface (1995-2001)
CA,CF MMMXSuperscalar (1989-1992)
Hx   Superscalar, higher performance (1994)

Although the i960 is now mostly forgotten, it was an innovative processor for the time. The first generation was Intel's first RISC chip, but pushed the boundary of RISC with many CISC-like features. The i960 XA literally set the standard for military computing, selected by the JIAWG as the military's architecture. The i960 CA provided a performance breakthrough with its superscalar architecture. But Moore's Law means that competitors can rapidly overtake a chip, and the i960 ended up as history.

Thanks to Glen Myers, Kevin Kahn, Steven McGeady, and others from Intel for answering my questions about the i960. Thanks to Prof. Paul Lubeck for obtaining documentation for me. I plan to write more, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. The 432 used two chips for the processor and a third chip for I/O. At the time, these were said to be "three of the largest integrated circuits in history." The first processor chip contained more than 100,000 devices, making it "one of the densest VLSI circuits to have been fabricated so far." The article also says that the 432 project "was the largest investment in a single program that Intel has ever made." See Ada determines architecture of 32-bit microprocessor, Electronics, Feb 24, 1981, pages 119-126, a very detailed article on the 432 written by the lead engineer and the team's manager. 

  2. The performance problems of the iAPX 432 were revealed by a student project at Berkeley, A performance evaluation of the Intel iAPX 432, which compared its performance with the VAX-11/780, Motorola 68000, and Intel 8086. Instead of providing mainframe performance, the 432 had a fraction of the performance of the competing systems. Another interesting paper is Performance effects of architectural complexity in the Intel 432, which examines in detail what the 432 did wrong. It concludes that the 432 could have been significantly faster, but would still have been slower than its contemporaries. An author of the paper was Robert Colwell, who was later hired by Intel and designed the highly-successful Pentium Pro architecture. 

  3. You might expect the 8086, 186, and 286 processors to form a nice progression, but it's a bit more complicated. The 186 and 286 processors were released at the same time. The 186 essentially took the 8086 and several support chips and integrated them onto a single die. The 286, on the other hand, extended the 8086 with memory management. However, its segment-based memory management was a bad design, using ideas from the Zilog MMU, and wasn't popular. The 286 also had a protected mode, so multiple processes could be isolated from each other. Unfortunately, protected mode had some serious problems. Bill Gates famously called the 286 "brain-damaged" echoing PC Magazine editor Bill Machrone and writer Jerry Pournelle, who both wanted credit for originating the phrase.

    By 1984, however, the 286 was Intel's star due to growing sales of IBM PCs and compatibles that used the chip. Intel's 1984 annual report featured "The Story of the 286", a glowing 14-page tribute to the 286. 

  4. Given IBM's success with IBM PC, Intel was puzzled that IBM wasn't interested in the 386 processor. It turned out that IBM had a plan to regain control of the PC so they could block out competitors that were manufacturing IBM PC compatibles. IBM planned to reverse-engineer Intel's 286 processor and build its own version. The computers would run the OS/2 operating system instead of Windows and use the proprietary Micro Channel architecture. However, the reverse-engineering project failed and IBM eventually moved to the Intel 386 processor. The IBM PS/2 line of computers, released in 1987, followed the rest of the plan. However, the PS/2 line was largely unsuccessful; rather than regaining control over the PC, IBM ended up losing control to companies such as Compaq and Dell. (For more, see Creating the Digital Future, page 131.) 

  5. The 386 team created an oral history that describes the development of the 386 in detail. Pages 5, 6, and 19 are most relevant to this post. 

  6. You might wonder why the processor was codenamed P4, since logically P4 should indicate the 486. Confusingly, Intel's processor codenames were not always sequential and they sometimes reused numbers. The numbers apparently started with P0, the codename for the Optimal Terminal Kit, a processor that didn't get beyond early planning. P5 was used for the 432, P4 for the planned follow-on, P7 for the i960, P10 for the i960 CA, and P12 for the i960 MX. (Apparently they thought that x86 wouldn't ever get to P4.)

    For the x86 processors, P1 through P6 indicated the 186, 286, 386, 486, 586, Pentium, and Pentium Pro as you'd expect. (The Pentium used a variety of codes for various versions, such as P54C, P24T, and P55C; I don't understand the pattern behind these.) For some reason, the i386SX was the P9 and the i486SX was the P23 and the i486DX2 was the P24. The Pentium 4 Willamette was the first new microarchitecture (NetBurst) since P6 so it was going to be P7, but Itanium took the P7 name codename so Willamette became P68. After that, processors were named after geographic features, avoiding the issues with numeric codenames.

    Other types of chips used different letter prefixes. The 387 numeric coprocessor was the N3. The i860 RISC processor was originally the N10, a numeric co-processor. The follow-on i860 XP was the N11. Support chips for the 486 included the C4 cache chip and the unreleased I4 interrupt controller. 

  7. At the time, Intel had a family of 16-bit embedded microcontrollers called MCS-96 featuring the 8096. The 80960 name was presumably chosen to imply continuity with the 8096 16-bit microcontrollers (MCS-96), even though the 8096 and the 80960 are completely different. (I haven't been able to confirm this, though.) Intel started calling the chip the "i960" around 1989. (Intel's chip branding is inconsistent: from 1987 to 1991, Intel's annual reports called the 386 processor the 80386, the 386, the Intel386, and the i386. I suspect their trademark lawyers were dealing with the problem that numbers couldn't be trademarked, which was the motivation for the "Pentium" name rather than 586.)

    Note that the i860 processor is completely unrelated to the i960 despite the similar numbers. They are both 32-bit RISC processors, but are architecturally unrelated. The i860 was targeted at high-performance workstations, while the i960 was targeted at embedded applications. For details on the i860, see The first million-transistor chip

  8. The Intel 80387 floating-point coprocessor chip used the same floating-point unit as the i960. The diagram below shows the 80387; compare the floating-point unit in the lower right corner with the matching floating-point unit in the i960 KA or SB die photo.

    The 80837 floating-point coprocessor with the main functional blocks labeled. Die photo courtesy of Antoine Bercovici. 80387 floor plan from The 80387 and its applications.

    The 80837 floating-point coprocessor with the main functional blocks labeled. Die photo courtesy of Antoine Bercovici. 80387 floor plan from The 80387 and its applications.

     

  9. I compared the instruction sets of the 432 and i960 and the i960 Extended instruction set seems about as close as you could get to the 432 while drastically changing the underlying architecture. If you dig into the details of the object models, there are some differences. Some instructions also have different names but the same function. 

  10. The first i960 chips were described in detail in the 1988 book The 80960 microprocessor architecture by Glenford Myers (who was responsible for the 80960 architecture at Intel) and David Budde (who managed the VLSI development of the 80960 components). This book discussed three levels of architecture (Core, Numerics, and Protected). The book referred to the fourth level, the Extended architecture (XA), calling it "a proprietary higher level of the architecture developed for use by Intel in system products" and did not discuss it further. These "system products" were the systems being developed at BiiN. 

  11. I could find very little documentation on the Extended architecture. The 80960XA datasheet provides a summary of the instruction set. The i960 MX datasheet provides a similar summary; it is in the Intel Military and Special Products databook, which I found after much difficulty. The best description I could find is in the 400-page BiiN CPU architecture reference manual. Intel has other documents that I haven't been able to find anywhere: i960 MM/MX Processor Hardware Reference Manual, and Military i960 MM/MX Superscalar Processor. (If you have one lying around, let me know.)

    The 80960MX Specification Update mentions a few things about the MX processor. My favorite is that if you take the arctan of a value greater than 32768, the processor may lock up and require a hardware reset. Oops. The update also says that the product is sold in die and wafer form only, i.e. without packaging. Curiously, earlier documentation said the chip was packaged in a 348-pin ceramic PGA package (with 213 signal pins and 122 power/ground pins). I guess Intel ended up only supporting the bare die, as in the Hughes HAC-32 module. 

  12. According to people who worked on the project, there were not even any bond wire changes or blown fuses to distinguish the chips for the four different architectures. It's possible that Intel used binning, selling dies as a lower architecture if, for example, the floating point unit failed testing. Moreover, the military chips presumably had much more extensive testing, checking the military temperature range for instance. 

  13. The original i960 chips (KA/KB/MC/XA) have a large number of pins that are not connected (marked NC on the datasheet). This has led to suspicious theorizing, even on Wikipedia, that these pins were left unconnected to control access to various features. This is false for two reasons. First, checking the datasheets shows that all four chips have the same pinout; there are no pins connected only in the more advanced versions. Second, looking at the packaged chip (below) explains why so many pins are unconnected: much of the chip has no bond pads, so there is nothing to connect the pins to. In particular, the right half of the die has only four bond pads for power. This is an unusual chip layout, but presumably the chip's internal buses made it easier to put all the connections at the left. The downside is that the package is more expensive due to the wasted pins, but I expect that BiiN wasn't concerned about a few extra dollars for the package.

    The i960 MC die, bonded in its package. Photo courtesy of Antoine Bercovici.

    The i960 MC die, bonded in its package. Photo courtesy of Antoine Bercovici.

    But you might wonder: the simple KA uses 32 bits and the complex XA uses 33 bits, so surely there must be another pin for the 33rd bit. It turns out that pin F3 is called CACHE on the KA and CACHE/TAG on the XA. The pin indicates if an access is cacheable, but the XA uses the pin during a different clock cycle to indicate whether the 32-bit word is data or an access descriptor (unforgeable pointer).

    So how does the processor know if it should use the 33-bit object mode or plain 32-bit mode? There's a processor control word called Processor Controls, that includes a Tag Enable bit. If this bit is set, the processor uses the 33rd bit (the tag bit) to distinguish Access Descriptors from data. If the bit is clear, the distinction is disabled and the processor runs in 32-bit mode. (See BiiN CPU Architecture Reference Manual section 16.1 for details.) 

  14. The 432 and the i960 both had unforgeable object references, the Access Descriptor. However, the two processors implemented Access Descriptors in completely different ways, which is kind of interesting. The i960 used a 33rd bit as a Tag bit to distinguish an Access Descriptor from a regular data value. Since the user didn't have access to the Tag bit, the user couldn't create or modify Access Descriptors. The 432, on the other hand, used standard 32-bit words. To protect Access Descriptors, each object was divided into two parts, each protected by a length field. One part held regular data, while the other part held Access Descriptors. The 432 had separate instructions to access the two parts of the object, ensuring that regular instructions could not tamper with Access Descriptors. 

  15. The name "BiiN" was developed by Lippincott & Margulies, a top design firm. The name was designed for a strong logo, as well as referencing binary code (so it was pronounced as "bine"). Despite this pedigree, "BiiN" was called one of the worst-sounding names in the computer industry, see Losing the Name Game

  16. Some sources say that BiiN used the i960 MX, not the XA, but they are confused. A paper from BiiN states that BiiN used the 80960 XA. (Sadly, BiiN was so short-lived that the papers introducing the BiiN system also include its demise.) Moreover, BiiN shut down in 1989 while the i960 MX was introduced in 1990, so the timeline doesn't work. 

  17. The superscalar i960 architecture is described in detail in The i960CA SuperScalar implementation of the 80960 architecture and Inside Intel's i960CA superscalar processor while the military MM version is described in Performance enhancements in the superscalar i960MM embedded microprocessor

  18. I don't have a die photo of the i960 MM, so I'm not certain of the relationship between the MM and the MX. The published MM die size is approximately the same as the MX. The MM block diagram matches the MX, except using 32 bits instead of 33. Thus, I think the MM uses the MX die, ignoring the Extended features, but I can't confirm this. 

  19. The military's Ada mandate remained in place for a decade until it was eliminated in 1997. Ada continues to be used by the military and other applications that require high reliability, but by now C++ has mostly replaced it. 

  20. The military standard was decided by the Joint Integrated Avionics Working Group, known as JIAWG. Earlier, in 1980, the military formed a 16-bit computing standard, MIL-STD-1750A. The 1750A standard created a new architecture, and numerous companies implemented 1750A-compatible processors. Many systems used 1750A processors and overall it was more successful than the JIAWG standard. 

  21. Chip designer and curmudgeon Nick Tredennick described the market for Intel's 960MX processor: "Intel invested considerable money and effort in the design of the 80960MX processor, for which, at the time of implementation, the only known application was the YF-22 aircraft. When the only prototype of the YF-22 crashed, the application volume for the 906MX actually went to zero; but even if the program had been successful, Intel could not have expected to sell more than a few thousand processors for that application." 

  22. In the early 1970s, chip designs were created by cutting large sheets of Rubylith film with X-Acto knives. Of course, that technology was long gone by the time of the i960.

    Intel photo of two women cutting Rubylith.

    Intel photo of two women cutting Rubylith.

     

  23. The Rx I/O processor chips combined a Jx processor core with a PCI bus interface and other hardware. The RM and RN versions were introduced in 2000 with a hardware XOR engine for RAID disk array parity calculations. The i960 VH (1998) was similar to Rx, but had only one PCI bus, no APIC bus, and was based on the JT core. The 80303 (2000) was the end of the i960 I/O processors. The 80303 was given a numeric name instead of an i960 name because Intel was transitioning from i960 to XScale at the time. The numeric name makes it look like a smooth transition from the 80303 (i960) I/O processor to the XScale-based I/O processors such as the 80333. The 803xx chips were also called IOP3xx (I/O Processor); some were chipsets with a separate XScale processor chip and an I/O companion chip.  

  24. Although the technical side of RISC vs. CISC is interesting, what I find most intriguing is the "social history" of RISC: how did a computer architecture issue from the 1980s become a topic that people still vigorously argue over 40 years later? I see several factors that keep the topic interesting:

    • RISC vs. CISC has a large impact on not only computer architects but also developers and users.
    • The topic is simple enough that everyone can have an opinion. It's also vague enough that nobody agrees on definitions, so there's lots to argue about.
    • There are winners and losers, but no resolution. RISC sort of won in the sense that almost all new instruction set architectures have been RISC. But CISC has won commercially with the victory of x86 over SPARC, PowerPC, Alpha, and other RISC contenders. But ARM dominates mobile and is moving into personal computers through Apple's new processors. If RISC had taken over in the 1980s as expected, there wouldn't be anything to debate. But x86 has prospered despite the efforts of everyone (including Intel) to move beyond it.
    • RISC vs. CISC takes on a "personal identity" aspect. For instance, if you're an "Apple" person, you're probably going to be cheering for ARM and RISC. But nobody cares about branch prediction strategies or caching.

    My personal opinion is that it is a mistake to consider RISC and CISC as objective, binary categories. (Arguing over whether ARM or the 6502 is really RISC or CISC is like arguing over whether a hotdog is a sandwich RISC is more of a social construct, a design philosophy/ideology that leads to a general kind of instruction set architecture that leads to various implementation techniques.

    Moreover, I view RISC vs. CISC as mostly irrelevant since the 1990s due to convergence between RISC and CISC architectures. In particular, the Pentium Pro (1995) decoded CISC instructions into "RISC-like" micro-operations that are executed by a superscalar core, surprising people by achieving RISC-like performance from a CISC processor. This has been viewed as a victory for CISC, a victory for RISC, nothing to do with RISC, or an indication that RISC and CISC have converged. 

  25. The quote is from Microprocessor Report April 1988, "Intel unveils radical new CPU family", reprinted in "Understanding RISC Microprocessors". 

  26. John Mashey of MIPS wrote an interesting article "CISCs are Not RISCs, and Not Converging Either" in the March 1992 issue of Microprocessor Report, extending a Usenet thread. It looks at multiple quantitative factors of various processors and finds a sharp line between CISC processors and most RISC processors. The i960, Intergraph Clipper, and (probably) ARM, however, were "truly on the RISC/CISC border, and, in fact, are often described that way." 

  27. The i960 datasheet lists an extensive set of addressing modes, more than typical for a RISC chip:

    • 12-bit offset
    • 32-bit offset
    • Register-indirect
    • Register + 12-bit offset
    • Register + 32-bit offset
    • Register + index-register×scale-factor
    • Register×scale-factor + 32-bit displacement
    • Register + index-register×scale-factor + 32-bit displacement
    where the scale-factor is 1, 2, 4, 8, or 16.

    See the 80960KA embedded 32-bit microprocessor datasheet for more information. 

  28. The i960 MC has string instructions that move, scan, or fill a string of bytes with a specified length. These are similar to the x86 string operations, but these are very unusual for a RISC processor. 

  29. The iAPX 432 instruction set is described in detail in chapter 10 of the iAPX 432 General Data Processor Architecture Reference Manual; the instructions are called "operators". The i960 Protected instruction set is listed in the 80960MC Programmer's Reference Manual while the i960 Extended instruction set is described in the BiiN CPU architecture reference manual.

    The table below shows the instruction set for the Extended architecture, the full set of object-oriented instructions. The instruction set includes typical RISC instructions (data movement, arithmetic, logical, comparison, etc), floating point instructions (for the Numerics architecture), process management instructions (for the Protected architecture), and the Extended object instructions (Access Descriptor operations). The "Mixed" instructions handle 33-bit values that can be either a tag (object pointer) or regular data. Note that many of these instructions have separate opcodes for different datatypes, so the complete instruction set is larger than this list, with about 240 opcodes.

    The Extended instruction set, from the i960 XA datasheet. Click for a larger version.

    The Extended instruction set, from the i960 XA datasheet. Click for a larger version.
❌
❌