❌

Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

Talking to memory: Inside the Intel 8088 processor's bus interface state machine

28 April 2024 at 00:47

In 1979, Intel introduced the 8088 microprocessor, a variant of the 16-bit 8086 processor. IBM's decision to use the 8088 processor in the IBM PC (1981) was a critical point in computer history, leading to the success of the x86 architecture. The designers of the IBM PC selected the 8088 for multiple reasons, but a key factor was that the 8088 processor's 8-bit bus was similar to the bus of the 8085 processor.1 The designers were familiar with the 8085 since they had selected it for the IBM System/23 Datamaster, a now-forgotten desktop computer, making the more-powerful 8088 processor an easy choice for the IBM PC.

The 8088 processor communicates over the bus with memory and I/O devices through a highly-structured sequence of steps called "T-states." A typical 8088 bus cycle consists of four T-states, with one T-state per clock cycle. Although a four-step bus cycle may sound straightforward, its implementation uses a complicated state machine making it one of the most difficult parts of the 8088 to explain. First, the 8088 has many special cases that complicate the bus cycle. Moreover, the bus cycle is really six steps, with two undocumented "extra" steps to make bus operations more efficient. Finally, the complexity of the bus cycle is largely arbitrary, a consequence of Intel's attempts to make the 8088's bus backward-compatible with the earlier 8080 and 8085 processors. However, investigating the bus cycle circuitry in detail provides insight into the timing of the processor's instructions. In addition, this circuitry illustrates the tradeoffs and implementation decisions that are necessary in a production processor. In this blog post, I look in detail at the circuitry that implements this state machine.

By examining the die of the 8088 microprocessor, I could reverse engineer the bus circuitry. The die photo below shows the 8088 microprocessor's silicon die under a microscope. Most visible in the photo is the metal layer on top of the chip, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below, with the two units running largely independently. The BIU handles bus communication (memory and I/O accesses), while the Execution Unit (EU) executes instructions. In the diagram, I've labeled the processor's key functional blocks. This article focuses on the bus state machine, highlighted in red, but other parts of the Bus Interface Unit will also play a role.

The 8088 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8088 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Although I'm focusing on the 8088 processor in this blog post, the 8086 is mostly the same. The 8086 and 8088 processors present the same 16-bit architecture to the programmer. The key difference is that the 8088 has an 8-bit data bus for communication with memory and I/O, rather than the 16-bit bus of the 8086. For the most part, the 8086 and 8088 are very similar internally, apart from trivial but numerous layout changes on the die. In this article, I'm focusing on the 8088 processor, but most of the description applies to the 8086 as well. Instead of constantly saying "8086/8088", I'll refer to the 8088 and try to point out places where the 8086 is different.

The bus cycle

In this section, I'll describe the basic four-step bus cycles that the 8088 performs.2 To start, the diagram below shows the states for a write cycle (slightly simplified3), when the 8088 writes to memory or an I/O device. The external bus activity is organized as four "T-states", each one clock cycle long and called T1, T2, T3, and T4, with specific actions during each state. During T1, the 8088 outputs the address on the pins. During the T2, T3, and T4 states, the 8088 outputs the data word on the same pins. The external memory or I/O device uses the T states to know when it is receiving address information or data over the bus lines.

A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

For a read, the bus cycle is slightly different from the write cycle, but uses the same four T-states. During T1, the address is provided on the pins, the same as for a write. After that, however, the processor's data pins are "tri-stated" so they float electrically, allowing the external memory to put data on the bus. The processor reads the data at the end of the T3 state.

A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

The purpose of the bus state machine is to move through these four T states for a read or a write. This process may seem straightforward, but (as is usually the case with the 8088) many complications make this process anything but easy. In the next sections, I'll discuss these complications. After that, I'll explain the state machine circuitry with a schematic.

Address calculation

One of the notable (if not hated) features of the 8088 processor is segmentation: the processor supports 1 megabyte of memory, but memory is partitioned into segments of 64 KB for compatibility with the earlier 8080 and 8085 processors. The 8088 calculates each 20-bit memory address by adding the value of a segment register to a 16-bit offset. This calculation is done by a dedicated address adder in the Bus Interface Unit, completely separate from the chip's ALU. (This address adder can be spotted in the upper left of the earlier die photo.)

Calculating the memory address complicates the bus cycle. As the timing diagrams above show, the processor issues the memory address during state T1 of the bus cycle. However, it takes time to perform the address calculation addition, so the address calculation must take place before T1. To accomplish this, there are two "invisible" bus states before T1; I call these states "TS" (T-start) and "T0". During these states, the Bus Interface Unit uses the address adder to compute the address, so the address will be available during the T1 state. These states are invisible to the external circuitry because they don't affect the signals from the chip.

Thus, a single memory operation takes six clock cycles: two preparatory cycles to compute the address before the four visible cycles. However, if multiple memory operations are performed, the operations are overlapped to achieve a degree of pipelining that improves performance. Specifically, the address calculation for the next memory operation takes place during the last two clock cycles of the current memory operation, saving two clock cycles. That is, for consecutive bus cycles, T3 and T4 of one bus cycle overlap with TS and T0 of the next cycle. In other words, during T3 and T4 of one bus cycle, the memory address gets computed for the next bus cycle. This pipelining significantly improves the performance of the 8088, compared to taking 6 clock cycles for each bus cycle.

With this timing, the address adder is free during cycles T1 and T2. To improve performance in another way, the 8088 uses the adder during this idle time to increment or decrement memory addresses. For instance, after popping a word from the stack, the stack pointer needs to be incremented by 2.5 Another case is block move operations (string operations), which need to increment or decrement the pointers each step. By using the address adder, the new pointer value is calculated "for free" as part of the memory cycle, without using the processors regular ALU.4

Address corrections

The address adder is used in one more context: correcting the Instruction Pointer value. Conceptually, the Instruction Pointer (or Program Counter) register points to the next instruction to execute. However, since the 8088 prefetches instructions, the Instruction Pointer indicates the next instruction to be fetched. Thus, the Instruction Pointer typically runs ahead of the "real" value. For the most part, this doesn't matter. This discrepancy becomes an issue, though, for a subroutine call, which needs to push the return address. It is also an issue for a relative branch, which jumps to an address relative to the current execution position.

To support instructions that need the next instruction address, the 8088 implements a micro-instruction CORR, which corrects the Instruction Pointer. This micro-instruction subtracts the length of the prefetch queue from the Instruction Pointer to determine the "real" Instruction Pointer. This subtraction is performed by the address adder, using correction constants that are stored in a small Constant ROM.

The tricky part is ensuring that using the address adder for correction doesn't conflict with other uses of the adder. The solution is to run a special shortened memory cycleβ€”just the TS and T0 statesβ€”while the CORR micro-instruction is performed.6 These states block a regular memory cycle from starting, preventing a conflict over the address adder.

A closeup of the address adder circuitry in the 8086. From my article on the adder.

A closeup of the address adder circuitry in the 8086. From my article on the adder.

Prefetching

The 8088 prefetches instructions before they are needed, loading instructions from memory into a 4-byte prefetch queue. Prefetching usually improves performance, but can result in an instruction's memory access being delayed by a prefetch, hurting overall performance. To minimize this delay, a bus request from an instruction will preempt a prefetch, even if the prefetch has gone through TS and T0. At that point, the prefetch hasn't created any bus activity yet (which first happens in T1), so preempting the prefetch can be done cleanly. To preempt the prefetch, the bus cycle state machine jumps back to TS, skipping over T1 through T4, and starting the desired access.

A prefetch will also be preempted by the micro-instruction that stops prefetching (SUSP) or the micro-instruction that corrects addresses (CORR). In these cases, there is no point in completing the prefetch, so the state machine cycle will end with T0.

Wait states

One problem with memory accesses is that the memory may be slower than the system's clock speed, a characteristic of less-expensive memory chips. The solution in the 1970s was "wait states". If the memory couldn't respond fast enough, it would tell the processor to add idle clock cycles called wait states, until the memory could respond.7 To produce a wait state, the memory (or I/O device) lowers the processor's READY pin until it is ready to proceed. During this time, the Bus Interface Unit waits, although the Execution Unit continues operation if possible. Although Intel's documentation gives the wait cycle a separate name (Tw), internally the wait is implemented by repeating the T3 state as long as the READY pin is not active.

Halts

Another complication is that the 8088 has a HALT instruction that halts program execution until an interrupt comes in. One consequence is that HALT stops bus operations (specifically prefetching, since stopping execution will automatically stop instruction-driven bus operations). A complication is that the 8088 indicates the HALT state to external devices by performing a special T1 bus cycle without any following bus cycles. But wait: there's another complication. External devices can take control of the bus through the HOLD functionality, allowing external devices to perform operations such as DMA (Direct Memory Access). When the device ends the HOLD, the 8088 performs another special T1 bus cycle, indicating that the HALT is still in effect. Thus, the bus state machine must generate these special T1 states based on HALT and HOLD actions. (I discussed the HALT process in detail here.)

Putting it all together: the state diagram

The state diagram below summarizes the different types of bus cycles. Each circle indicates a specific T-state, and the arrows indicate the transitions between states. The green line shows the basic bus cycle or cycles, starting in TS and then going around the cycle. From T3, a new cycle can start with T0 or the cycle will end with T4. Thus, new cycles can start every four clocks, but a full cycle takes six states (counting the "invisible" TS and T0). The brown line shows that the bus cycle will stay in T3 as long as there is a wait state. The red line shows the two cycles for a CORR correction, while the purple line shows the special T1 state for a HALT instruction. The cyan line shows that a prefetch cycle can be preempted after T0; the cycle will either restart at TS or end.

A state diagram showing the basic bus cycle and various complications.

A state diagram showing the basic bus cycle and various complications.

I'm showing states TS and T3 together since they overlap but aren't the same. Likewise, I'm showing T4 and T0 together. T4 is grayed out because it doesn't exist from the state machine's perspective; the circuitry doesn't take any particular action during T4.

The schematic below shows the implementation of the state machine. The four flip-flops represent the four states, with one flip-flop active at a time, generating states T0, T1, T2, and T3 (from top to bottom). Each output feeds into the logic for the next state, with T3 wrapping back to the top, so the circuit moves through the states in sequence. The flip-flops are clocked so the active state will move from one flip-flop to the next according to the system clock. State TS doesn't have its own flip-flop, but is represented by the input to the T0 flip-flop, so it happens one clock cycle earlier.8 State T4 doesn't have a flip-flop since it isn't "real" to the bus state machine. The logic gates handle the special cases: blocking the state transfer if necessary or starting a state.

Schematic of the state machine.

Schematic of the state machine.

I'll explain the logic for each state in more detail. The circuitry for the TS state has two AND gates to generate new bus cycles starting from TS. The first one (a) causes TS to happen with T3 if there is a pending bus request (and no HOLD). The second AND gate (b) starts a bus cycle if the bus is not currently active and there is a bus request or a CORR micro-instruction. The flip-flop causes T0 to follow T3/TS, one clock cycle later.

The next gates (c) generate the T1 state following T0 if there is pending bus activity and the cycle isn't preempted to T3. The AND gate (d) starts the special T1 for the HALT instruction.9 The T2 state follows T1 unless T1 was generated by a HALT (e).

The T3 logic is more complicated. First, T3 will always follow T2 (f). Next, a wait state will cause T3 to remain in T3 (g). Finally, for a preempt, T3 will follow T0 (h) if there is a prefetch and a microcode bus operation (i.e. an instruction specified the bus operation).

Next, I'll explain BUS-ACTIVE, an important signal that indicates if the bus is active or not. The Bus Interface Unit generates the BUS-ACTIVE signal to help control the state machine. The BUS-ACTIVE signal is also widely used in the Bus Interface Unit, controlling many functions such as transfers to and from the address registers. BUS-ACTIVE is generated by the complex circuit below that determines if the bus will be active, specifically in states T0 through T3. Because of the flip-flop, the computation of BUS-ACTIVE happens in the previous clock cycle.

The circuit to determine if the bus will be active next cycle.

The circuit to determine if the bus will be active next cycle.

In more detail, the signal BUS-ACTIVE-PRE indicates if the bus cycle will continue or will end on the next clock cycle. Delaying this signal through the flip-flop generates BUS-ACTIVE, which indicates if the bus is currently active in states T0 through T3. The top AND gate (a) is responsible for starting a cycle or keeping a cycle going (a1). It will allow a new cycle if there is a bus request (without HOLD) (a3). It will also allow a new cycle if there is a CORR micro-instruction prior to the T1 state (even if there is a HOLD, since this "fake" cycle won't use the bus) (a2). Finally, it allows a new cycle for a HALT, using T1-pre (a2).10 Next are the special cases that end a bus cycle. The second AND gate (b) ends the bus cycle after T3 unless there is a wait state or another bus request. (But a HOLD will block the next bus request.) The remaining gates end the cycle after T0 to preempt a prefetch if a CORR or SUSP micro-instruction occurs (d), or end after T1 for a HALT (e).

The BUS-ACTIVE circuit above uses a complex gate, a 5-input NOR gate fed by 5 AND gates with two attached OR gates. Surprisingly, this is implemented in the processor as a single gate with 14 inputs. Due to how gates are implemented with NMOS transistors, it is straightforward to implement this as a single gate. The inverter and NOR gate on the left, however, needed to be implemented separately, as they involve inversion; an NMOS gate must have a single inversion.

The bus state machine circuitry on the die.

The bus state machine circuitry on the die.

The diagram above shows the layout of the bus state machine circuitry on the die, zooming in on the top region of the die. The metal layer has been removed to expose the underlying silicon and polysilicon. The layout of each flip-flop is completely different, since the layout of each transistor is optimized to its surroundings. (This is in contrast to later processors such as the 386, which used standard-cell layout.) Even though the state machine consists of just a handful of flip-flops and gates, it takes a noticeable area on the die due to the large 3.2 Β΅m feature size of the 8088. (Modern processors have features measured in nanometers, not micrometers.)

Conclusions

The bus state machine is an example of how the 8088's design consists of complications on top of complications. While the four-state bus cycle seems straightforward at first, it gets more complicated due to prefetching, wait states, the HALT instruction, and the bus hold feature, not to mention the interactions between these features. While there were good motivations behind these features, they made the processor considerably more complicated. Looking at the internals of the 8088 gives me a better understanding of why simple RISC processors became popular.

The bus state machine is a key part of the read and write circuitry, moving the bus operation through the necessary T-states. However, the state machine is not the only component in this process; a higher-level circuit decides when to perform a read, write, or prefetch, as well as breaking a 16-bit operation into two 8-bit operations.11 These circuits work together with the higher-level circuit telling the state machine when to go through the states.

In my next blog post, I'll describe the higher-level memory circuit so follow me on Twitter @kenshirriff or RSS for updates. I'm also on Mastodon as oldbytes.space@kenshirriff. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process, and the 8086 registers earlier.

Notes and references

  1. The 8085 and 8088 processors both use a 4-step bus cycle for instruction fetching. For other reads and writes, the 8085's bus cycle has three steps compared to four for the 8088. Thus, the 8085 and 8088 bus cycles are similar but not an exact match. ↩

  2. The 8088 has separate instructions to read or write an I/O device. From the bus perspective, there's no difference between an I/O operation and a memory operation except that a pin on the chip indicates if the operation is for memory or I/O.

    The 8088 supports I/O operations for historical reasons, going back through the 8086, 8080, 8008, and the Datapoint 2200 system. In contrast, many other contemporary processors such as the 6502 used memory-mapped I/O, using standard memory accesses for I/O devices.

    The 8086 has a pin M/IO that is high for a memory access and low for an I/O access. External hardware uses this pin to determine how to handle the request. Confusingly, the pin's function is inverted on the 8088, providing IO/M. One motivation behind the 8088's 8-bit bus was to allow reuse of peripherals from the earlier 8-bit 8085 processor. Thus, the pin's function was inverted so it matched the 8085. (The pin is only available when the 8086/8088 is used in "minimum mode"; "maximum mode" remaps some of the pins, making the system more complicated but providing more control.) ↩

  3. I've made the timing diagram somewhat idealized so actions line up with the clock. In the real datasheet, all the signals are skewed by various amounts so the timing is more complicated. See the datasheet for pages of timing constraints on exactly when signals can change. ↩

  4. For more information on the implementation of the address adder, see my previous blog post. ↩

  5. The POP operation is an example of how the address adder updates a memory pointer. In this case, the stack address is moved from the Stack Pointer to the IND register in order to perform the memory read. As part of the read operation, the IND register is incremented by 2. The address is then moved from the IND register to the Stack Pointer. Thus, the address adder not only performs the segment arithmetic, but also computes the new value for the SP register.

    Note that the increment/decrement of the IND register happens after the memory operation. For stack operations, the SP must be decremented before a PUSH and incremented after a POP. The adder cannot perform a predecrement, so the PUSH instruction uses the ALU (Arithmetic/Logic Unit) to perform the decrement. ↩

  6. During the CORR micro-instruction, the Bus Interface Unit performs special TS and T0 states. Note that these states don't have any external effect, so they are invisible outside the processor. ↩

  7. The tradeoff with memory boards was that slower RAM chips were cheaper. The better RAM boards advertised "no wait states", but cheaper boards would add one or more wait states to every access, reducing performance. ↩

  8. Only the second half of the TS state has an effect on the Bus Interface Unit, so TS is not a full state like the other states. Specifically, a delayed TS signal is taken from the first half of the T0 flip-flop, and this signal is used to control various actions in the Bus Interface Unit. (Alternatively, you could think of this as an early T0 state.) This is why there isn't a separate flip-flop for the TS state. I suspect this is due to timing issues; by the time the TS state is generated by the logic, there isn't enough time to do anything with the state in that half clock cycle, due to propagation delays. ↩

  9. There is a bit more circuitry for the T1 state for a HALT. Specifically, there is a flip-flop that is set on this signal. On the next cycle, this flip-flop both blocks the generation of another T1 state and blocks the previous T1 state from progressing to T2. In other words, this flip-flop makes sure the special T1 lasts for one cycle. However, a HOLD state resets this flip-flop. That allows another special T1 to be generated when the HOLD ends. ↩

  10. The trickiest part of this circuit is using T1-pre to start a (short) cycle for HALT. The way it works is that the T1-pre signal only makes a difference if there isn't a bus cycle already active. The only way to get an "unexpected" T1-pre signal is if the state machine generates it for the first cycle of a HALT. Thus, the HALT triggers T1-pre and thus the bus-active signal. You might wonder why the bus-active uses this roundabout technique rather than getting triggered directly by HALT. The motivation is that the special T1 state for HALT requires the AND of three signals to ensure that the state is generated once for the HALT rather than continuously, but happens again after a HOLD, and waits until the current bus cycle is done. Instead of duplicating that AND gate, the circuit uses T1-pre which incorporates that logic. (This took me a long time to figure out.) ↩

  11. The 8088 has a 16-bit bus, compared to the 8088's 8-bit bus. Thus, a 16-bit bus operation on the 8088 will always require two 8-bit operations, while the 8086 can usually perform this operation in a single step. However, a 16-bit bus operation on the 8086 will still need to be broken into two 8-bit operations if the address is unaligned (i.e. odd). ↩

The Intel 8088 processor's instruction prefetch circuitry: a look inside

23 March 2024 at 03:20

In 1979, Intel introduced the 8088 microprocessor, a variant of the 16-bit 8086 processor. IBM's decision to use the 8088 processor in the IBM PC (1981) was a critical point in computer history, leading to the dominance of the x86 architecture that continues to the present.1 One way that the 8086 and 8088 increased performance was by prefetching: the processor fetches instructions from memory before they are needed, so the processor can execute them without waiting on the relatively slow memory. I've been reverse-engineering the 8088 from die photos and this blog post discusses what I've uncovered about the prefetch circuitry.

The die photo below shows the 8088 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; this article focuses on the prefetch queue components highlighted in red. The components in purple also play a role, and will be discussed below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions. In particular, the BIU fetches instructions, which are transferred from the prefetch queue to the Execution Unit via the queue bus.

The 8088 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8088 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

The 8086 and 8088 processors present the same 16-bit architecture to the programmer. The key difference is that the 8088 has an 8-bit data bus for communication with memory and I/O, rather than the 16-bit bus of the 8086. The 8088's narrower bus reduced performance, since the processor only transfers one byte at a time rather than two. However, the 8-bit bus enabled cheaper computer hardware. The 8-bit bus was also a better match for hardware based on the older but popular 8-bit Intel 8080 and 8085 processors, allowing the reuse of 8-bit I/O circuitry for instance. Much of the IBM PC was based on the little-known IBM DataMaster, a computer built around the Intel 8085. Thus, selecting the 8088 processor was a natural choice for the IBM PC.

For the most part, the 8086 and 8088 are very similar internally, apart from trivial but numerous layout changes on the die. The biggest differences are in the Bus Interface Unit, the circuitry that communicates with memory and I/O devices, since this circuitry handles 16 bits in the 8086 versus 8 bits in the 8088. There are a few microcode differences between the two chips. One interesting change is that for performance reasons the 8088 has a smaller prefetch queue than the 8086 (four bytes instead of six). (I wrote about the 8086's prefetch circuity earlier.)

Prefetching and the architecture of the 8086 and 8088

The 8086 and 8088 were introduced at an interesting point in microprocessor history, when memory was becoming slower than the CPU. For the first microprocessors, the speed of the CPU and the speed of memory were comparable.2 However, as processors became faster, the speed of memory failed to keep up. The 8086 was probably the first microprocessor to prefetch instructions to improve performance. While modern microprocessors have megabytes of fast cache3 to act as a buffer between the CPU and much slower main memory, the 8088 has just 4 bytes of prefetch queue. However, this was enough to substantially increase performance.

Prefetching had a major impact on the design of the 8086 and thus the 8088. Earlier processors such as the 6502, 8080, or Z80 were deterministic: the processor fetched an instruction, executed the instruction, and so forth. Memory accesses corresponded directly to instruction fetching and execution and instructions took a predictable number of clock cycles. This all changed with the introduction of the prefetch queue. Memory operations became unlinked from instruction execution since prefetches happen as needed and when the memory bus is available.

To handle memory operations and instruction execution independently, the implementors of the 8086 and 8088 divided the processors into two processing units: the Bus Interface Unit (BIU) that handles memory accesses, and the Execution Unit (EU) that executes instructions. The Bus Interface Unit contains the instruction prefetch queue; it supplies instructions to the Execution Unit via the Q (queue) bus. The BIU also contains an adder (Ξ£) for address calculation, adding the segment register base to an address offset, among other things. The Execution Unit is what comes to mind when you think of a processor: it has most of the registers, the arithmetic/logic unit (ALU), and the microcode that implements instructions. The segment registers (CS, DS, SS, ES) and the Instruction Pointer (IP) are in the Bus Interface Unit since they are directly involved in memory accesses, while the general-purpose registers are in the Execution Unit.

Block diagram of the 8088 processor.
This diagram differs from most 8088 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor.
The "Internal Communication Registers" consist of the Indirect Register (IND) and the Operand Register (OPR). These hold a memory address and memory data value respectively.
From The 8086 Family User's Manual page 243.

Block diagram of the 8088 processor. This diagram differs from most 8088 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor. The "Internal Communication Registers" consist of the Indirect Register (IND) and the Operand Register (OPR). These hold a memory address and memory data value respectively. From The 8086 Family User's Manual page 243.

It may seem inefficient for the Bus Interface Unit to have its own adder instead of using the ALU, but there are reasons for the separate adder. First, every memory access uses the adder at least once to add the segment base and offset. The adder is also used to increment the PC or index registers. Since these operations are so frequent, they would create a bottleneck if they used the ALU. Second, since the Execution Unit and the Bus Interface Unit run asynchronously with respect to each other, it would be complicated to share the ALU without conflicts.

Prefetching had another major but little-known effect on the 8086 architecture: the designers were considering making the 8086 a two-chip microprocessor. Prefetching, however, required a one-chip design because the number of control signals required to synchronize prefetching across two chips exceeded the package pins available. This became a compelling argument for the one-chip design that was used for the 8086.4 (The unsuccessful Intel iAPX 432, which was under development at the same time, ended up being a two-chip processor: one to fetch and decode instructions, and one to execute them.)

Implementing the queue

The 8088's instruction prefetch queue is implemented with four 8-bit queue registers along with two hardware "pointers" into the queue. One two-bit counter keeps track of the current read position from 0 to 3, i.e. the queue register that will provide the next instruction byte. The second counter keeps track of the current write position, i.e. the queue register that will receive the next instruction from memory.5 As bytes are fetched from the queue, the read pointer advances. As bytes are added to the queue, the write pointer advances.

The diagram below shows an example queue configuration with two prefetched bytes. The middle two queue registers (Q1 and Q2) hold data. The read pointer indicates that the Execution Unit will get its next byte from Q1. The write pointer indicates that the next prefetched byte will go into Q3.

A queue configuration with two bytes in the prefetch queue. Bytes in blue hold prefetched data.

A queue configuration with two bytes in the prefetch queue. Bytes in blue hold prefetched data.

The diagram below shows how the queue pointers can wrap around. In this configuration, two more bytes have been written to the queue (Q3 and Q0), so the queue is full. The write pointer now points to Q1, the same as the read pointer.

A queue configuration with four bytes in the prefetch queue.

A queue configuration with four bytes in the prefetch queue.

There is an important ambiguity, however. Suppose that four bytes are read from the queue, so the read pointer advances four positions, wrapping around back to Q1. The queue is now empty, as shown below, but the pointers have the same position as the full case above. Thus, if the read pointer and the write pointer both point to the same position, the queue may be empty or full. To distinguish these cases, a flip-flop is set if the queue enters the empty state. This flip-flop generates a signal that Intel called MT (empty).

A queue configuration with the queue empty.

A queue configuration with the queue empty.

To determine how many bytes are in the queue, the queue circuitry uses a two-bit queue length value, along with the MT flip-flop value to distinguish the empty state. Conceptually, the queue length is generated by subtracting the read position from the write position. However, the implementation does not use a standard subtraction circuit, but instead uses hardcoded logic to determine the two bits of the length, as shown below.

The circuitry to determine the queue length.

The circuitry to determine the queue length.

The low bit of the length is the XOR of the two positions. In NMOS logic (used by the 8088), an AND-NOR gate is easy to implement, while an XOR gate is difficult. Thus, XOR is implemented as shown in the top circuit. (You can verify that if one input is 1 and the other is 0, the output is 1.) The high-order bit of the length is also based on an AND-NOR gate, one with six inputs. Each input is a combination of read and write positions that yields an output bit 1; each input is computed by a NOR gate, which I haven't drawn.6 As a result, the amount of logic circuitry to compute the length is fairly large.

The diagram below zooms in on the queue control circuitry on the die, with the main flip-flops and circuitry labeled. The circuitry in the middle computes the queue length with the 6-input NOR gate stretched across the whole region. The flip-flops for the read and write positions are in the lower region. Despite the relative simplicity of the queue circuits, they take up a substantial part of the die. Compared to modern chips, the density of the 8088 is very low; you can almost see the flip-flops with the naked eye. But this isn't all the circuitry as prefetching also required queue registers and memory cycle control circuitry. Thus, prefetching was a moderately expensive feature for the 8088, as far as die area.

The queue and prefetch circuitry on the die. The metal layer has been removed for the closeup to show the silicon of the underlying transistors.

The queue and prefetch circuitry on the die. The metal layer has been removed for the closeup to show the silicon of the underlying transistors.

The loader

To decode and execute an instruction, the Execution Unit must get instruction bytes from the Bus Interface Unit, but this is not entirely straightforward. The main problem is that the queue can be empty, in which case instruction decoding must block until a byte is available from the queue. The second problem is that instruction decoding is relatively slow so it is pipelined. For maximum performance, the decoder needs a new byte before the current instruction is finished. A circuit called the "loader" solves these problems by providing synchronization between the prefetch queue and the instruction decoder. The loader uses a small state machine to efficiently fetch bytes from the queue at the right time and to provide timing signals to the decoder and microcode engine.

In more detail, as the loader requests the first two instruction bytes from the prefetch queue, it generates two timing signals that control the microcode execution. The FC (First Clock) indicates that the first instruction byte is available, while the SC (Second Clock) indicates the second instruction byte. Note that the First Clock and Second Clock are not necessarily consecutive clock cycles because the prefetch queue could be empty or contain just one byte, in which case the First Clock and/or Second Clock would be delayed. The instruction decoding circuitry and the microcode engine are controlled by the First Clock and Second Clock signals, so they remain synchronized with the bytes supplied by the prefetch queue.

At the end of a microcode sequence, the Run Next Instruction (RNI) micro-operation causes the loader to fetch the next machine instruction. However, fetching and decoding the next instruction is a bit slow so microcode execution would be blocked for a cycle. In many cases, this slowdown can be avoided: if the microcode knows that it is one micro-instruction away from finishing, it issues a Next-to-last (NXT) micro-operation so the loader can start loading the next instruction. This achieves a degree of pipelining in most cases; fetching the next instruction is overlapped with finishing the execution of the previous instruction.

The state machine for the 8086/8088 "loader" circuit.
The 1BL signal indicates a 1-byte instruction implemented in logic rather than microcode.
From patent US4449184.

The state machine for the 8086/8088 "loader" circuit. The 1BL signal indicates a 1-byte instruction implemented in logic rather than microcode. From patent US4449184.

The diagram above shows the state machine for the loader. I won't explain it in detail, but essentially it keeps track of whether it is waiting for a First Clock byte or a Second Clock byte, and if it is performing a fetch in advance (NXT) or at the end of an instruction (RNI). The state machine is implemented with two flip-flops to support its four states.

Microcode and the prefetch queue

The loader takes care of fetching an instruction that consists of an opcode byte and a Mod R/M (addressing mode) byte. However, many instructions have additional bytes or don't follow this format For example, an opcode such as "ADD AX" can be followed by an 8- or 16-bit immediate value, adding that value to the AX register. Or a "move memory to AX" instruction can be followed by a 16-bit memory address The microcode uses a separate mechanism for fetching these instruction bytes from the queue. Specifically, each micro-instruction contains a source register and a destination register that specify a data move. By specifying "Q" (the queue) as the source, a byte is fetched from the prefetch queue. If the queue is empty, microcode execution blocks until the Bus Interface Unit loads a byte into the prefetch queue. Thus, the complexity of instruction fetching and the prefetch queue is invisible to the microcode.7

A jump, subroutine call, or other control flow change causes the prefetch queue to be flushed since the queue contents are no longer useful. This is accomplished in microcode with the FLUSH micro-instruction, which resets the queue read and write pointers and sets the MT (empty) flip-flop. Note that the queue is flushed even if the target address is in the queue, for example if you jump one byte ahead.

One complication due to the prefetch queue is that the processor's Instruction Pointer points to the next instruction to be fetched, not the next instruction to be executed. This becomes a problem for a subroutine call, which needs to push the return address. It is also a problem for a relative jump, which is computed from the current instruction. The solution is the CORR micro-instruction, which corrects the Instruction Pointer by subtracting the queue length to determine the current execution position. This is implemented by the Bus Interface Unit, which holds correction constants in the Constant ROM, and subtracts them using the address adder (not the ALU).8

The queue registers

The 8086 and 8088 partition the registers into upper registers (in the Bus Interface Unit) and lower registers (in the Execution Unit). The upper registers are the registers associated with memory accesses (e.g. Instruction Pointer, segment registers) while the lower registers are more general purpose (e.g. AX, BX, SI, SP). The upper registers are connected to two 16-bit internal buses: the B bus and the C bus.

The queue registers are physically part of the upper registers, but are wired into the buses slightly differently, as shown below. In particular, the 8088's queue registers are written 8 bits at a time from the C bus. (In contrast, the 8086's queue registers can be written 16 bits at a time to support two-byte prefetches.) When accessing the queue, the queue registers are read 16 bits at a time, but only one byte is transferred to the Q bus for instruction processing.9

The queue registers in the 8088.

The queue registers in the 8088.

The diagram below shows how the queue registers appear on the die, comparing the six-byte prefetch queue in the 8086 (top) to the four-byte 8088 queue (bottom). The 8086 prefetch registers are structured as three rows of 16-bit registers, while the 8088 prefetch registers are structured as four rows of 8-bit registers. In both cases, each bit is stored in a cross-coupled pair of inverters. The bit lines (not present) are vertical, while the control lines to select a register are horizontal. The layout is different between the processors to support 16-bit versus 8-bit writes. Note the empty space at the bottom of the 8088 registers. Because the rest of the chips are mostly the same, the 8088 couldn't be "compacted" to avoid this wasted space.

The prefetch registers in the 8086 (top) and 8088 (bottom). For the 8086, the metal and polysilicon layers were removed, exposing the underlying silicon. For the 8088, the polysilicon and silicon are visible.

The prefetch registers in the 8086 (top) and 8088 (bottom). For the 8086, the metal and polysilicon layers were removed, exposing the underlying silicon. For the 8088, the polysilicon and silicon are visible.

Intel used simulations to determine the best queue sizes for the 8086 and 8088, balancing the performance cost of prefetching against the benefit. (The cost is that prefetching makes the bus unavailable for other memory or I/O operations.) The prefetch queue is discarded on a jump instruction or other change of control flow, causing the prefetched bytes to be wasted. Thus, as the queue gets longer, the chance of discarding a prefetched byte becomes larger, so the potential benefit of prefetching becomes smaller. Since the 8088 prefetches one byte at a time, compared to two bytes at a time on the 8086, prefetching on the 8088 costs twice as much as on the 8086 in terms of bus cycles used per byte. This changes the tradeoffs in favor of a shorter queue.

Because of the difference in queue lengths, the queue control circuitry is different between the 8086 and 8088. In particular, the 8086 needs three-bit counters for the read and write positions, while the 8088 uses two-bit counters. Because of this, the length computation circuitry is also different between the processors.

I plan to continue reverse-engineering the 8088 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. Whenever I mention x86's domination of the computing market, people bring up ARM, but ARM has a lot more market share in people's minds than in actual numbers. One research firm says that ARM has 15% of the laptop market share in 2023, expected to increase to 25% by 2027. (Surprisingly, Apple only has 90% of the ARM laptop market.) In the server market, just an estimated 8% of CPU shipments in 2023 were ARM. See Arm-based PCs to Nearly Double Market Share by 2027 and Digitimes. (Of course, mobile phones are almost entirely ARM.) ↩

  2. Steve Furber, co-creator of the ARM chip, mentions that "The first integrated CPUs were coincidentally quite well matched to semiconductor memory speeds, and were therefore built without caches. This can now be seen as a temporary aberration." See VLSI Risc Architecture and Organization p77. To make this concrete, the Apple II (1977) used a MOS 6502 processor running at about 1 megahertz while its 4116 DRAM chips could perform an access in 250 nanoseconds (4 times the clock speed). The 8088 processor ran at 5-10 MHz which meant that 250 ns DRAM chips were slower than the clock speed. Nowadays, processors run at 4 GHz but DRAM access speed is about 50 nanoseconds (1/200 the clock speed). ↩

  3. Modern processors use caches to improve memory performance. Accessing data from a cache is faster than accessing it from main memory, but the tradeoff is that caches are much smaller than main memory. The prefetch queues in the 8086 and 8088 are similar to a cache in some ways, but there are some key differences. First, the prefetch queue is strictly sequential. If you jump ahead two bytes, even if the prefetch queue has those instruction bytes, the processor can't use them. Second, the prefetch queue can't reuse bytes. If you have a 6-byte loop, even though all the code fits in the prefetch queue, it will be reloaded every time. Third, the prefetch queue doesn't provide any consistency. If you modify an instruction in memory a couple of bytes ahead of the PC, the 8086 or 8088 will run the old instruction if it's in the queue. ↩

  4. The design decisions for the 8086 prefetch cache (and many other aspects of the chip) are described in: J. McKevitt and J. Bayliss, "New options from big chips," in IEEE Spectrum, vol. 16, no. 3, pp. 28-34, March 1979, doi: 10.1109/MSPEC.1979.6367944. Prefetch provided a 50% performance benefit to the 8086. ↩

  5. The queue read process doesn't use an explicit read operation. Instead, the selected queue register continuously puts its value onto the queue bus. When the Execution Unit uses this byte, it sends an increment signal to the queue to advance the read pointer. If the queue empty (MT) flip-flop is set, the Execution Unit will wait until a byte is ready. ↩

  6. The NOR gates are used as AND gates, following DeMorgan's laws. For example to produce a 1 output for write position 00 and read position 01, the logic is: NOR(write bit 1', write bit 0', read bit 1', read bit 0). Note that the bits into the NOR gate are all inverted from the "desired" values; if they are all 0, the NOR output is 1. Thus, there are also some inverters on the inputs. ↩

  7. Arbitrary memory reads and writes are performed directly on memory, bypassing the prefetch queue. The 8086/8088 do not provide consistency; if you modify an instruction byte in memory and the byte is in the queue, the processor will execute the old byte. (This type of self-modifying code can be used to determine the queue length, distinguishing the 8086 from the 8088 in software.) ↩

  8. The Constant ROM is used for more than just address correction. For example, it is also used to increment the Instruction Pointer after a prefetch. Other constants are used for the 8088's string operations, which act on a block of memory. The index registers are incremented or decremented by 1 for bytes or 2 for words. When popping a value from the stack, the stack pointer is decremented using the Constant ROM. ↩

  9. Are the 8088's queue registers 16 bits wide or 8 bits wide? It's ambiguous, since the registers are written 8 bits at a time, but read 16 bits at a time. This implementation was probably selected to support the 8088's 8-bit bus while reusing as much of the 8086 design as possible. In particular, the 8088 can only prefetch one byte at a time, so writes need to happen a byte at a time. Thus, there are four control lines selecting which queue byte is written. (The 8088 could write to half of a 16-bit register but that would require moving the prefetched byte to the correct half of a 16-bit bus.) On the read side, it would make sense to have four read lines, selecting one byte from the 8088's queue. However, since the 8086 already had a multiplexer to select one byte from two, the 8088 designers probably felt it was easier to keep that circuit. And with the smaller queue on the 8088, there was no need to try to save space by removing the circuit. Thus, the queue has two read-select lines and a multiplexer control line. All these lines are controlled by the write position and read position flip-flops. ↩

How flip-flops are implemented in the Intel 8086 processor

30 September 2023 at 16:03
A key concept for a processor is the management of "state", information that persists over time. Much of a computer is built from logic gates, such as NAND or NOR gates, but logic gates have no notion of time. Processors also need a way to hold values, along with a mechanism to move from step to step in a controlled fashion. This is the role of "sequential logic", where the output depends on what happened before. Sequential logic usually operates off a clock signal,1 a sequence of regular pulses that controls the timing of the computer. (If you have a 3.2 GHz processor, for instance, that number is the clock frequency.)

A circuit called the flip-flop is a fundamental building block for sequential logic. A flip-flop can hold one bit of state, a "0" or a "1", changing its value when the clock changes. Flip-flops are a key part of processors, with multiple roles. Several flip-flops can be combined to form a register, holding a value. Flip-flops are also used to build "state machines", circuits that move from step to step in a controlled sequence. A flip-flops can also delay a signal, holding it from from one clock cycle to the next.

Intel introduced the groundbreaking 8086 microprocessor in 1978, starting the x86 architecture that is widely used today. In this blog post, I take a close look at the flip-flops in the 8086: what they do and how they are implemented. In particular, I will focus on the dynamic flip-flop, which holds its value using capacitance, much like DRAM.2 Many of these flip-flops use a somewhat unusual "enable" input, which allows the flip-flop to hold its value for multiple clock cycles.

The 8086 die under the microscope, with the main functional blocks.
I count 184 flip-flops with enable and 53 without enable.
Click this image (or any other) for a larger version.

The 8086 die under the microscope, with the main functional blocks. I count 184 flip-flops with enable and 53 without enable. Click this image (or any other) for a larger version.

The die photo above shows the silicon die of the 8086. In this image, I have removed the metal and polysilicon layers to show the silicon transistors underneath. The colored squares indicate the flip-flops: blue flip-flops have an enable input, while red lack enable. Flip-flops are used throughout the processor for a variety of roles. Around the edges, they hold the state for output pins. The control circuitry makes heavy use of flip-flops for various state machines, such as moving through the "T states" that control the bus cycle. The "loader" uses a state machine to start each instruction. The instruction register, along with some special-purpose registers (N, M, and X) are built with flip-flops. Other flip-flops track the instructions in the prefetch queue. The microcode engine uses flip-flops to hold the current microcode address as well as to latch the 21-bit output from the microcode ROM. The ALU (Arithmetic/Logic Unit) uses flip-flops to hold the status flags, temporary input values, and information on the operation.

The flip-flop circuit

In this section, I'll explain how the flip-flop circuits work, starting with a basic D flip-flop. The D flip-flop (below) takes a data input (D) and stores that value, 0 or 1. The output is labeled Q, while the inverted output is called Q (Q-bar). This flip-flop is "edge triggered", so the storage happens on the edge when the clock changes from low to high.4 Except at this transition, the input can change without affecting the output.

The symbol for a D flip-flop.

The symbol for a D flip-flop.

The 8086 implements most of its flip-flops dynamically, using pass transistor logic. That is, the capacitance of the wiring (in particular the transistor gate) holds the 0 or 1 state. The dynamic implementation is more compact than the typical static flip-flop implementation, so it is often used in processors. However, the charge on the capacitance will eventually leak away, just like DRAM (dynamic RAM). Thus, the clock must keep going or the values will be lost.3 This behavior is different from a typical flip-flop chip, which will hold its value until the next clock, whether that is a microsecond later or a day later.

The D flip-flop is built from two latch5 stages, each consisting of a pass transistor and an inverter.6 The first pass transistor passes the input value through while the clock is low. When the clock switches high, the first pass transistor turns off and isolates the inverter from the input, but the value persists due to the capacitance (blue arrow). Meanwhile, the second pass transistor switches on, passing the value from the first inverter through the second inverter to the output. Similarly, when the clock switches low, the second transistor switches off but the value is held by capacitance at the green arrow. (The circuit does not need an explicit capacitor; the wiring has enough capacitance to hold the value.) Thus, the output holds the value of the D input that was present at the moment when the clock switched from low to high. Any other changes to the D input do not affect the output.

Schematic of a D flip-flop built from pass transistor logic.

Schematic of a D flip-flop built from pass transistor logic.

The basic flip-flop can be modified by adding an "enable" input that enables or blocks the clock.7 When the enable input is high, the flip-flop records the D input on the clock edge as before, but when the enable input is low, the flip-flop holds its previous value. The enable input allows the flip-flop to hold its value for an arbitrarily long period of time.

The symbol for the D flip-flop with enable.

The symbol for the D flip-flop with enable.

The enable flip-flop is constructed from a D flip-flop by feeding the flip-flop's output back to the input as shown below. When the enable input is 0, the multiplexer selects the current Q output as the new flip-flop D input, so the flip-flop retains its previous value. But when the enable input is 1, the multiplexer selects the new D value. (You can think of the enable input as selecting "hold" versus "load".)

Block diagram of a flip-flop with an enable input.

Block diagram of a flip-flop with an enable input.

The multiplexer is implemented with two more pass transistors, as shown on the left below.8 When enable is low, the upper pass transistor switches on, passing the current Q output back to the input. When enable is high, the lower pass transistor switches on, passing the D input through to the flip-flop. The schematic below also shows how the inverted Q' output is provided by the first inverter. The circuit "cheats" a bit; since the inverted output bypasses the second transistor, this output can change before the clock edge.

Schematic of a flip-flop with an enable input.

Schematic of a flip-flop with an enable input.

The flip-flops often have a set or clear input, setting the flip-flop high or low. This input is typically connected to the processor's "reset" line, ensuring that the flip-flops are initialized to the proper state when the processor is started. The symbol below shows a flip-flop with a clear input.

The symbol for the D flip-flop with enable and clear inputs.

The symbol for the D flip-flop with enable and clear inputs.

To support the clear function, a NOR gate replaces the inverter as shown below (red). When the clear input is high, it forces the output from the NOR gate to be low. Note that the clear input is asynchronous, changing the Q output immediately. The inverted Q output, however, doesn't change until clk is high and the output cycles around. A similar modification implements a set input that forces the flip-flop high: a NOR gate replaces the first inverter.

This schematic shows the circuitry for the clear flip-flop.

This schematic shows the circuitry for the clear flip-flop.

Implementing a flip-flop in silicon

The diagram below shows two flip-flops as they appear on the die. The bright gray regions are doped silicon, the bottom layer of the chip The brown lines are polysilicon, a layer on top of the silicon. When polysilicon crosses doped silicon, a transistor is formed with a polysilicon gate. The black circles are vias (connections) to the metal layer. The metal layer on top provides wiring between the transistors. I removed the metal layer with acid to make the underlying circuitry visible. Faint purple lines remain on the die, showing where the metal wiring was.

Two flip-flops on the 8086 die.

Two flip-flops on the 8086 die.

Although the two flip-flops have the same circuitry, their layouts on the die are completely different. In the 8086, each transistor was carefully shaped and positioned to make the layout compact, so the layout depends on the surrounding logic and the connections. This is in contrast to modern standard-cell layout, which uses a standard layout for each block (logic gate, flip-flop, etc.) and puts the cells in orderly rows. (Intel moved to standard-cell wiring for much of the logic in the the 386 processor since it is much faster to create a standard-cell design than to perform manual layout.)

Conclusions

The flip-flop with enable input is a key part of the 8086, appearing throughout the processor. However, the enable input is a fairly obscure feature for a flip-flop component; most flip-flop chips have a clock input, but not an enable.9 Many FPGA and ASIC synthesis libraries, though, provide it, under the name "D flip-flop with enable" or "D flip-flop with clock enable".

I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space so you can follow me there too.

Notes and references

  1. Some early computers were asynchronous, such as von Neumann's IAS machine (1952) and its numerous descendants. In this machine, there was no centralized clock. Instead, a circuit such as an adder would send a pulse to the next circuit when it was done, triggering the next circuit in sequence. Thus, instruction execution would ripple through the computer. Although almost all later computers are synchronous, there is active research into asynchronous computing which is potentially faster and lower power. ↩

  2. I'm focusing on the dynamic flip-flops in this article, but I'll mention that the 8086 has a few latches built from cross-coupled NOR gates. Most 8086 registers use cross-coupled inverters (static memory cells) rather than flip-flops to hold bits. I explained the 8086 processor's registers in this article. ↩

  3. Dynamic circuitry is why the 8086 and many other processors have minimum clock speeds: if the clock is too slow, signals will fade away. For the 8086, the datasheet specifies a maximum clock period of 500 ns, corresponding to a minimum clock speed of 2 megahertz. The CMOS version of the Z80 processor, however, was designed so the clock could be slowed or even stopped. ↩

  4. Some flip-flops in the 8086 use the inverted clock, so they transition when the clock switches from high to low. Thus, there are two sets of transitions in the 8068 for each clock cycle. ↩

  5. The terminology gets confusing between flip-flops and latches, which sometimes refer to the same thing and sometimes different things. The term "latch" is often used for a flip-flop that operates on the clock level, not the clock edge. That is, when the clock input is high, the input passes through, and when the clock input is low, the value is retained. Confusingly, the clock for a latch is often called "enable". This is different from the enable input that I'm discussing, which is separate from the clock. ↩

  6. I asked an Intel chip engineer if they designed the circuitry in the 8086 era in terms of flip-flops. He said that they typically designed the circuitry in terms of the underlying pass transistors and gates, rather than using the flip-flop as a fundamental building block. ↩

  7. You might wonder why the clock and enable are separate inputs. Why couldn't you just AND them together so when enable is low, it will block the clock and the flip-flop won't transition? That mostly works, but three factors make it a bad idea. First, the idea of using a clock is so everything changes state at the same time. If you start putting gates in the clock path, the clock gets a bit delayed and shifts the timing. If the delay is too large, the input value might change before the flip-flop can latch it. Thus, putting gates in the clock path is frowned upon. The second factor is that combining the clock and enable signals risks race conditions. For instance, suppose that the enable input goes low and high while the clock remains high. If you AND the two signals together, this will yield a spurious clock edge, causing the flip-flop to latch its input a second time. Finally, if you block the clock for too long, a dynamic flip-flop will lose its value. (Note that the flip-flop circuit used in the 8086 will refresh its value on each clock even if the enable input is held low for a long period of time.) ↩

  8. A multiplexer can be implemented with logic gates. However, it is more compact to implement it with pass transistors. The pass transistor implementation takes four transistors (two fewer if the inverted enable signal is already available). A logic gate implementation would take about nine transistors: an AND-OR-INVERT gate, an inverter on the output, and an inverter for the enable signal. ↩

  9. The common 7474 is a typical TTL flip-flop that does not have an enable input. Chips with an enable are rarer, such as the 74F377. Strangely, one manufacturer of the 74HC377 shows the enable as affecting the output; I think they simply messed up the schematic in the datasheet since it contradicts the function table.

    Some examples of standard-cell libraries with enable flip-flops: Cypress SoC, Faraday standard cell library, Xilinx Unified Libraries, Infineon PSoC 4 Components, Intel's CHMOS-III cell library (probably used for the 386 processor), and Intel Quartus FPGA. ↩

Tracing the roots of the 8086 instruction set to the Datapoint 2200 minicomputer

12 August 2023 at 16:26

The Intel 8086 processor started the x86 architecture that is still extensively used today. The 8086 has some quirky characteristics: it is little-endian, has a parity flag, and uses explicit I/O instructions instead of just memory-mapped I/O. It has four 16-bit registers that can be split into 8-bit registers, but only one that can be used for memory indexing. Surprisingly, the reason for these characteristics and more is compatibility with a computer dating back before the creation of the microprocessor: the Datapoint 2200, a minicomputer with a processor built out of TTL chips. In this blog post, I'll look in detail at how the Datapoint 2200 led to the architecture of Intel's modern processors, step by step through the 8008, 8080, and 8086 processors.

The Datapoint 2200

In the late 1960s, 80-column IBM punch cards were the primary way of entering data into computers, although CRT terminals were growing in popularity. The Datapoint 2200 was designed as a low-cost terminal that could replace a keypunch, with a squat CRT display the size of a punch card. By putting some processing power into the Datapoint 2200, it could perform data validation and other tasks, making data entry more efficient. Even though the Datapoint 2200 was typically used as an intelligent terminal, it was really a desktop minicomputer with a "unique combination of powerful computer, display, and dual cassette drives." Although now mostly forgotten, the Datapoint 2200 was the origin of the 8-bit microprocessor, as I'll explain below.

The Datapoint 2200 computer (Version II).

The Datapoint 2200 computer (Version II).

The memory storage of the Datapoint 2200 had a large impact on its architecture and thus the architecture of today's computers. In the 1960s and early 1970s, magnetic core memory was the dominant form of computer storage. It consisted of tiny ferrite rings, threaded into grids, with each ring storing one bit. Magnetic core storage was bulky and relatively expensive, though. Semiconductor RAM was new and very expensive; Intel's first product in 1969 was a RAM chip called the 3101, which held just 64 bits and cost $99.50. To minimize storage costs, the Datapoint 2200 used an alternative: MOS shift-register memory. The Intel 1405 shift-register memory chip provided much more storage than RAM chips at a much lower cost (512 bits for $13.30).1

Intel 1405 shift-register memory chips in metal cans, in the Datapoint 2200.

Intel 1405 shift-register memory chips in metal cans, in the Datapoint 2200.

The big problem with shift-register memory is that it is sequential: the bits come out one at a time, in the same order you put them in. This wasn't a problem when executing instructions sequentially, since the memory provided each instruction as it was needed. For a random access, though, you need to wait until the bits circulate around and you get the one you want, which is very slow. To minimize the number of memory accesses, the Datapoint 2200 had seven registers, a relatively large number of registers for the time.2 The registers were called A, B, C, D, E, H, and L, and these names had a lasting impact on Intel processors.

Another consequence of shift-register memory was that the Datapoint 2200 was a serial computer, operating on one bit at a time as the shift-register memory provided it, using a 1-bit ALU. To handle arithmetic operations, the ALU needed to start with the lowest bit so it could process carries. Likewise, a 16-bit value (such as a jump target) needed to start with the lowest bit. This resulted in a little-endian architecture, with the low byte first. The little-endian architecture has remained in Intel processors to the present.

Since the Datapoint 2200 was designed before the creation of the microprocessor, its processor was built from a board of TTL chips (as was typical for minicomputers at the time). The diagram below shows the processor board with the chips categorized by function. The board has a separate chip for each 8-bit register (B, C, D, etc.) and separate chips for control flags (Z, carry, etc.). The Arithmetic/Logic Unit (ALU) takes about 18 chips, while instruction decoding is another 18 chips. Because every feature required more chips, the designers of the Datapoint 2200 were strongly motivated to make the instruction set as simple as possible. This was necessary since the Datapoint 2200 was a low-cost device, renting for just $148 a month. In contrast, the popular PDP-8 minicomputer rented for $500 a month.

The Datapoint 2200 processor board with registers, flags, and other blocks labeled. Click this image (or any other) for a larger version.

The Datapoint 2200 processor board with registers, flags, and other blocks labeled. Click this image (or any other) for a larger version.

One way that the Datapoint 2200 simplified the hardware was by creating a large set of instructions by combining simpler pieces in an orthogonal way. For instance, the Datapoint 2200 has 64 ALU instructions that apply one of eight ALU operations to one of the eight registers. This requires a small amount of hardwareβ€”eight ALU circuits and a circuit to select the registerβ€”but provides a large number of instructions. Another example is the register-to-register move instructions. Specifying one of eight source registers and one of eight destination registers provides a large, flexible set of instructions to move data.

The Datapoint 2200's instruction format was designed around this principle, with groups of three bits specifying a register. A common TTL chip could decode the group of three bits and activate the desired circuit.3 For instance, a data move instruction had the bit pattern 11DDDSSS to move a byte from the specified source (SSS) to the specified destination (DDD). (Note that this bit pattern maps onto three octal digits very nicely since the source and destination are separate digits.4)

One unusual feature of the Datapoint instruction set is that a memory access was just like a register access. That is, an instruction could specify one of the seven physical registers or could specify a memory access (M), using the identical instruction format. One consequence of this is that you couldn't include a memory address in an instruction. Instead, memory could only be accessed by first loading the address into the H and L registers, which held the high and low byte of the address respectively.5 This is very unusual and inconvenient, since a memory access took three instructions: two to load the H and L registers and one to access memory as the M "register". The advantage was that it simplified the instruction set and the decoding logic, saving chips and thus reducing the system cost. This decision also had lasting impact on Intel processors and how they access memory.

The table below shows the Datapoint 2200's instruction set in an octal table showing the 256 potential opcodes.6 I have roughly classified the instructions as arithmetic/logic (purple), control-flow (blue), data movement (green), input/output (orange), and miscellaneous (yellow). Note how the orthogonal instruction format produces large blocks of related instructions. The instructions in the lower right (green) load (L) a value from a source to a destination. (The no-operation NOP and HALT instructions are special cases.7) In the upper-left are Load operations (LA, etc.) that use an "immediate" byte, a data byte that follows the instruction. They use the same DDD code to specify the destination register, reusing that circuitry.

Β 0123456701234567
0HALTHALTSLCRFCADΒ LARETURNJFCINPUTCFCΒ JMPΒ CALLΒ 
1Β Β SRCRFZACΒ LBΒ JFZΒ CFZΒ Β Β Β Β 
2Β Β Β RFSSUΒ LCΒ JFSEX ADRCFSEX STATUSΒ EX DATAΒ EX WRITE
3Β Β Β RFPSBΒ LDΒ JFPEX COM1CFPEX COM2Β EX COM3Β EX COM4
4Β Β Β RTCNDΒ LEΒ JTCΒ CTCΒ Β Β Β Β 
5Β Β Β RTZXRΒ LHΒ JTZEX BEEPCTZEX CLICKΒ EX DECK1Β EX DECK2
6Β Β Β RTSORΒ LLΒ JTSEX RBKCTSEX WBKΒ Β Β EX BSP
7Β Β Β RTPCPΒ Β Β JTPEX SFCTPEX SBΒ EX REWNDΒ EX TSTOP
0ADAADBADCADDADEADHADLADMNOPLABLACLADLAELAHLALLAM
1ACAACBACCACDACEACHACLACMLBALBBLBCLBDLBELBHLBLLBM
2SUASUBSUCSUDSUESUHSULSUMLCALCBLCCLCDLCELCHLCLLCM
3SBASBBSBCSBDSBESBHSBLSBMLDALDBLDCLDDLDELDHLDLLDM
4NDANDBNDCNDDNDENDHNDLNDMLEALEBLECLEDLEELEHLELLEM
5XRAXRBXRCXRDXREXRHXRLXRMLHALHBLHCLHDLHELHHLHLLHM
6ORAORBORCORDOREORHORLORMLLALLBLLCLLDLLELLHLLLLLM
7CPACPBCPCCPDCPECPHCPLCPMLMALMBLMCLMDLMELMHLMLHALT

The lower-left quadrant (purple) has the bulk of the ALU instructions. These instructions have a regular, orthogonal structure making the instructions easy to decode: each row specifies the operation while each column specifies the source. This is due to the instruction structure: eight bits in the pattern 10AAASSS, where the AAA bits specified the ALU operation and the SSS bits specified the register source. The three-bit ALU code specifies the operations Add, Add with Carry, Subtract, Subtract with Borrow, logical AND, logical XOR, logical OR, and Compare. This list is important because it defined the fundamental ALU operations for later Intel processors.8 In the upper-left are ALU operations that use an "immediate" byte. These instructions use the same AAA bit pattern to select the ALU operation, reusing the decoding hardware. Finally, the shift instructions SLC and SRC are implemented as special cases outside the pattern.

The upper columns contain conditional instructions in blueβ€”Return, Jump, and Call. The eight conditions test the four status flags (Carry, Zero, Sign, and Parity) for either True or False. (For example, JFZ Jumps if the Zero flag is False.) A 3-bit field selects the condition, allowing it to be easily decoded in hardware. The parity flag is somewhat unusual because parity is surprisingly expensive to compute in hardware, but because the Datapoint 2200 operated as a terminal, parity computation was important.

The Datapoint 2200 has an input instruction as well as many output instructions for a variety of specific hardware tasks (orange, labeled EX for external). Typical operations are STATUS to get I/O status, BEEP and CLICK to make sound, and REWIND to rewind the tape. As a result of this decision to use separate I/O instructions, Intel processors still use I/O instructions operating in an I/O space, different from processors such as the MOS 6502 and the Motorola 68000 that used memory-mapped I/O.

To summarize, the Datapoint 2200 has a fairly large number of instructions, but they are generated from about a dozen simple patterns that are easy to decode.9 By combining orthogonal bit fields (e.g. 8 ALU operations multiplied by 8 source registers), 64 instructions can be generated from one underlying pattern.

Intel 8008

The Intel 8008 was created as a clone of the Datapoint 2200 processor.10 Around the end of 1969, the Datapoint company talked with Intel and Texas Instruments about the possibility of replacing the processor board with a single chip. Even though the microprocessor didn't exist at this point, both companies said they could create such a chip. Texas Instruments was first with a chip called the TMX 1795 that they advertised as a "CPU on a chip". Slightly later, Intel produced the 8008 microprocessor. Both chips copied the Datapoint 2200's instruction set architecture with minor changes.

The Intel 8008 chip in its 18-pin package. The small number of pins hampered the performance of the 8008, but Intel was hesitant to even go to the 18-pin package. Photo by Thomas Nguyen, (CC BY-SA 4.0).

The Intel 8008 chip in its 18-pin package. The small number of pins hampered the performance of the 8008, but Intel was hesitant to even go to the 18-pin package. Photo by Thomas Nguyen, (CC BY-SA 4.0).

By the time the chips were completed, however, the Datapoint corporation had lost interest in the chips. They were designing a much faster version of the Datapoint 2200 with improved TTL chips (including the well-known 74181 ALU chip). Even the original Datapoint 2200 model was faster than the Intel 8008 processor, and the Version II was over 5 times faster,11 so moving to a single-chip processor would be a step backward.

Texas Instruments unsuccessfully tried to find a customer for their TMX 1795 chip and ended up abandoning the chip. Intel, however, marketed the 8008 as an 8-bit microprocessor, essentially creating the microprocessor industry. In my view, Intel's biggest innovation with the microprocessor wasn't creating a single-chip CPU, but creating the microprocessor as a product category: a general-purpose processor along with everything customers needed to take advantage of it. Intel put an enormous amount of effort into making microprocessors a success: from documentation and customer training to Intellec development systems, from support chips to software tools such as assemblers, compilers, and operating systems.

The table below shows the opcodes of the 8008. For the most part, the 8008 copies the Datapoint 2200, with identical instructions that have identical opcodes (in color). There are a few additional instructions (shown in white), though. Intel Designer Ted Hoff realized that increment and decrement instructions (IN and DC) would be very useful for loops. There are two additional bit rotate instructions (RAL and RAR) as well as the "missing" LMI (Load Immediate to Memory) instruction. The RST (restart) instructions act as short call instructions to fixed addresses for interrupt handling. Finally, the 8008 turned the Datapoint 2200's device-specific I/O instructions into 32 generic I/O instructions.

Β 0123456701234567
0HLTHLTRLCRFCADIRST 0LAIRETJFCINP 0CFCINP 1JMPINP 2CALINP 3
1INBDCBRRCRFZACIRST 1LBIΒ JFZINP 4CFZINP 5Β INP 6Β INP 7
2INCDCCRALRFSSUIRST 2LCIΒ JFSOUT 8CFSOUT 9Β OUT 10Β OUT 11
3INDDCDRARRFPSBIRST 3LDIΒ JFPOUT 12CFPOUT 13Β OUT 14Β OUT 15
4INEDCEΒ RTCNDIRST 4LEIΒ JTCOUT 16CTCOUT 17Β OUT 18Β OUT 19
5INHDCHΒ RTZXRIRST 5LHIΒ JTZOUT 20CTZOUT 21Β OUT 22Β OUT 23
6INLDCLΒ RTSORIRST 6LLIΒ JTSOUT 24CTSOUT 25Β OUT 26Β OUT 27
7Β Β Β RTPCPIRST 7LMIΒ JTPOUT 28CTPOUT 29Β OUT 30Β OUT 31
0ADAADBADCADDADEADHADLADMNOPLABLACLADLAELAHLALLAM
1ACAACBACCACDACEACHACLACMLBALBBLBCLBDLBELBHLBLLBM
2SUASUBSUCSUDSUESUHSULSUMLCALCBLCCLCDLCELCHLCLLCM
3SBASBBSBCSBDSBESBHSBLSBMLDALDBLDCLDDLDELDHLDLLDM
4NDANDBNDCNDDNDENDHNDLNDMLEALEBLECLEDLEELEHLELLEM
5XRAXRBXRCXRDXREXRHXRLXRMLHALHBLHCLHDLHELHHLHLLHM
6ORAORBORCORDOREORHORLORMLLALLBLLCLLDLLELLHLLLLLM
7CPACPBCPCCPDCPECPHCPLCPMLMALMBLMCLMDLMELMHLMLHLT

Intel 8080

The 8080 improved the 8008 in many ways, focusing on speed and ease of use, and resolving customer issues with the 8008.12 Customers had criticized the 8008 for its small memory capacity, low speed, and difficult hardware interfacing. The 8080 increased memory capacity from 16K to 64K and was over an order of magnitude faster than the 8008. The 8080 also moved to a 40-pin package that made interfacing easier, but the 8080 still required a large number of support chips to build a working system.

Although the 8080 was widely used in embedded systems, it is more famous for its use in the first generation of home computers, boxes such as the Altair and IMSAI. Famed chip designer Federico Faggin said that the 8080 really created the microprocessor; the 4004 and 8008 suggested it, but the 8080 made it real.13

Altair 8800 computer on display at the Smithsonian. Photo by Colin Douglas, (CC BY-SA 2.0).

Altair 8800 computer on display at the Smithsonian. Photo by Colin Douglas, (CC BY-SA 2.0).

The table below shows the instruction set for the 8080. The 8080 was designed to be compatible with 8008 assembly programs after a simple translation process; the instructions have been shifted around and the names have changed.15 The instructions from the Datapoint 2200 (colored) form the majority of the 8080's instruction set. The instruction set was expanded by adding some 16-bit support, allowing register pairs (BC, DE, HL) to be used as 16-bit registers for double add, 16-bit increment and decrement, and 16-bit memory transfers. Many of the new instructions in the 8080 may seem like contrived special casesβ€” for example, SPHL (Load SP from HL) and XCHG (Exchange DE and HL)β€” but they made accesses to memory easier. The I/O instructions from the 8008 have been condensed to just IN and OUT, opening up room for new instructions.

Β 0123456701234567
0NOPLXI BSTAX BINX BINR BDCR BMVI BRLCMOV B,BMOV B,CMOV B,DMOV B,EMOV B,HMOV B,LMOV B,MMOV B,A
1Β DAD BLDAX BDCX BINR CDCR CMVI CRRCMOV C,BMOV C,CMOV C,DMOV C,EMOV C,HMOV C,LMOV C,MMOV C,A
2Β LXI DSTAX DINX DINR DDCR DMVI DRALMOV D,BMOV D,CMOV D,DMOV D,EMOV D,HMOV D,LMOV D,MMOV D,A
3Β DAD DLDAX DDCX DINR EDCR EMVI ERARMOV E,BMOV E,CMOV E,DMOV E,EMOV E,HMOV E,LMOV E,MMOV E,A
4Β LXI HSHLDINX HINR HDCR HMVI HDAAMOV H,BMOV H,CMOV H,DMOV H,EMOV H,HMOV H,LMOV H,MMOV H,A
5Β DAD HLHLDDCX HINR LDCR LMVI LCMAMOV L,BMOV L,CMOV L,DMOV L,EMOV L,HMOV L,LMOV L,MMOV L,A
6Β LXI SPSTAINX SPINR MDCR MMVI MSTCMOV M,BMOV M,CMOV M,DMOV M,EMOV M,HMOV M,LHLTMOV M,A
7Β DAD SPLDADCX SPINR ADCR AMVI ACMCMOV A,BMOV A,CMOV A,DMOV A,EMOV A,HMOV A,LMOV A,MMOV A,A
0ADD BADD CADD DADD EADD HADD LADD MADD ARNZPOP BJNZJMPCNZPUSH BADIRST 0
1ADC BADC CADC DADC EADC HADC LADC MADC ARZRETJZΒ CZCALLACIRST 1
2SUB BSUB CSUB DSUB ESUB HSUB LSUB MSUB ARNCPOP DJNCOUTCNCPUSH DSUIRST 2
3SBB BSBB CSBB DSBB ESBB HSBB LSBB MSBB ARCΒ JCINCCΒ SBIRST 3
4ANA BANA CANA DANA EANA HANA LANA MANA ARPOPOP HJPOXTHLCPOPUSH HANIRST 4
5XRA BXRA CXRA DXRA EXRA HXRA LXRA MXRA ARPEPCHLJPEXCHGCPEΒ XRIRST 5
6ORA BORA CORA DORA EORA HORA LORA MORA ARPPOP PSWJPDICPPUSH PSWORIRST 6
7CMP BCMP CCMP DCMP ECMP HCMP LCMP MCMP ARMSPHLJMEICMΒ CPIRST 7

The 8080 also moved the stack to external memory, rather than using an internal fixed special-purpose stack as in the 8008 and Datapoint 2200. This allowed PUSH and POP instructions to put register data on the stack. Interrupt handling was also improved by adding the Enable Interrupt and Disable Interrupt instructions (EI and DI).14

Intel 8085

The Intel 8085 was designed as a "mid-life kicker" for the 8080, providing incremental improvements while maintaining compatibility. From the hardware perspective, the 8085 was much easier to use than the 8080. While the 8080 required three voltages, the 8085 required a single 5-volt power supply (represented by the "5" in the part number). Moreover, the 8085 eliminated most of the support chips required with the 8080; a working 8085 computer could be built with just three chips. Finally, the 8085 provided additional hardware functionality: better interrupt support and serial I/O.

The Intel 8085, like the 8080 and the 8086, was packaged in a 40-pin DIP. Photo by Thomas Nguyen, (CC BY-SA 4.0).

The Intel 8085, like the 8080 and the 8086, was packaged in a 40-pin DIP. Photo by Thomas Nguyen, (CC BY-SA 4.0).

On the software side, the 8085 is curious: 12 instructions were added to the instruction set (finally using every opcode), but all but two were hidden and left undocumented.16 Moreover, the 8085 added two new condition codes, but these were also hidden. This situation occurred because the 8086 project started up in 1976, near the release of the 8085 chip. Intel wanted the 8086 to be compatible (to some extent) with the 8080 and 8085, but providing new instructions in the 8085 would make compatibility harder. It was too late to remove the instructions from the 8085 chip, so Intel did the next best thing and removed them from the documentation. These instructions are shown in red in the table below. Only the new SIM and RIM instructions were supported, necessary in order to use the 8085's new interrupt and serial I/O features.

Β 0123456701234567
0NOPLXI BSTAX BINX BINR BDCR BMVI BRLCMOV B,BMOV B,CMOV B,DMOV B,EMOV B,HMOV B,LMOV B,MMOV B,A
1DSUBDAD BLDAX BDCX BINR CDCR CMVI CRRCMOV C,BMOV C,CMOV C,DMOV C,EMOV C,HMOV C,LMOV C,MMOV C,A
2ARHLLXI DSTAX DINX DINR DDCR DMVI DRALMOV D,BMOV D,CMOV D,DMOV D,EMOV D,HMOV D,LMOV D,MMOV D,A
3RDELDAD DLDAX DDCX DINR EDCR EMVI ERARMOV E,BMOV E,CMOV E,DMOV E,EMOV E,HMOV E,LMOV E,MMOV E,A
4RIMLXI HSHLDINX HINR HDCR HMVI HDAAMOV H,BMOV H,CMOV H,DMOV H,EMOV H,HMOV H,LMOV H,MMOV H,A
5LDHIDAD HLHLDDCX HINR LDCR LMVI LCMAMOV L,BMOV L,CMOV L,DMOV L,EMOV L,HMOV L,LMOV L,MMOV L,A
6SIMLXI SPSTAINX SPINR MDCR MMVI MSTCMOV M,BMOV M,CMOV M,DMOV M,EMOV M,HMOV M,LHLTMOV M,A
7LDSIDAD SPLDADCX SPINR ADCR AMVI ACMCMOV A,BMOV A,CMOV A,DMOV A,EMOV A,HMOV A,LMOV A,MMOV A,A
0ADD BADD CADD DADD EADD HADD LADD MADD ARNZPOP BJNZJMPCNZPUSH BADIRST 0
1ADC BADC CADC DADC EADC HADC LADC MADC ARZRETJZRSTVCZCALLACIRST 1
2SUB BSUB CSUB DSUB ESUB HSUB LSUB MSUB ARNCPOP DJNCOUTCNCPUSH DSUIRST 2
3SBB BSBB CSBB DSBB ESBB HSBB LSBB MSBB ARCSHLXJCINCCJNKSBIRST 3
4ANA BANA CANA DANA EANA HANA LANA MANA ARPOPOP HJPOXTHLCPOPUSH HANIRST 4
5XRA BXRA CXRA DXRA EXRA HXRA LXRA MXRA ARPEPCHLJPEXCHGCPELHLXXRIRST 5
6ORA BORA CORA DORA EORA HORA LORA MORA ARPPOP PSWJPDICPPUSH PSWORIRST 6
7CMP BCMP CCMP DCMP ECMP HCMP LCMP MCMP ARMSPHLJMEICMJKCPIRST 7

Intel 8086

Following the 8080, Intel intended to revolutionize microprocessors with a 32-bit "micro-mainframe", the iAPX 432. This extremely complex processor implemented objects, memory management, interprocess communication, and fine-grained memory protection in hardware. The iAPX 432 was too ambitious and the project fell behind schedule, leaving Intel vulnerable against competitors such as Motorola and Zilog. Intel quickly threw together a 16-bit processor as a stopgap until the iAPX 432 was ready; to show its continuity with the 8-bit processor line, this processor was called the 8086. The iAPX 432 ended up being one of the great disaster stories of modern computing and quietly disappeared.

The "stopgap" 8086 processor, however, started the x86 architecture that changed the history of Intel. The 8086's victory was powered by the IBM PC, designed in 1981 around the Intel 8088, a variant of the 8086 with a cheaper 8-bit bus. The IBM PC was a rousing success, defining the modern computer and making Intel's fortune. Intel produced a succession of more powerful chips that extended the 8086: 286, 386, 486, Pentium, and so on, leading to the current x86 architecture.

The original IBM PC used the Intel 8088 processor, a variant of the 8086 with an 8-bit bus. Photo by Ruben de Rijcke, (CC BY-SA 3.0).

The original IBM PC used the Intel 8088 processor, a variant of the 8086 with an 8-bit bus. Photo by Ruben de Rijcke, (CC BY-SA 3.0).

The 8086 was a major change from the 8080/8085, jumping from an 8-bit architecture to a 16-bit architecture and expanding from 64K of memory to 1 megabyte. Nonetheless, the 8086's architecture is closely related to the 8080. The designers of the 8086 wanted it to be compatible with the 8080/8085, but the difference was too wide for binary compatibility or even assembly-language compatibility. Instead, the 8086 was designed so a program could translate 8080 assembly language to 8086 assembly language.17 To accomplish this, each 8080 register had a corresponding 8086 register and most 8080 instructions had corresponding 8086 instructions.

The 8086's instruction set was designed with a new concept, the "ModR/M" byte, which usually follows the opcode byte. The ModR/M byte specifies the memory addressing mode and the register (or registers) to use, allowing that information to be moved out of the opcode. For instance, where the 8080 had a quadrant of 64 instructions to move from register to register, the 8086 has a single move instruction, with the ModR/M byte specifying the particular instruction. (The move instruction, however, has variants to handle byte vs. word operations, moves to or from memory, and so forth, so the 8086 ends up with a few move opcodes.) The ModR/M byte preserves the Datapoint 2200's concept of using the same instruction for memory and register operations, but allows a memory address to be provided in the instruction.

The 8086 also cleans up some of the historical baggage in the instruction set, freeing up space in the precious 256 opcodes for new instructions. The conditional call and return instructions were eliminated, while the conditional jumps were expanded. The 8008's RST (Restart) instructions were eliminated, replaced by interrupt vectors.

The 8086 extended its registers to 16 bits and added several new registers. An Intel patent (below) shows that the 8086's registers were originally called A, B, C, D, E, H, and L, matching the Datapoint 2200. The A register was extended to the 16-bit XA register, while the BC, DE, and HL registers were used unchanged. When the 8086 was released, these registers were renamed to AX, CX, DX, and BX respectively.18 In particular, the HL register was renamed to BX; this is why BX can specify a memory address in the ModR/M byte, but AX, CX, and DX can't.

A patent diagram showing the 8086's registers with their original names.  (MP, IJ, and IK are now known as BP, SI, and DI.) From patent US4449184.

A patent diagram showing the 8086's registers with their original names. (MP, IJ, and IK are now known as BP, SI, and DI.) From patent US4449184.

The table below shows the 8086's instruction set, with "b", "w", and "i" indicating byte (8-bit), word (16-bit), and immediate instructions. The Datapoint 2200 instructions (colored) are all still supported. The number of Datapoint instructions looks small because the ModR/M byte collapses groups of old opcodes into a single new one. This opened up space in the opcode table, though, allowing the 8086 to have many new instructions as well as 16-bit instructions.19

Β 0123456701234567
0ADD bADD wADD bADD wADD biADD wiPUSH ESPOP ESINC AXINC CXINC DXINC BXINC SPINC BPINC SIINC DI
1OR bOR wOR bOR wOR biOR wiPUSH CSΒ DEC AXDEC CXDEC DXDEC BXDEC SPDEC BPDEC SIDEC DI
2ADC bADC wADC bADC wADC biADC wiPUSH SSPOP SSPUSH AXPUSH CXPUSH DXPUSH BXPUSH SPPUSH BPPUSH SIPUSH DI
3SBB bSBB wSBB bSBB wSBB biSBB wiPUSH DSPOP DSPOP AXPOP CXPOP DXPOP BXPOP SPPOP BPPOP SIPOP DI
4AND bAND wAND bAND wAND biAND wiES:DAAΒ Β Β Β Β Β Β Β 
5SUB bSUB wSUB bSUB wSUB biSUB wiCS:DASΒ Β Β Β Β Β Β Β 
6XOR bXOR wXOR bXOR wXOR biXOR wiSS:AAAJOJNOJBJNBJZJNZJBEJA
7CMP bCMP wCMP bCMP wCMP biCMP wiDS:AASJSJNSJPEJPOJLJGEJLEJG
0GRP1 bGRP1 wGRP1 bGRP1 wTEST bTEST wXCHG bXCHG wΒ Β RETRETLESLDSMOV bMOV w
1MOV bMOV wMOV bMOV wMOV srLEAMOV srPOPΒ Β RETFRETFINT 3INTINTOIRET
2NOPXCHG CXXCHG DXXCHG BXXCHG SPXCHG BPXCHG SIXCHG DIShift bShift wShift bShift wAAMAADΒ XLAT
3CBWCWDCALLWAITPUSHFPOPFSAHFLAHFESC 0ESC 1ESC 2ESC 3ESC 4ESC 5ESC 6ESC 7
4MOV AL,MMOV AX,MMOV M,ALMOV M,AXMOVS bMOVS wCMPS bCMPS wLOOPNZLOOPZLOOPJCXZIN bIN wOUT bOUT w
5TEST bTEST wSTOS bSTOS wLODS bLODS wSCAS bSCAS wCALLJMPJMPJMPIN bIN wOUT b DXOUT w DX
6MOV AL,iMOV CL,iMOV DL,iMOV BL,iMOV AH,iMOV CH,iMOV DH,iMOV BH,iLOCKΒ REPNZREPZHLTCMCGRP3aGRP3b
7MOV AX,iMOV CX,iMOV DX,iMOV BX,iMOV SP,iMOV BP,iMOV SI,iMOV DI,iCLCSTCCLISTICLDSTDGRP4GRP5

The 8086 has a 16-bit flags register, shown below, but the low byte remained compatible with the 8080. The four highlighted flags (sign, zero, parity, and carry) are the ones originating in the Datapoint 2200.

The flag word of the 8086 contains the original Datapoint 2200 registers.

The flag word of the 8086 contains the original Datapoint 2200 registers.

Modern x86 and x86-64

The modern x86 architecture has extended the 8086 to a 32-bit architecture (IA-32) and a 64-bit architecture (x86-6420), but the Datapoint features remain. At startup, an x86 processor runs in "real mode", which operates like the original 8086. More interesting is 64-bit mode, which has some major architectural changes. In 64-bit mode, the 8086's general-purpose registers are extended to sixteen 64-bit registers (and soon to be 32 registers). However, the original Datapoint registers are special and can still be accessed as byte registers within the corresponding 64-bit register; these are highlighted in the table below.21

General purpose registers in x86-64. From Intel Software Developer's Manual.

General purpose registers in x86-64. From Intel Software Developer's Manual.

The flag register of the 8086 was extended to 32 bits or 64 bits in x86. As the diagram below shows, the original Datapoint 2200 status flags are still there (highlighted in yellow).

The 32-bit and 64-bit flags of x86 contain the original Datapoint 2200 registers. From Intel Software Developer's Manual.

The 32-bit and 64-bit flags of x86 contain the original Datapoint 2200 registers. From Intel Software Developer's Manual.

The instruction set in x86 has been extended from the 8086, mostly through prefixes, but the instructions from the Datapoint 2200 are still there. The ModR/M byte was changed in 32-bit mode so the BX (originally HL) register is no longer special when accessing memory (although it's still special with 16-bit addressing, until Intel removes that in the upcoming x86-S simplification.) I/O ports still exist in x86, although they are viewed as more of a legacy feature: modern I/O devices typically use memory-mapped I/O instead of I/O ports. To summarize, fifty years later, x86-64 is slowly moving away from some of the Datapoint 2200 features, but they are still there.

Conclusions

The modern x86 architecture is descended from the Datapoint 2200's architecture. Because there is backward-compatibility at each step, you should theoretically be able to take a Datapoint 2200 binary, disassemble it to 8008 assembly, automatically translate it to 8080 assembly, automatically convert it to 8086 assembly, and then run it on a modern x86 processor. (The I/O devices would be different and cause trouble, of course.)

The Datapoint 2200's complete instruction set, its flags, and its little-endian architecture have persisted into current processors. This shows the critical importance of backward compatibility to customers. While Intel keeps attempting to create new architectures (iAPX 432, i960, i860, Itanium), customers would rather stay on a compatible architecture. Remarkably, Intel has managed to move from 8-bit computers to 16, 32, and 64 bits, while keeping systems mostly compatible. As a result, design decisions made for the Datapoint 2200 over 50 years ago are still impacting modern computers. Will processors still have the features of the Datapoint 2200 another fifty years from now? I wouldn't be surprised.22

Thanks to Joe Oberhauser for suggesting this topic. I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space so you can follow me there too.

Notes and references

  1. Shift-register memory was also used in the TV Typewriter (1973) and the display storage of the Apple I (1976). However, dynamic RAM (DRAM) rapidly dropped in price, making shift-register memory obsolete by the mid 1970s. (I wrote about the Intel 1405 shift register memory in detail in this article.) ↩

  2. For comparison, the popular PDP-8 minicomputer had just two main registers: the accumulator and a multiplier-quotient register; instructions typically operated on the accumulator and a memory location. The Data General Nova, a minicomputer released in 1969, had four accumulator / index registers. Mainframes generally had many more registers; the IBM System/360 (1964), for instance, had 16 general registers and four floating-point registers. ↩

  3. On the hardware side, instructions were decoded with BCD-to-decimal decoder chips (type 7442). These decoders normally decoded a 4-bit BCD value into one of 10 output lines. In the Datapoint 2200, they decoded a 3-bit value into one of 8 output lines, and the other two lines were ignored. This allowed the high-bit line to be used as a selection line; if it was set, none of the 8 outputs would be active. ↩

  4. These bit patterns map cleanly onto octal, so the opcodes are clearest when specified in octal. This octal structure has persisted in Intel processors including modern x86 processors. Unfortunately, Intel invariably specifies the opcodes in hexadecimal rather than octal, which obscures the underlying structure. This structure is described in detail in The 80x86 is an Octal Machine. ↩

  5. It is unusual for an instruction set to require memory addresses to be loaded into a register in order to access memory. This technique was common in microcode, where memory addresses were loaded into the Memory Address Register (MAR). As pwg pointed out, the CDC mainframes (e.g. 6600) had special address registers; when you changed an address register, the specified memory location was read or written to the corresponding operand register automatically.

    At first, I thought that serial memory might motivate the use of an address register, but I don't think there's a connection. Most likely, the Datapoint 2200 used these techniques to create a simple, orthogonal instruction set that was easy to decode, and they weren't particularly concerned with performance. ↩

  6. The instruction tables in this article are different from most articles, because I use octal instead of hexadecimal. (Displaying an octal-based instruction in a hexadecimal table obscures much of the underlying structure.) To display the table in octal, I break it into four quadrants based on the top octal digit of a three-digit opcode: 0, 1, 2, or 3. The digit 0-7 along the left is the middle octal digit and the digit along the top is the low octal digit. ↩

  7. The regular pattern of Load instructions is broken by the NOP and HALT instructions. All the register-to-register load instructions along the diagonal accomplish nothing since they move a register to itself, but only the first one is explicitly called NOP. Moving a memory location to itself doesn't make sense, so its opcode is assigned the HALT instruction. Note that the all-0's opcode and the all-1's opcode are both HALT instructions. This is useful since it can stop execution if the program tries executing uninitialized memory. ↩

  8. You might think that Datapoint and Intel used the same ALU operations simply because they are the obvious set of 8 operations. However, if you look at other processors around that time, they use a wide variety of ALU operations. Similarly, the status flags in the Datapoint 2200 aren't the obvious set; systems with four flags typically used Sign, Carry, Zero, and Overflow (not Parity). Parity is surprisingly expensive to implement on a standard processor, but (as Philip Freidin pointed out) parity is cheap on a serial processor like the Datapoint 2200. Intel processors didn't provide an Overflow flag until the 8086; even the 8080 didn't have it although the Motorola 6800 and MOS 6502 did. The 8085 implemented an overflow flag (V) but it was left undocumented. ↩

  9. You might wonder if the Datapoint 2200 (and 8008) could be considered RISC processors since they have simple, easy-to-decode instruction sets. I think it is a mistake to try to wedge every processor into the RISC or CISC categories (Reduced Instruction Set Computer or Complex Instruction Set Computer). In particular, the Datapoint 2200 wasn't designed with the RISC philosophy (make a processor more powerful by simplifying the instruction set), its instruction set architecture is very different from RISC chips, and its implementation is different from RISC chips. Similarly, it wasn't designed with a CISC philosophy (make a processor more powerful by narrowing the semantic gap with high-level languages) and it doesn't look like a CISC chip.

    So where does that leave the Datapoint 2200? In "RISC: Back to the future?", famed computer architect Gordon Bell uses the term MISC (Minimal Instruction Set Computer) to describe the architecture of simple, early computers and microprocessors such as the Manchester Mark I (1948), the PDP-8 minicomputer (1966), and the Intel 4004 (1971). Computer architecture evolved from these early hardwired "simple computers" to microprogrammed processors, processors with cache, and hardwired, pipelined processors. "Minimal Instruction Set Computer" seems like a good description of the Datapoint 2200, since it is about the smallest, simplest processor that could get the job done. ↩

  10. Many people think that the Intel 8008 is an extension of the 4-bit Intel 4004 processor, but they are completely unrelated aside from the part numbers. The Intel 4004 is a 4-bit processor designed to implement a calculator for a company called Busicom. Its architecture is completely different from the 8008. In particular, the 4004 is a "Harvard architecture" system, with data storage and instruction storage completely separate. The 4004 also has a fairly strange instruction set, designed for calculators. For instance, it has a special instruction to convert a keyboard scan code to binary. The 4004 team and the 8008 team at Intel had many people in common, however, so the two chips have physical layouts (floorplans) that are very similar. ↩

  11. In this article, I'm focusing on the Datapoint 2200 Version I. Any time I refer to the Datapoint 2200, I mean the version I specifically. The Version II has an expanded instruction set, but it was expanded in an entirely different direction from the Intel 8080, so it's not relevant to this post. The Version II is interesting, however, since it provides a perspective of how the Intel 8080 could have developed in an "alternate universe". ↩

  12. Federico Faggin wrote The Birth of the Microprocessor in Byte Magazine, March 1992. This article describes in some detail the creation of the 8008 and 8080.

    The Oral History of the 8080 discusses many of the problems with the 8008 and how the 8080 addressed them. (See page 4.) Masatoshi Shima, one of the architects of the 4004, described five problems with the 8008: It was slow because it used two clock cycles per state. It had no general-purpose stack and was weak with interrupts. It had limited memory and I/O space. The instruction set was primitive, with only 8-bit data, limited addressing, and a single address pointer register. Finally, the system bus required a lot of interface circuitry. (See page 7.) ↩

  13. The 8080 is often said to be the "first truly usable microprocessor". Supposedly the source of this quote is Forgotten PC history, but the statement doesn't appear there. I haven't been able to find the original source of this statement, so let me know. In any case, I don't think that statement is particularly accurate, as the Motorola 6800 was "truly usable" and came out before the Intel 8080.

    The 8080 was first in one important way, though: it was Intel's first microprocessor that was designed with feedback from customers. Both the 4004 and the 8008 were custom chips for a single company. The 8080, however, was based on extensive customer feedback about the flaws in the 8008 and what features customers wanted. The 8080 oral history discusses this in more detail. ↩

  14. The 8008 was built with PMOS circuitry, while the 8080 was built with NMOS. This may seem like a trivial difference, but NMOS provided much superior performance. NMOS became the standard microprocessor technology until the rise of CMOS in the 1980s, combining NMOS and PMOS to dramatically reduce power consumption.

    Another key hardware improvement was that the 8080 used a 40-pin package, compared to the 18-pin package of the 8008. Intel had long followed the "religion" of small 16-pin packages, and only reluctantly moved to 18 pins (as in the 8008). However, by the time the 8080 was introduced, Intel recognized the utility of industry-standard 40-pin packages. The additional pins made the 8080 much easier to interface to a system. Moreover, the 8080's 16-bit address bus supported four times the memory of the 8008's 14-bit address bus. (The 40-pin package was still small for the time; some companies used 50-pin or 64-pin packages for microprocessors.) ↩

  15. The 8080 is not binary-compatible with the 8008 because almost all the instructions were shifted to different opcodes. One important but subtle change was that the 8 register/memory codes were reordered to start with B instead of A. The motivation is that this gave registers in a 16-bit register pair (BC, DE, or HL) codes that differ only in the low bit. This makes it easier to specify a register pair with a two-bit code. ↩

  16. Stan Mazor (one of the creators of the 4004 and 8080) explained that the 8085 removed 10 of the 12 new instructions because "they would burden the 8086 instruction set." Because the decision came near the 8085's release, they would "leave all 12 instructions on the already designed 8085 CPU chip, but document and announce only two of them" since modifying a CPU is hard but modifying a CPU's paper reference manual is easy.

    Several of the Intel 8086 engineers provided a similar explanation in Intel Microprocessors: 8008 to 8086: While the 8085 provided the new RIM and SIM instructions, "several other instructions that had been contemplated were not made available because of the software ramifications and the compatibility constraints they would place on the forthcoming 8086."

    For more information on the 8085's undocumented instructions, see Unspecified 8085 op codes enhance programming. The two new condition flags were V (2's complement overflow) and X5 (underflow on decrement or overflow on increment). The opcodes were DSUB (double (i.e. 16-bit) subtraction), ARHL (arithmetic shift right of HL), RDEL (rotate DE left through carry), LDHI (load DE with HL plus an immediate byte), LDSI (load DE with SP plus an immediate byte), RSTV (restart on overflow), LHLX (load HL indirect through DE), SHLX (store HL indirect through DE), JX5 (jump on X5), and JNX5 (jump on not X5). ↩

  17. Conversion from 8080 assembly code to 8086 assembly code was performed with a tool called CONV86. Each line of 8080 assembly code was converted to the corresponding line (or sometimes a few lines) of 8086 assembly code. The program wasn't perfect, so it was expected that the user would need to do some manual editing. In particular, CONV86 couldn't handle self-modifying code, where the program changed its own instructions. (Nowadays, self-modifying code is almost never used, but it was more common in the 1970s in order to make code smaller and get more performance.) CONV86 also didn't handle the 8085's RIM and SIM instructions, recommending a rewrite if code used these instructions heavily.

    Writing programs in 8086 assembly code manually was better, of course, since the program could take advantage of the 8086's new features. Moreover, a program converted by CONV86 might be 25% larger, due to the 8086's use of two-byte instructions and inefficiencies in the conversion. ↩

  18. This renaming is why the instruction set has the registers in the order AX, CX, DX, BX, rather than in alphabetical order as you might expect. The other factor is that Intel decided that AX, BX, CX, and DX corresponded to Accumulator, Base, Count, and Data, so they couldn't assign the names arbitrarily. ↩

  19. A few notes on how the 8086's instructions relate to the earlier machines, since the ModR/M byte and 8- vs. 16-bit instructions make things a bit confusing. For an instruction like ADD, I have three 8-bit opcodes highlighted: an add to memory/register, an add from memory/register, and an immediate add. The neighboring unhighlighted opcodes are the corresponding 16-bit versions. Likewise, for MOV, I have highlighted the 8-bit moves to/from a register/memory. ↩

  20. Since the x86's 32-bit architecture is called IA-32, you might expect that IA-64 would be the 64-bit architecture. Instead, IA-64 is the completely different architecture used in the ill-fated Itanium. IA-64 was supposed to replace IA-32, despite being completely incompatible. Since AMD was cut out of IA-64, AMD developed their own 64-bit extension of the existing x86 architecture and called it AMD64. Customers flocked to this architecture while the Itanium languished. Intel reluctantly copied the AMD64 architecture, calling it Intel 64. ↩

  21. The x86 architecture allows byte access to certain parts of the larger registers (accessing AL, AH, etc.) as well as word and larger accesses. These partial-width reads and writes to registers make the implementation of the processor harder due to register renaming. The problem is that writing to part of a register means that the register's value is a combination of the old and new values. The Register Alias Table in the P6 architecture deals with this by adding a size field to each entry. If you write a short value and then read a longer value, the pipeline stalls to figure out the right value. Moreover, some 16-bit code uses the two 8-bit parts of a register as independent registers. To support this, the Register Alias Table keeps separate entries for the high and low byte. (For details, see the book Modern Processor Design, in particular the chapter on Intel's P6 Microarchitecture.) The point of this is that obscure features of the Datapoint 2200 (such as H and L acting as a combined register) can cause implementation difficulties 50 years later. ↩

  22. Some miscellaneous references: For a detailed history of the Datapoint 2200, see Datapoint: The Lost Story of the Texans Who Invented the Personal Computer Revolution. The 8008 oral history provides a lot of interesting information on the development of the 8008. For another look at the Datapoint 2200 and instruction sets, see Comparing Datapoint 2200, 8008, 8080 and Z80 Instruction Sets. ↩

A close look at the 8086 processor's bus hold circuitry

5 August 2023 at 16:39

The Intel 8086 microprocessor (1978) revolutionized computing by founding the x86 architecture that continues to this day. One of the lesser-known features of the 8086 is the "hold" functionality, which allows an external device to temporarily take control of the system's bus. This feature was most important for supporting the 8087 math coprocessor chip, which was an option on the IBM PC; the 8087 used the bus hold so it could interact with the system without conflicting with the 8086 processor.

This blog post explains in detail how the bus hold feature is implemented in the processor's logic. (Be warned that this post is a detailed look at a somewhat obscure feature.) I've also found some apparently undocumented characteristics of the 8086's hold acknowledge circuitry, designed to make signal transition faster on the shared control lines.

The die photo below shows the main functional blocks of the 8086 processor. In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured. The 8086 is partitioned into a Bus Interface Unit (upper) that handles bus traffic, and an Execution Unit (lower) that executes instructions. The two units operate mostly independently, which will turn out to be important. The Bus Interface Unit handles read and write operations as requested by the Execution Unit. The Bus Interface Unit also prefetches instructions that the Execution Unit uses when it needs them. The hold control circuitry is highlighted in the upper right; it takes a nontrivial amount of space on the chip. The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins. I've labeled the MN/MX, HOLD, and HLDA pads; these are the relevant signals for this post.

The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.

The 8086 die under the microscope, with the main functional blocks and relevant pins labeled. Click this image (or any other) for a larger version.

How bus hold works

In an 8086 system, the processor communicates with memory and I/O devices over a bus consisting of address and data lines along with various control signals. For high-speed data transfer, it is useful for an I/O device to send data directly to memory, bypassing the processor; this is called DMA (Direct Memory Access). Moreover, a co-processor such as the 8087 floating point unit may need to read data from memory. The bus hold feature supports these operations: it is a mechanism for the 8086 to give up control of the bus, letting another device use the bus to communicate with memory. Specifically, an external device requests a bus hold and the 8086 stops putting electrical signals on the bus and acknowledges the bus hold. The other device can now use the bus. When the other device is done, it signals the 8086, which then resumes its regular bus activity.

Most things in the 8086 are more complicated than you might expect, and the bus hold feature is no exception, largely due to the 8086's minimum and maximum modes. The 8086 can be designed into a system in one of two waysβ€”minimum mode and maximum modeβ€”that redefine the meanings of the 8086's external pins. Minimum mode is designed for simple systems and gives the control pins straightforward meanings such as indicating a read versus a write. Minimum mode provides bus signals that were similar to the earlier 8080 microprocessor, making migration to the 8086 easier. On the other hand, maximum mode is designed for sophisticated, multiprocessor systems and encodes the control signals to provide richer system information.

In more detail, minimum mode is selected if the MN/MX pin is wired high, while maximum mode is selected if the MN/MX pin is wired low. Nine of the chip's pins have different meanings depending on the mode, but only two pins are relevant to this discussion. In minimum mode, pin 31 has the function HOLD, while pin 30 has the function HLDA (Hold Acknowlege). In maximum mode, pin 31 has the function RQ/GT0', while pin 30 has the function RQ/GT1'.

I'll start by explaining how a hold operation works in minimum mode. When an external device wants to use the bus, it pulls the HOLD pin high. At the end of the current bus cycle, the 8086 acknowledges the hold request by pulling HLDA high. The 8086 also puts its bus output pins into "tri-state" mode, in effect disconnecting them electrically from the bus. When the external device is done, it pulls HOLD low and the 8086 regains control of the bus. Don't worry about the details of the timing below; the key point is that a device pulls HOLD high and the 8086 responds by pulling HLDA high.

This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14.

This diagram shows the HOLD/HLDA sequence. From iAPX 86,88 User's Manual, Figure 4-14.

The 8086's maximum mode is more complex, allowing two other devices to share the bus by using a priority-based scheme. Maximum mode uses two bidirectional signals, RQ/GT0 and RQ/GT1.2 When a device wants to use the bus, it issues a pulse on one of the signal lines, pulling it low. The 8086 responds by pulsing the same line. When the device is done with the bus, it issues a third pulse to inform the 8086. The RQ/GT0 line has higher priority than RQ/GT1, so if two devices request the bus at the same time, the RQ/GT0 device wins and the RQ/GT1 device needs to wait.1 Keep in mind that the RQ/GT lines are bidirectional: the 8086 and the external device both use the same line for signaling.

This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16.

This diagram shows the request/grant sequence. From iAPX 86,88 User's Manual, Figure 4-16.

The bus hold does not completely stop the 8086. The hold operation stops the Bus Interface Unit, but the Execution Unit will continue executing instructions until it needs to perform a read or write, or it empties the prefetch queue. Specifically, the hold signal blocks the Bus Interface Unit from starting a memory cycle and blocks an instruction prefetch from starting.

Bus sharing and the 8087 coprocessor

Probably the most common use of the bus hold feature was to support the Intel 8087 math coprocessor. The 8087 coprocessor greatly improved the performance of floating-point operations, making them up to 100 times faster. As well as floating-point arithmetic, the 8087 supported trigonometric operations, logarithms and powers. The 8087's architecture became part of later Intel processors, and the 8087's instructions are still a part of today's x86 computers.3

The 8087 had its own registers and didn't have access to the 8086's registers. Instead, the 8087 could transfer values to and from the system's main memory. Specifically, the 8087 used the RQ/GT mechanism (maximum mode) to take control of the bus if the 8087 needed to transfer operands to or from memory.4 The 8087 could be installed as an option on the original IBM PC, which is why the IBM PC used maximum mode.

The enable flip-flop

The circuit is built from six flip-flops. The flip-flops are a bit different from typical D flip-flops, so I'll discuss the flip-flop behavior before explaining the circuit.

A flip-flop can store a single bit, 0 or 1. Flip flops are very important in the 8086 because they hold information (state) in a stable way, and they synchronize the circuitry with the processor's clock. A common type of flip-flop is the D flip-flop, which takes a data input (D) and stores that value. In an edge-triggered flip-flop, this storage happens on the edge when the clock changes state from low to high.5 (Except at this transition, the input can change without affecting the output.) The output is called Q, while the inverted output is called Q-bar.

The symbol for the D flip-flop with enable.

The symbol for the D flip-flop with enable.

Many of the 8086's flip-flops, including the ones in the hold circuit, have an "enable" input. When the enable input is high, the flip-flop records the D input, but when the enable input is low, the flip-flop keeps its previous value. Thus, the enable input allows the flip-flop to hold its value for more than one clock cycle. The enable input is very important to the functioning of the hold circuit, as it is used to control when the circuit moves to the next state.

How bus hold is implemented (minimum mode)

I'll start by explaining how the hold circuitry works in minimum mode. To review, in minimum mode the external device requests a hold through the HOLD input, keeping the input high for the duration of the request. The 8086 responds by pulling the hold acknowledge HLDA output high for the duration of the hold.

In minimum mode, only three of the six flip-flops are relevant. The diagram below is highly simplified to show the essential behavior. (The full schematic is in the footnotes.6) At the left is the HOLD signal, the request from the external device.

Simplified diagram of the circuitry for minimum mode.

Simplified diagram of the circuitry for minimum mode.

When a HOLD request comes in, the first flip-flop is activated, and remains activated for the duration of the request. The second flip-flop waits if any condition is blocking the hold request: a LOCK instruction, an unaligned memory access, or so forth. When the HOLD can proceed, the second flip-flop is enabled and it latches the request. The second flip-flop controls the internal hold signal, causing the 8086 to stop further bus activity. The third flip-flop is then activated when the current bus cycle (if any) completes; when it latches the request, the hold is "official". The third flip-flop drives the external HLDA (Hold Acknowledge) pin, indicating that the bus is free. This signal also clears the bus-enabled latch (elsewhere in the 8086), putting the appropriate pins into floating tri-state mode. The key point is that the flip-flops control the timing of the internal hold and the external HLDA, moving to the next step as appropriate.

When the external device signals an end to the hold by pulling the HOLD pin low, the process reverses. The three flip-flops return to their idle state in sequence. The second flip-flop clears the internal hold signal, restarting bus activity. The third flip-flop clears the HLDA pin.7

How bus hold is implemented (maximum mode)

The implementation of maximum mode is tricky because it uses the same circuitry as minimum mode, but the behavior is different in several ways. First, minimum mode and maximum mode operate on opposite polarity: a hold is requested by pulling HOLD high in minimum mode versus pulling a request line low in maximum mode. Moreover, in minimum mode, a request on the HOLD pin triggers a response on the opposite pin (HLDA), while in maximum mode, a request and response are on the same pin. Finally, using the same pin for the request and grant signals requires the pin to act as both an input and an output, with tricky electrical properties.

In maximum mode, the top three flip-flops handle the request and grant on line 0, while the bottom three flip-flops handle line 1. At a high level, these flip-flops behave roughly the same as in the minimum mode case, with the first flip-flop tracking the hold request, the second flip-flop activated when the hold is "approved", and the third flip-flop activated when the bus cycle completes. An RQ 0 input will generate a GT 0 output, while a RQ 1 input will generate a GT 1 output. The diagram below is highly simplified, but illustrates the overall behavior. Keep in mind that RQ 0, GT 0, and HOLD use the same physical pin, as do RQ 1, GT 1, and HLDA.

Simplified diagram of the circuitry for maximum mode.

Simplified diagram of the circuitry for maximum mode.

In more detail, the first flip-flop converts the pulse request input into a steady signal. This is accomplished by configuring the first flip-flop is configured with to toggle on when the request pulse is received and toggle off when the end-of-request pulse is received.10 The toggle action is implemented by feeding the output pulse back to the input, inverted (A); since the flip-flop is enabled by the RQ input, the flip-flop holds its value until an input pulse. One tricky part is that the acknowledge pulse must not toggle the flip-flop. This is accomplished by using the output signal to block the toggle. (To keep the diagram simple, I've just noted the "block" action rather than showing the logic.)

As before, the second flip-flop is blocked until the hold is "authorized" to proceed. However, the circuitry is more complicated since it must prioritize the two request lines and ensure that only one hold is granted at a time. If RQ0's first flip-flop is active, it blocks the enable of RQ1's second flip-flop (B). Conversely, if RQ1's second flip-flop is active, it blocks the enable of RQ0's second flip-flop (C). Note the asymmetry, blocking on RQ0's first flip-flop and RQ1's second flip-flop. This enforces the priority of RQ0 over RQ1, since an RQ0 request blocks RQ1 but only an RQ1 "approval" blocks RQ0.

When the second flip-flop is activated in either path, it triggers the internal hold signal (D).8 As before, the hold request is latched into the third flip-flop when any existing memory cycle completes. When the hold request is granted, a pulse is generated (E) on the corresponding GT pin.9

The same circuitry is used for minimum mode and maximum mode, although the above diagrams show differences between the two modes. How does this work? Essentially, logic gates are used to change the behavior between minimum mode and maximum mode as required. For the most part, the circuitry works the same, so only a moderate amount of logic is required to make the same circuitry work for both. On the schematic, the signal MN is active during minimum mode, while MX is active during maximum mode, and these signals control the behavior.

The "hold ok" circuit

As usually happens with the 8086, there are a bunch of special cases when different features interact. One special case is if a bus hold request comes in while the 8086 is acknowledging an interrupt. In this case, the interrupt takes priority and the bus hold is not processed until the interrupt acknowledgment is completed. A second special case is if the bus hold occurs while the 8086 is halted. In this case, the 8086 issues a second HALT indicator at the end of the bus hold. Yet another special case is the 8086's LOCK prefix, which locks the use of the bus for the following instruction, so a bus hold request is not honored until the locked instruction has completed. Finally, the 8086 performs an unaligned word access to memory by breaking it into two 8-bit bus cycles; these two cycles can't be interrupted by a bus hold.

In more detail, the "hold ok" circuit determines at each cycle if a hold could proceed. There are several conditions under which the hold can proceed:

  • The bus cycle is `T2`, except if an unaligned bus operation is taking place (i.e. a word split into two byte operations), or
  • A memory cycle is not active and a microcode memory operation is not taking place, or
  • A memory cycle is not active and a hold is currently active.

The first case occurs during bus (memory) activity, where a hold request before cycle T2 will be handled at the end of that cycle. The second case allows a hold if the bus is inactive. But if microcode is performing a memory operation, the hold will be delayed, even if the request is just starting. The third case is opposite from the other two: it enables the flip flop so a hold request can be dropped. (This ensures that the hold request can still be dropped in the corner case where a hold starts and then the microcode makes a memory request, which will be blocked by the hold.)

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.

The "hold ok" circuit. This has been rearranged from the schematic to make the behavior more clear.

An instruction with the LOCK prefix causes the bus to be locked against other devices for the duration of the instruction. Thus, a hold cannot be granted while the instruction is running. This is implemented through a separate path. This logic is between the output of the first (request) flip-flop and the second (accepted) flip-flop, tied into the LOCK signal. Conceptually, it seems that the LOCK signal should block hold-ok and thus block the second (accepted) flip-flop from being enabled. But instead, the LOCK signal blocks the data path, unless the request has already been granted. I think the motivation is to allow dropping of a hold request to proceed uninterrupted. In other words, LOCK prevents a hold from being accepted, but it doesn't prevent a hold from being dropped, and it was easier to implement this in the data path.

The pin drive circuitry

The circuitry for the HOLD/RQ0/GT0 and HLDA/RQ1/GT1 pins is somewhat complicated, since they are used for both input and output. In minimum mode, the HOLD pin is an input, while the HLDA pin is an output. In maximum mode, both pins act as an input, with a low-going pulse from an external device to start or stop the hold. But the 8086 also issues pulses to grant the hold. Pull-up resistors inside the 8086 to ensure that the pins remain high (idle) when unused. Finally, an undocumented active pull-up system restores a pin to a high state after it is pulled low, providing faster response than the resistor.

The schematic below shows the heart of the tri-state output circuit. Each pin is connected to two large output MOSFETs, one to drive the pin high and one to drive the pin low. The transistors have separate control lines; if both control lines are low, both transistors are off and the pin floats in the "tri-state" condition. This permits the pin to be used as an input, driven by an external device. The pull-up resistor keeps the pin in a high state.

The tri-state output circuit for each hold pin.

The tri-state output circuit for each hold pin.

The diagram below shows how this circuitry looks on the die. In this image, the metal and polysilicon layers have been removed with acid to show the underlying doped silicon regions. The thin white stripes are transistor gates where polysilicon wiring crossed the silicon. The black circles are vias that connected the silicon to the metal on top. The empty regions at the right are where the metal pads for HOLD and HLDA were. Next to the pads are the large transistors to pull the outputs high or low. Because the outputs require much higher current than internal signals, these transistors are much larger than logic transistors. They are composed of several transistors placed in parallel, resulting in the parallel stripes. The small pullup resistors are also visible. For efficiency, these resistors are actually depletion-mode transistors, specially doped to act as constant-current sources.

The HOLD/HLDA pin circuitry on the die.

The HOLD/HLDA pin circuitry on the die.

At the left, some of the circuitry is visible. The large output transistors are driven by "superbuffers" that provide more current than regular NMOS buffers. (A superbuffer uses separate transistors to pull the signal high and low, rather than using a pull-up to pull the signal high as in a standard NMOS gate.) The small transistors are the pass transistors that gate output signals according to the clock. The thick rectangles are crossovers, allowing the vertical metal wiring (no longer visible) to cross over a horizontal signal in the signal layer. The 8086 has only a single metal layer, so the layout requires a crossover if signals will otherwise intersect. Because silicon's resistance is much higher than metal's resistance, the crossover is relatively wide to reduce the resistance.

The problem with a pull-up resistor is that it is relatively slow when connected to a line with high capacitance. You essentially end up with a resistor-capacitor delay circuit, as the resistor slowly charges the line and brings it up to a high level. To get around this, the 8086 has an active drive circuit to pulse the RQ/GT lines high to pull them back from a low level. This circuit pulls the line high one cycle after the 8086 drops it low for a grant acknowledge. This circuit also pulls the line high after the external device pulls it low.11 (The schematic for this circuit is in the footnotes.) The curious thing is that I couldn't find this behavior documented in the datasheets. The datasheets describe the internal pull-up resistor, but don't mention that the 8086 actively pulls the lines high.12

Conclusions

The hold circuitry was a key feature of the 8086, largely because it was necessary for the 8087 math coprocessor chip. The hold circuitry seems like it should be straightforward, but there are many corner cases in this circuitry: it interacts with unaligned memory accesses, the LOCK prefix, and minimum vs. maximum modes. As a result, it is fairly complicated.

Personally, I find the hold circuitry somewhat unsatisfying to study, with few fundamental concepts but a lot of special-case logic. The circuitry seems overly complicated for what it does. Much of the complexity is probably due to the wildly different behavior of the pins between minimum and maximum mode. Intel should have simply used a larger package (like the Motorola 68000) rather than re-using pins to support different modes, as well as using the same pin for a request and a grant. It's impressive, however, the same circuitry was made to work for both minimum and maximum modes, despite using completely different signals to request and grant holds. This circuitry must have been a nightmare for Intel's test engineers, trying to ensure that the circuitry performed properly when there were so many corner cases and potential race conditions.

I plan to write more on the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. The timing of priority between RQ0 and RQ1 is left vague in the documentation. In practice, even if RQ1 is requested first, a later RQ0 can still preempt it until the point that RQ1 is internally granted (i.e. the second flip-flop is activated). This happens before the hold is externally acknowledged, so it's not obvious to the user at what point priority no longer applies. ↩

  2. The RQ/GT0 and RQ/GT1 signals are active-low. These signals should have an overbar to indicate this, but it makes the page formatting ugly :-) ↩

  3. Modern x86 processors still support the 8087 (x87) instruction set. Starting with the 80486DX, the floating point unit was included on the CPU die, rather than as an external coprocessor. The x87 instruction set used a stack-based model, which made it harder to parallelize. To mitigate this, Intel introduced SSE in 1999, a different set of floating point instructions that worked on an independent register set. The x87 instructions are now considered mostly obsolete and are deprecated in 64-bit Windows. ↩

  4. The 8087 provides another RQ/GT input line for an external device. Thus, two external devices can still be used in a system with an 8087. That is, although the 8087 uses up one of the 8086's two RQ/GT lines, the 8087 provides another one, so there are still two lines available. The 8087 has logic to combine its bus requests and external bus requests into a single RQ/GT line to the 8086. ↩

  5. Confusingly, some of the flip-flops in the hold circuit transistion when the clock goes high, while others use the inverted clock signal and transition when the clock goes low. Moreover, the flip-flops are inconsistent about how they treat the data. In each group of three flip-flops, the first flip-flop is active-high, while the remaining flip-flops are active-low. For the most part, I'll ignore this in the discussion. You can look at the schematic if you want to know the details. ↩

  6. The schematics below shows my reverse-engineered schematic for the hold circuitry. I have partitioned the schematic into the hold logic and the output driver circuitry. This split matches the physical partitioning of the circuitry on the die.

    In the first schematic, the upper part handles HOLD and request0, while the lower part handles request1. There is some circuitry in the middle to handle the common enabling and to generate the internal hold signal. I won't explain the circuitry in much detail, but there are a few things I want to point out. First, even though the hold circuit seems like it should be simple, there are a lot of gates connected in complicated ways. Second, although there are many inverters, NAND, and NOR gates, there are also complex gates such as AND-NOR, OR-NAND, AND-OR-NAND, and so forth. These are implemented as single gates. Due to how gates are constructed from NMOS transistors, it is just as easy to build a hierarchical gate as a single gate. (The last step must be inversion, though.) The XOR gates are more complex; they are constructed from a NOR gate and an AND-NOR gate.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

    Schematic of the hold circuitry. Click this image (or any other) for a larger version.

    The schematic below shows the output circuits for the two pins. These circuits are similar, but have a few differences because only the bottom one is used as an output (HLDA) in minimum mode. Each circuit has two inputs: what the current value of the pin is, and what the desired value of the pin is.

    Schematic of the pin output circuits.

    Schematic of the pin output circuits.

     ↩

  7. Interestingly, the external pins aren't taken out of tri-state mode immediately when the HLDA signal is dropped. Instead, the 8086's bus drivers are re-enabled when a bus cycle starts, which is slightly later. The bus circuitry has a separate flip-flop to manage the enable/disable state, and the start of a bus cycle is what re-enables the bus. This is another example of behavior that the documentation leaves ambiguous. ↩

  8. There's one more complication for the hold-out signal. If a hold is granted on one line, a request comes in on the other line, and then the hold is released on the first line, the desired behavior is for the bus to remain in the hold state as the hold switches to the second line. However, because of the way a hold on line 1 blocks a hold on line 0, the GT1 second flip-flop will drop a cycle before the GT0 second flip-flop is activated. This would cause hold-out to drop for a cycle and the 8086 could start unwanted bus activity. To prevent this case, the hold-out line is also activated if there is an RQ0 request and RQ1 is granted. This condition seems a bit random but it covers the "gap". I have to wonder if Intel planned the circuit this way, or they added the extra test as a bug fix. (The asymmetry of the priority circuit causes this problem to occur only going from a hold on line 1 to line 0, not the other direction.) ↩

  9. The pulse-generating circuit is a bit tricky. A pulse signal is generated if the request has been accepted, has not been granted, and will be granted on the next clock (i.e. no memory request is active so the flip-flop is enabled). (Note that the pulse will be one cycle wide, since granting the request on the next cycle will cause the conditions to be no longer satisfied.) This provides the pulse one clock cycle before the flip-flop makes it "official". Moreover, the signals come from the inverted Q outputs from the flip-flops, which are updated half a clock cycle earlier. The result is that the pulse is generated 1.5 clock cycles before the flip-flop output. Presumably the point of this is to respond to hold requests faster, but it seems overly complicated. ↩

  10. The request pulse is required to be one clock cycle wide. The feedback loop shows why: if the request is longer than one clock cycle, the first flip-flop will repeatedly toggle on and off, resulting in unexpected behavior. ↩

  11. The details of the active pull-up circuitry don't make sense to me. First it XORs the state of the pin with the desired state of the pin and uses this signal to control a multiplexer, which generates the pull-up action based on other gates. The result of all this ends up being simply NAND, implemented with excessive gates. Another issue is that minimum mode blocks the active pull-up, which makes sense. But there are additional logic gates so minimum mode can affect the value going into the multiplexer, which gets blocked in minimum mode, so that logic seems wasted. There are also two separate circuits to block pull-up during reset. My suspicion is that the original logic accumulated bug fixes and redundant logic wasn't removed. But it's possible that the implementation is doing something clever that I'm just missing. ↩

  12. My analysis of the RQ/GT lines being pulled high is based on simulation. It would be interesting to verify this behavior on a physical 8086 chip. By measuring the current out of the pin, the pull-up pulses should be visible. ↩

Undocumented 8086 instructions, explained by the microcode

16 July 2023 at 06:28

What happens if you give the Intel 8086 processor an instruction that doesn't exist? A modern microprocessor (80186 and later) will generate an exception, indicating that an illegal instruction was executed. However, early microprocessors didn't include the circuitry to detect illegal instructions, since the chips didn't have transistors to spare. Instead these processors would do something, but the results weren't specified.1

The 8086 has a number of undocumented instructions. Most of them are simply duplicates of regular instructions, but a few have unexpected behavior, such as revealing the values of internal, hidden registers. In the 8086, most instructions are implemented in microcode, so examining the 8086's microcode can explain why these instructions behave the way they do.

The photo below shows the 8086 die under a microscope, with the important functional blocks labeled. The metal layer is visible, while the underlying silicon and polysilicon wiring is mostly hidden. The microcode ROM and the microcode address decoder are in the lower right. The Group Decode ROM (upper center) is also important, as it performs the first step of instruction decoding.

The 8086 die under a microscope, with main functional blocks labeled. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. Click on this image (or any other) for a larger version.

Microcode and 8086 instruction decoding

You might think that machine instructions are the basic steps that a computer performs. However, instructions usually require multiple steps inside the processor. One way of expressing these multiple steps is through microcode, a technique dating back to 1951. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task.

The 8086's microcode ROM holds 512 micro-instructions, each 21 bits wide. Each micro-instruction performs two actions in parallel. First is a move between a source and a destination, typically registers. Second is an operation that can range from an arithmetic (ALU) operation to a memory access. The diagram below shows the structure of a 21-bit micro-instruction, divided into six types.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

When executing a machine instruction, the 8086 performs a decoding step. Although the 8086 is a 16-bit processor, its instructions are based on bytes. In most cases, the first byte specifies the opcode, which may be followed by additional instruction bytes. In other cases, the byte is a "prefix" byte, which changes the behavior of the following instruction. The first byte is analyzed by something called the Group Decode ROM. This circuit categorizes the first byte of the instruction into about 35 categories that control how the instruction is decoded and executed. One category is "1-byte logic"; this indicates a one-byte instruction or prefix that is simple and implemented by logic circuitry in the 8086. For instructions in this category, microcode is not involved while the remaining instructions are implemented in microcode. Many of these instructions are in the "two-byte ROM" category indicating that the instruction has a second byte that also needs to be decoded by microcode. This second byte, called the ModR/M byte, specifies that memory addressing mode or registers that the instruction uses.

The next step is the microcode's address decoder circuit, which determines where to start executing microcode based on the opcode. Conceptually, you can think of the microcode as stored in a ROM, indexed by the instruction opcode and a few sequence bits. However, since many instructions can use the same microcode, it would be inefficient to store duplicate copies of these routines. Instead, the microcode address decoder permits multiple instructions to reference the same entries in the ROM. This decoding circuitry is similar to a PLA (Programmable Logic Array) so it matches bit patterns to determine a particular starting point. This turns out to be important for undocumented instructions since undocumented instructions often match the pattern for a "real" instruction, making the undocumented instruction an alias.

The 8086 has several internal registers that are invisible to the programmer but are used by the microcode. Memory accesses use the Indirect (IND) and Operand (OPR) registers; the IND register holds the address in the segment, while the OPR register holds the data value that is read or written. Although these registers are normally not accessible by the programmer, some undocumented instructions provide access to these registers, as will be described later.

The Arithmetic/Logic Unit (ALU) performs arithmetic, logical, and shift operations in the 8086. The ALU uses three internal registers: tmpA, tmpB, and tmpC. An ALU operation requires two micro-instructions. The first micro-instruction specifies the operation (such as ADD) and the temporary register that holds one argument (e.g. tmpA); the second argument is always in tmpB. A following micro-instruction can access the ALU result through the pseudo-register Ξ£ (sigma).

The ModR/M byte

A fundamental part of the 8086 instruction format is the ModR/M byte, a byte that specifies addressing for many instructions. The 8086 has a variety of addressing modes, so the ModR/M byte is somewhat complicated. Normally it specifies one memory address and one register. The memory address is specified through one of eight addressing modes (below) along with an optional 8- or 16-bit displacement in the instruction. Instead of a memory address, the ModR/M byte can also specify a second register. For a few opcodes, the ModR/M byte selects what instruction to execute rather than a register.

The 8086's addressing modes. From The register assignments, from MCS-86 Assembly Language Reference Guide.

The implementation of the ModR/M byte plays an important role in the behavior of undocumented instructions. Support for this byte is implemented in both microcode and hardware. The various memory address modes above are implemented by microcode subroutines, which compute the appropriate memory address and perform a read if necessary. The subroutine leaves the memory address in the IND register, and if a read is performed, the value is in the OPR register.

The hardware hides the ModR/M byte's selection of memory versus register, by making the value available through the pseudo-register M, while the second register is available through N. Thus, the microcode for an instruction doesn't need to know if the value was in memory or a register, or which register was selected. The Group Decode ROM examines the first byte of the instruction to determine if a ModR/M byte is present, and if a read is required. If the ModR/M byte specifies memory, the Translation ROM determines which micro-subroutines to call before handling the instruction itself. For more on the ModR/M byte, see my post on Reverse-engineering the ModR/M addressing microcode.

Holes in the opcode table

The first byte of the instruction is a value from 00 to FF in hex. Almost all of these opcode values correspond to documented 8086 instructions, but there are a few exceptions, "holes" in the opcode table. The table below shows the 256 first-byte opcodes for the 8086, from hex 00 to FF. Valid opcodes for the 8086 are in white; the colored opcodes are undefined and interesting to examine. Orange, yellow, and green opcodes were given meaning in the 80186, 80286, and 80386 respectively. The purple opcode is unusual: it was implemented in the 8086 and later processors but not documented.2 In this section, I'll examine the microcode for these opcode holes.

This table shows the 256 opcodes for the 8086, where the white ones are valid instructions. Click for a larger version.

This table shows the 256 opcodes for the 8086, where the white ones are valid instructions. Click for a larger version.

D6: SALC

The opcode D6 (purple above) performs a well-known but undocumented operation that is typically called SALC, for Set AL to Carry. This instruction sets the AL register to 0 if the carry flag is 0, and sets the AL register to FF if the carry flag is 1. The curious thing about this undocumented instruction is that it exists in all x86 CPUs, but Intel didn't mention it until 2017. Intel probably put this instruction into the processor deliberately as a copyright trap. The idea is that if a company created a copy of the 8086 processor and the processor included the SALC instruction, this would prove that the company had copied Intel's microcode and thus had potentially violated Intel's copyright on the microcode. This came to light when NEC created improved versions of the 8086, the NEC V20 and V30 microprocessors, and was sued by Intel. Intel analyzed NEC's microcode but was disappointed to find that NEC's chip did not include the hidden instruction, showing that NEC hadn't copied the microcode.3 Although a Federal judge ruled in 1989 that NEC hadn't infringed Intel's copyright, the 5-year trial ruined NEC's market momentum.

The SALC instruction is implemented with three micro-instructions, shown below.4 The first micro-instruction jumps if the carry (CY) is set. If not, the next instruction moves 0 to the AL register. RNI (Run Next Instruction) ends the microcode execution causing the next machine instruction to run. If the carry was set, all-ones (i.e. FF hex) is moved to the AL register and RNI ends the microcode sequence.

           JMPS CY 2 SALC: jump on carry
ZERO β†’ AL  RNI       Move 0 to AL, run next instruction
ONES β†’ AL  RNI       2:Move FF to AL, run next instruction

0F: POP CS

The 0F opcode is the first hole in the opcode table. The 8086 has instructions to push and pop the four segment registers, except opcode 0F is undefined where POP CS should be. This opcode performs POP CS successfully, so the question is why is it undefined? The reason is that POP CS is essentially useless and doesn't do what you'd expect, so Intel figured it was best not to document it.

To understand why POP CS is useless, I need to step back and explain the 8086's segment registers. The 8086 has a 20-bit address space, but 16-bit registers. To make this work, the 8086 has the concept of segments: memory is accessed in 64K chunks called segments, which are positioned in the 1-megabyte address space. Specifically, there are four segments: Code Segment, Stack Segment, Data Segment, and Extra Segment, with four segment registers that define the start of the segment: CS, SS, DS, and ES.

An inconvenient part of segment addressing is that if you want to access more than 64K, you need to change the segment register. So you might push the data segment register, change it temporarily so you can access a new part of memory, and then pop the old data segment register value off the stack. This would use the PUSH DS and POP DS instructions. But why not POP CS?

The 8086 executes code from the code segment, with the instruction pointer (IP) tracking the location in the code segment. The main problem with POP CS is that it changes the code segment, but not the instruction pointer, so now you are executing code at the old offset in a new segment. Unless you line up your code extremely carefully, the result is that you're jumping to an unexpected place in memory. (Normally, you want to change CS and the instruction pointer at the same time, using a CALL or JMP instruction.)

The second problem with POP CS is prefetching. For efficiency, the 8086 prefetches instructions before they are needed, storing them in an 6-byte prefetch queue. When you perform a jump, for instance, the microcode flushes the prefetch queue so execution will continue with the new instructions, rather than the old instructions. However, the instructions that pop a segment register don't flush the prefetch buffer. Thus, POP CS not only jumps to an unexpected location in memory, but it will execute an unpredictable number of instructions from the old code path.

The POP segment register microcode below packs a lot into three micro-instructions. The first micro-instruction pops a value from the stack. Specifically, it moves the stack pointer (SP) to the Indirect (IND) register. The Indirect register is an internal register, invisible to the programmer, that holds the address offset for memory accesses. The first micro-instruction also performs a memory read (R) from the stack segment (SS) and then increments IND by 2 (P2, plus 2). The second micro-instruction moves IND to the stack pointer, updating the stack pointer with the new value. It also tells the microcode engine that this micro-instruction is the next-to-last (NXT) and the next machine instruction can be started. The final micro-instruction moves the value read from memory to the appropriate segment register and runs the next instruction. Specifically, reads and writes put data in the internal OPR (Operand) register. The hardware uses the register N to indicate the register specified by the instruction. That is, the value will be stored in the CS, DS, ES, or SS register, depending on the bit pattern in the instruction. Thus, the same microcode works for all four segment registers. This is why POP CS works even though POP CS wasn't explicitly implemented in the microcode; it uses the common code.

SP β†’ IND  R SS,P2 POP sr: read from stack, compute IND plus 2
IND β†’ SP  NXT     Put updated value in SP, start next instruction.
OPR β†’ N   RNI     Put stack value in specified segment register

But why does POP CS run this microcode in the first place? The microcode to execute is selected based on the instruction, but multiple instructions can execute the same microcode. You can think of the address decoder as pattern-matching on the instruction's bit patterns, where some of the bits can be ignored. In this case, the POP sr microcode above is run by any instruction with the bit pattern 000??111, where a question mark can be either a 0 or a 1. You can verify that this pattern matches POP ES (07), POP SS (17), and POP DS (1F). However, it also matches 0F, which is why the 0F opcode runs the above microcode and performs POP CS. In other words, to make 0F do something other than POP CS would require additional circuitry, so it was easier to leave the action implemented but undocumented.

60-6F: conditional jumps

One whole row of the opcode table is unused: values 60 to 6F. These opcodes simply act the same as 70 to 7F, the conditional jump instructions.

The conditional jumps use the following microcode. It fetches the jump offset from the instruction prefetch queue (Q) and puts the value into the ALU's tmpBL register, the low byte of the tmpB register. It tests the condition in the instruction (XC) and jumps to the RELJMP micro-subroutine if satisfied. The RELJMP code (not shown) updates the program counter to perform the jump.

Q β†’ tmpBL                Jcond cb: Get offset from prefetch queue
           JMP XC RELJMP Test condition, if true jump to RELJMP routine
           RNI           No jump: run next instruction

This code is executed for any instruction matching the bit pattern 011?????, i.e. anything from 60 to 7F. The condition is specified by the four low bits of the instruction. The result is that any instruction 60-6F is an alias for the corresponding conditional jump 70-7F.

C0, C8: RET/RETF imm

These undocumented opcodes act like a return instruction, specifically RET imm16 (source). Specifically, the instruction C0 is the same as C2, near return, while C8 is the same as CA, far return.

The microcode below is executed for the instruction bits 1100?0?0, so it is executed for C0, C2, C8, and CA. It gets two bytes from the instruction prefetch queue (Q) and puts them in the tmpA register. Next, it calls FARRET, which performs either a near return (popping PC from the stack) or a far return (popping PC and CS from the stack). Finally, it adds the original argument to the SP, equivalent to popping that many bytes.

Q β†’ tmpAL    ADD tmpA    RET/RETF iw: Get word from prefetch, set up ADD
Q β†’ tmpAH    CALL FARRET Call Far Return micro-subroutine
IND β†’ tmpB               Move SP (in IND) to tmpB for ADD
Ξ£ β†’ SP       RNI         Put sum in Stack Pointer, end

One tricky part is that the FARRET micro-subroutine examines bit 3 of the instruction to determine whether it does a near return or a far return. This is why documented instruction C2 is a near return and CA is a far return. Since C0 and C8 run the same microcode, they will perform the same actions, a near return and a far return respectively.

C1: RET

The undocumented C1 opcode is identical to the documented C3, near return instruction. The microcode below is executed for instruction bits 110000?1, i.e. C1 and C3. The first micro-instruction reads from the Stack Pointer, incrementing IND by 2. Prefetching is suspended and the prefetch queue is flushed, since execution will continue at a new location. The Program Counter is updated with the value from the stack, read into the OPR register. Finally, the updated address is put in the Stack Pointer and execution ends.

SP β†’ IND  R SS,P2  RET:  Read from stack, increment by 2
          SUSP     Suspend prefetching
OPR β†’ PC  FLUSH    Update PC from stack, flush prefetch queue
IND β†’ SP  RNI      Update SP, run next instruction

C9: RET

The undocumented C9 opcode is identical to the documented CB, far return instruction. This microcode is executed for instruction bits 110010?1, i.e. C9 and CB, so C9 is identical to CB. The microcode below simply calls the FARRET micro-subroutine to pop the Program Counter and CS register. Then the new value is stored into the Stack Pointer. One subtlety is that FARRET looks at bit 3 of the instruction to switch between a near return and a far return, as described earlier. Since C9 and CB both have bit 3 set, they both perform a far return.

          CALL FARRET  RETF: call FARRET routine
IND β†’ SP  RNI          Update stack pointer, run next instruction

F1: LOCK prefix

The final hole in the opcode table is F1. This opcode is different because it is implemented in logic rather than microcode. The Group Decode ROM indicates that F1 is a prefix, one-byte logic, and LOCK. The Group Decode outputs are the same as F0, so F1 also acts as a LOCK prefix.

Holes in two-byte opcodes

For most of the 8086 instructions, the first byte specifies the instruction. However, the 8086 has a few instructions where the second byte specifies the instruction: the reg field of the ModR/M byte provides an opcode extension that selects the instruction.5 These fall into four categories which Intel labeled "Immed", "Shift", "Group 1", and "Group 2", corresponding to opcodes 80-83, D0-D3, F6-F7, and FE-FF. The table below shows how the second byte selects the instruction. Note that "Shift", "Group 1", and "Group 2" all have gaps, resulting in undocumented values.

Meaning of the reg field in two-byte opcodes. From MCS-86 Assembly Language Reference Guide.

Meaning of the reg field in two-byte opcodes. From MCS-86 Assembly Language Reference Guide.

These sets of instructions are implemented in two completely different ways. The "Immed" and "Shift" instructions run microcode in the standard way, selected by the first byte. For a typical arithmetic/logic instruction such as ADD, bits 5-3 of the first instruction byte are latched into the X register to indicate which ALU operation to perform. The microcode specifies a generic ALU operation, while the X register controls whether the operation is an ADD, SUB, XOR, or so forth. However, the Group Decode ROM indicates that for the special "Immed" and "Shift" instructions, the X register latches the bits from the second byte. Thus, when the microcode executes a generic ALU operation, it ends up with the one specified in the second byte.6

The "Group 1" and "Group 2" instructions (F0-F1, FE-FF), however, run different microcode for each instruction. Bits 5-3 of the second byte replace bits 2-0 of the instruction before executing the microcode. Thus, F0 and F1 act as if they are opcodes in the range F0-F7, while FE and FF act as if they are opcodes in the range F8-FF. Thus, each instruction specified by the second byte can have its own microcode, unlike the "Immed" and "Shift" instructions. The trick that makes this work is that all the "real" opcodes in the range F0-FF are implemented in logic, not microcode, so there are no collisions.

The hole in "Shift": SETMO, D0..D3/6

There is a "hole" in the list of shift operations when the second byte has the bits 110 (6). (This is typically expressed as D0/6 and so forth; the value after the slash is the opcode-selection bits in the ModR/M byte.) Internally, this value selects the ALU's SETMO (Set Minus One) operation, which simply returns FF or FFFF, for a byte or word operation respectively.7

The microcode below is executed for 1101000? bit patterns patterns (D0 and D1). The first instruction gets the value from the M register and sets up the ALU to do whatever operation was specified in the instruction (indicated by XI). Thus, the same microcode is used for all the "Shift" instructions, including SETMO. The result is written back to M. If no writeback to memory is required (NWB), then RNI runs the next instruction, ending the microcode sequence. However, if the result is going to memory, then the last line writes the value to memory.

M β†’ tmpB  XI tmpB, NXT  rot rm, 1: get argument, set up ALU
Ξ£ β†’ M     NWB,RNI F     Store result, maybe run next instruction
          W DS,P0 RNI   Write result to memory

The D2 and D3 instructions (1101001?) perform a variable number of shifts, specified by the CL register, so they use different microcode (below). This microcode loops the number of times specified by CL, but the control flow is a bit tricky to avoid shifting if the intial counter value is 0. The code sets up the ALU to pass the counter (in tmpA) unmodified the first time (PASS) and jumps to 4, which updates the counter and sets up the ALU for the shift operation (XI). If the counter is not zero, it jumps back to 3, which performs the previously-specified shift and sets up the ALU to decrement the counter (DEC). This time, the code at 4 decrements the counter. The loop continues until the counter reaches zero. The microcode stores the result as in the previous microcode.

ZERO β†’ tmpA               rot rm,CL: 0 to tmpA
CX β†’ tmpAL   PASS tmpA    Get count to tmpAL, set up ALU to pass through
M β†’ tmpB     JMPS 4       Get value, jump to loop (4)
Ξ£ β†’ tmpB     DEC tmpA F   3: Update result, set up decrement of count
Ξ£ β†’ tmpA     XI tmpB      4: update count in tmpA, set up ALU
             JMPS NZ 3    Loop if count not zero
tmpB β†’ M     NWB,RNI      Store result, maybe run next instruction
             W DS,P0 RNI  Write result to memory

The hole in "group 1": TEST, F6/1 and F7/1

The F6 and F7 opcodes are in "group 1", with the specific instruction specified by bits 5-3 of the second byte. The second-byte table showed a hole for the 001 bit sequence. As explained earlier, these bits replace the low-order bits of the instruction, so F6 with 001 is processed as if it were the opcode F1. The microcode below matches against instruction bits 1111000?, so F6/1 and F7/1 have the same effect as F6/0 and F7/1 respectively, that is, the byte and word TEST instructions.

The microcode below gets one or two bytes from the prefetch queue (Q); the L8 condition tests if the operation is an 8-bit (i.e. byte) operation and skips the second micro-instruction. The third micro-instruction ANDs the argument and the fetched value. The condition flags (F) are set based on the result, but the result itself is discarded. Thus, the TEST instruction tests a value against a mask, seeing if any bits are set.

Q β†’ tmpBL    JMPS L8 2     TEST rm,i: Get byte, jump if operation length = 8
Q β†’ tmpBH                  Get second byte from the prefetch queue
M β†’ tmpA     AND tmpA, NXT 2: Get argument, AND with fetched value
Ξ£ β†’ no dest  RNI F         Discard result but set flags.

I explained the processing of these "Group 3" instructions in more detail in my microcode article.

The hole in "group 2": PUSH, FE/7 and FF/7

The FE and FF opcodes are in "group 2", which has a hole for the 111 bit sequence in the second byte. After replacement, this will be processed as the FF opcode, which matches the pattern 1111111?. In other words, the instruction will be processed the same as the 110 bit pattern, which is PUSH. The microcode gets the Stack Pointer, sets up the ALU to decrement it by 2. The new value is written to SP and IND. Finally, the register value is written to stack memory.

SP β†’ tmpA  DEC2 tmpA   PUSH rm: set up decrement SP by 2
Ξ£ β†’ IND                Decremented SP to IND
Ξ£ β†’ SP                 Decremented SP to SP
M β†’ OPR    W SS,P0 RNI Write the value to memory, done

82 and 83 "Immed" group

Opcodes 80-83 are the "Immed" group, performing one of eight arithmetic operations, specified in the ModR/M byte. The four opcodes differ in the size of the values: opcode 80 applies an 8-bit immediate value to an 8-bit register, 81 applies a 16-bit value to a 16-bit register, 82 applies an 8-bit value to an 8-bit register, and 83 applies an 8-bit value to a 16-bit register. The opcode 82 has the strange situation that some sources say it is undocumented, but it shows up in some Intel documentation as a valid bit combination (e.g. below). Note that 80 and 82 have the 8-bit to 8-bit action, so the 82 opcode is redundant.

ADC is one of the instructions with opcode 80-83. From the 8086 datasheet, page 27.

ADC is one of the instructions with opcode 80-83. From the 8086 datasheet, page 27.

The microcode below is used for all four opcodes. If the ModR/M byte specifies memory, the appropriate micro-subroutine is called to compute the effective address in IND, and fetch the byte or word into OPR. The first two instructions below get the two immediate data bytes from the prefetch queue; for an 8-bit operation, the second byte is skipped. Next, the second argument M is loaded into tmpA and the desired ALU operation (XI) is configured. The result Ξ£ is stored into the specified register M and the operation may terminate with RNI. But if the ModR/M byte specified memory, the following write micro-operation saves the value to memory.

Q β†’ tmpBL  JMPS L8 2    alu rm,i: get byte, test if 8-bit op
Q β†’ tmpBH               Maybe get second byte
M β†’ tmpA   XI tmpA, NXT 2: 
Ξ£ β†’ M      NWB,RNI F    Save result, update flags, done if no memory writeback
           W DS,P0 RNI  Write result to memory if needed

The tricky part of this is the L8 condition, which tests if the operation is 8-bit. You might think that bit 0 acts as the byte/word bit in a nice, orthogonal way, but the 8086 has a bunch of special cases. Bit 0 of the instruction typically selects between a byte and a word operation, but there are a bunch of special cases. The Group Decode ROM creates a signal indicating if bit 0 should be used as the byte/word bit. But it generates a second signal indicating that an instruction should be forced to operate on bytes, for instructions such as DAA and XLAT. Another Group Decode ROM signal indicates that bit 3 of the instruction should select byte or word; this is used for the MOV instructions with opcodes Bx. Yet another Group Decode ROM signal indicates that inverted bit 1 of the instruction should select byte or word; this is used for a few opcodes, including 80-87.

The important thing here is that for the opcodes under discussion (80-83), the L8 micro-condition uses both bits 0 and 1 to determine if the instruction is 8 bits or not. The result is that only opcode 81 is considered 16-bit by the L8 test, so it is the only one that uses two immediate bytes from the instruction. However, the register operations use only bit 0 to select a byte or word transfer. The result is that opcode 83 has the unusual behavior of using an 8-bit immediate operand with a 16-bit register. In this case, the 8-bit value is sign-extended to form a 16-bit value. That is, the top bit of the 8-bit value fills the entire upper half of the 16-bit value, converting an 8-bit signed value to a 16-bit signed value (e.g. -1 is FF, which becomes FFFF). This makes sense for arithmetic operations, but not much sense for logical operations.

Intel documentation is inconsistent about which opcodes are listed for which instructions. Intel opcode maps generally define opcodes 80-83. However, lists of specific instructions show opcodes 80, 81, and 83 for arithmetic operations but only 80 and 81 for logical operations.8 That is, Intel omits the redundant 82 opcode as well as omitting logic operations that perform sign-extension (83).

More FE holes

For the "group 2" instructions, the FE opcode performs a byte operation while FF performs a word operation. Many of these operations don't make sense for bytes: CALL, JMP, and PUSH. (The only instructions supported for FE are INC and DEC.) But what happens if you use the unsupported instructions? The remainder of this section examines those cases and shows that the results are not useful.

CALL: FE/2

This instruction performs an indirect subroutine call within a segment, reading the target address from the memory location specified by the ModR/M byte.

The microcode below is a bit convoluted because the code falls through into the shared NEARCALL routine, so there is some unnecessary register movement. Before this microcode executes, the appropriate ModR/M micro-subroutine will read the target address from memory. The code below copies the destination address from M to tmpB and stores it into the PC later in the code to transfer execution. The code suspends prefetching, corrects the PC to cancel the offset from prefetching, and flushes the prefetch queue. Finally, it decrements the SP by two and writes the old PC to the stack.

M β†’ tmpB    SUSP        CALL rm: read value, suspend prefetch
SP β†’ IND    CORR        Get SP, correct PC
PC β†’ OPR    DEC2 tmpC   Get PC to write, set up decrement
tmpB β†’ PC   FLUSH       NEARCALL: Update PC, flush prefetch
IND β†’ tmpC              Get SP to decrement
Ξ£ β†’ IND                 Decremented SP to IND
Ξ£ β†’ SP      W SS,P0 RNI Update SP, write old PC to stack

This code will mess up in two ways when executed as a byte instruction. First, when the destination address is read from memory, only a byte will be read, so the destination address will be corrupted. (I think that the behavior here depends on the bus hardware. The 8086 will ask for a byte from memory but will read the word that is placed on the bus. Thus, if memory returns a word, this part may operate correctly. The 8088's behavior will be different because of its 8-bit bus.) The second issue is writing the old PC to the stack because only a byte of the PC will be written. Thus, when the code returns from the subroutine call, the return address will be corrupt.

CALL: FE/3

This instruction performs an indirect subroutine call between segments, reading the target address from the memory location specified by the ModR/M byte.

IND β†’ tmpC  INC2 tmpC    CALL FAR rm: set up IND+2
Ξ£ β†’ IND     R DS,P0      Read new CS, update IND
OPR β†’ tmpA  DEC2 tmpC    New CS to tmpA, set up SP-2
SP β†’ tmpC   SUSP         FARCALL: Suspend prefetch
Ξ£ β†’ IND     CORR         FARCALL2: Update IND, correct PC
CS β†’ OPR    W SS,M2      Push old CS, decrement IND by 2
tmpA β†’ CS   PASS tmpC    Update CS, set up for NEARCALL
PC β†’ OPR    JMP NEARCALL Continue with NEARCALL

As in the previous CALL, this microcode will fail in multiple ways when executed in byte mode. The new CS and PC addresses will be read from memory as bytes, which may or may not work. Only a byte of the old CS and PC will be pushed to the stack.

JMP: FE/4

This instruction performs an indirect jump within a segment, reading the target address from the memory location specified by the ModR/M byte. The microcode is short, since the ModR/M micro-subroutine does most of the work. I believe this will have the same problem as the previous CALL instructions, that it will attempt to read a byte from memory instead of a word.

        SUSP       JMP rm: Suspend prefetch
M β†’ PC  FLUSH RNI  Update PC with new address, flush prefetch, done

JMP: FE/5

This instruction performs an indirect jump between segments, reading the new PC and CS values from the memory location specified by the ModR/M byte. The ModR/M micro-subroutine reads the new PC address. This microcode increments IND and suspends prefetching. It updates the PC, reads the new CS value from memory, and updates the CS. As before, the reads from memory will read bytes instead of words, so this code will not meaningfully work in byte mode.

IND β†’ tmpC  INC2 tmpC   JMP FAR rm: set up IND+2
Ξ£ β†’ IND     SUSP        Update IND, suspend prefetch
tmpB β†’ PC   R DS,P0     Update PC, read new CS from memory
OPR β†’ CS    FLUSH RNI   Update CS, flush prefetch, done

PUSH: FE/6

This instruction pushes the register or memory value specified by the ModR/M byte. It decrements the SP by 2 and then writes the value to the stack. It will write one byte to the stack but decrements the SP by 2, so one byte of old stack data will be on the stack along with the data byte.

SP β†’ tmpA  DEC2 tmpA    PUSH rm: Set up SP decrement 
Ξ£ β†’ IND                 Decremented value to IND
Ξ£ β†’ SP                  Decremented value to SP
M β†’ OPR    W SS,P0 RNI  Write the data to the stack

Undocumented instruction values

The next category of undocumented instructions is where the first byte indicates a valid instruction, but there is something wrong with the second byte.

AAM: ASCII Adjust after Multiply

The AAM instruction is a fairly obscure one, designed to support binary-coded decimal arithmetic (BCD). After multiplying two BCD digits, you end up with a binary value between 0 and 81 (0Γ—0 to 9Γ—9). If you want a BCD result, the AAM instruction converts this binary value to BCD, for instance splitting 81 into the decimal digits 8 and 1, where the upper digit is 81 divided by 10, and the lower digit is 81 modulo 10.

The interesting thing about AAM is that the 2-byte instruction is D4 0A. You might notice that hex 0A is 10, and this is not a coincidence. There wasn't an easy way to get the value 10 in the microcode, so instead they made the instruction provide that value in the second byte. The undocumented (but well-known) part is that if you provide a value other than 10, the instruction will convert the binary input into digits in that base. For example, if you provide 8 as the second byte, the instruction returns the value divided by 8 and the value modulo 8.

The microcode for AAM, below, sets up the registers. calls the CORD (Core Division) micro-subroutine to perform the division, and then puts the results into AH and AL. In more detail, the CORD routine divides tmpA/tmpC by tmpB, putting the complement of the quotient in tmpC and leaving the remainder in tmpA. (If you want to know how CORD works internally, see my division post.) The important step is that the AAM microcode gets the divisor from the prefetch queue (Q). After calling CORD, it sets up the ALU to perform a 1's complement of tmpC and puts the result (Ξ£) into AH. It sets up the ALU to pass tmpA through unchanged, puts the result (Ξ£) into AL, and updates the flags accordingly (F).

Q β†’ tmpB                    AAM: Move byte from prefetch to tmpB
ZERO β†’ tmpA                 Move 0 to tmpA
AL β†’ tmpC    CALL CORD      Move AL to tmpC, call CORD.
             COM1 tmpC      Set ALU to complement
Ξ£ β†’ AH       PASS tmpA, NXT Complement AL to AH
Ξ£ β†’ AL       RNI F          Pass tmpA through ALU to set flags

The interesting thing is why this code has undocumented behavior. The 8086's microcode only has support for the constants 0 and all-1's (FF or FFFF), but the microcode needs to divide by 10. One solution would be to implement an additional micro-instruction and more circuitry to provide the constant 10, but every transistor was precious back then. Instead, the designers took the approach of simply putting the number 10 as the second byte of the instruction and loading the constant from there. Since the AAM instruction is not used very much, making the instruction two bytes long wasn't much of a drawback. But if you put a different number in the second byte, that's the divisor the microcode will use. (Of course you could add circuitry to verify that the number is 10, but then the implementation is no longer simple.)

Intel could have documented the full behavior, but that creates several problems. First, Intel would be stuck supporting the full behavior into the future. Second, there are corner cases to deal with, such as divide-by-zero. Third, testing the chip would become harder because all these cases would need to be tested. Fourth, the documentation would become long and confusing. It's not surprising that Intel left the full behavior undocumented.

AAD: ASCII Adjust before Division

The AAD instruction is analogous to AAM but used for BCD division. In this case, you want to divide a two-digit BCD number by something, where the BCD digits are in AH and AL. The AAD instruction converts the two-digit BCD number to binary by computing AHΓ—10+AL, before you perform the division.

The microcode for AAD is shown below. The microcode sets up the registers, calls the multiplication micro-subroutine CORX (Core Times), and then puts the results in AH and AL. In more detail, the multiplier comes from the instruction prefetch queue Q. The CORX routine multiples tmpC by tmpB, putting the result in tmpA/tmpC. Then the microcode adds the low BCD digit (AL) to the product (tmpB + tmpC), putting the sum (Ξ£) into AL, clearing AH and setting the status flags F appropriately.

One interesting thing is that the second-last micro-instruction jumps to AAEND, which is the last micro-instruction of the AAM microcode above. By reusing the micro-instruction from AAM, the microcode is one micro-instruction shorter, but the jump adds one cycle to the execution time. (The CORX routine is used for integer multiplication; I discuss the internals in this post.)

Q β†’ tmpC              AAD: Get byte from prefetch queue.
AH β†’ tmpB   CALL CORX Call CORX
AL β†’ tmpB   ADD tmpC  Set ALU for ADD
ZERO β†’ AH   JMP AAEND Zero AH, jump to AAEND
i
...
Ξ£ β†’ AL      RNI F     AAEND: Sum to AL, done.

As with AAM, the constant 10 is provided in the second byte of the instruction. The microcode accepts any value here, but values other than 10 are undocumented.

8C, 8E: MOV sr

The opcodes 8C and 8E perform a MOV register to or from the specified segment register, using the register specification field in the ModR/M byte. There are four segment registers and three selection bits, so an invalid segment register can be specified. However, the hardware that decodes the register number ignores instruction bit 5 for a segment register. Thus, specifying a segment register 4 to 7 is the same as specifying a segment register 0 to 3. For more details, see my article on 8086 register codes.

Unexpected REP prefix

REP IMUL / IDIV

The REP prefix is used with string operations to cause the operation to be repeated across a block of memory. However, if you use this prefix with an IMUL or IDIV instruction, it has the unexpected behavior of negating the product or the quotient (source).

The reason for this behavior is that the string operations use an internal flag called F1 to indicate that a REP prefix has been applied. The multiply and divide code reuses this flag to track the sign of the input values, toggling F1 for each negative value. If F1 is set, the value at the end is negated. (This handles "two negatives make a positive.") The consequence is that the REP prefix puts the flag in the 1 state when the multiply/divide starts, so the computed sign will be wrong at the end and the result is the negative of the expected result. The microcode is fairly complex, so I won't show it here; I explain it in detail in this blog post.

REP RET

Wikipedia lists REP RET (i.e. RET with a REP prefix) as a way to implement a two-byte return instruction. This is kind of trivial; the RET microcode (like almost every instruction) doesn't use the F1 internal flag, so the REP prefix has no effect.

REPNZ MOVS/STOS

Wikipedia mentions that the use of the REPNZ prefix (as opposed to REPZ) is undefined with string operations other than CMPS/SCAS. An internal flag called F1Z distinguishes between the REPZ and REPNZ prefixes. This flag is only used by CMPS/SCAS. Since the other string instructions ignore this flag, they will ignore the difference between REPZ and REPNZ. I wrote about string operations in more detail in this post.

Using a register instead of memory.

Some instructions are documented as requiring a memory operand. However, the ModR/M byte can specify a register. The behavior in these cases can be highly unusual, providing access to hidden registers. Examining the microcode shows how this happens.

LEA reg

Many instructions have a ModR/M byte that indicates the memory address that the instruction should use, perhaps through a complicated addressing mode. The LEA (Load Effective Address) instruction is different: it doesn't access the memory location but returns the address itself. The undocumented part is that the ModR/M byte can specify a register instead of a memory location. In that case, what does the LEA instruction do? Obviously it can't return the address of a register, but it needs to return something.

The behavior of LEA is explained by how the 8086 handles the ModR/M byte. Before running the microcode corresponding to the instruction, the microcode engine calls a short micro-subroutine for the particular addressing mode. This micro-subroutine puts the desired memory address (the effective address) into the tmpA register. The effective address is copied to the IND (Indirect) register and the value is loaded from memory if needed. On the other hand, if the ModR/M byte specified a register instead of memory, no micro-subroutine is called. (I explain ModR/M handling in more detail in this article.)

The microcode for LEA itself is just one line. It stores the effective address in the IND register into the specified destination register, indicated by N. This assumes that the appropriate ModR/M micro-subroutine was called before this code, putting the effective address into IND.

IND β†’ N   RNI  LEA: store IND register in destination, done

But if a register was specified instead of a memory location, no ModR/M micro-subroutine gets called. Instead, the LEA instruction will return whatever value was left in IND from before, typically the previous memory location that was accessed. Thus, LEA can be used to read the value of the IND register, which is normally hidden from the programmer.

LDS reg, LES reg

The LDS and LES instructions load a far pointer from memory into the specified segment register and general-purpose register. The microcode below assumes that the appropriate ModR/M micro-subroutine has set up IND and read the first value into OPR. The microcode updates the destination register, increments IND by 2, reads the next value, and updates DS. (The microcode for LES is a copy of this, but updates ES.)

OPR β†’ N               LDS: Copy OPR to dest register
IND β†’ tmpC  INC2 tmpC Set up incrementing IND by 2
Ξ£ β†’ IND     R DS,P0   Update IND, read next location
OPR β†’ DS    RNI       Update DS

If the LDS instruction specifies a register instead of memory, a micro-subroutine will not be called, so IND and OPR will have values from a previous instruction. OPR will be stored in the destination register, while the DS value will be read from the address IND+2. Thus, these instructions provide a mechanism to access the hidden OPR register.

JMP FAR rm

The JMP FAR rm instruction normally jumps to the far address stored in memory at the location indicated by the ModR/M byte. (That is, the ModR/M byte indicates where the new PC and CS values are stored.) But, as with LEA, the behavior is undocumented if the ModR/M byte specifies a register, since a register doesn't hold a four-byte value.

The microcode explains what happens. As with LEA, the code expects a micro-subroutine to put the address into the IND register. In this case, the micro-subroutine also loads the value at that address (i.e. the destination PC) into tmpB. The microcode increments IND by 2 to point to the CS word in memory and reads that into CS. Meanwhile, it updates the PC with tmpB. It suspends prefetching and flushes the queue, so instruction fetching will restart at the new address.

IND β†’ tmpC  INC2 tmpC   JMP FAR rm: set up to add 2 to IND
Ξ£ β†’ IND     SUSP        Update IND, suspend prefetching
tmpB β†’ PC   R DS,P0     Update PC with tmpB. Read new CS from specified address
OPR β†’ CS    FLUSH RNI   Update CS, flush queue, done

If you specify a register instead of memory, the micro-subroutine won't get called. Instead, the program counter will be loaded with whatever value was in tmpB and the CS segment register will be loaded from the memory location two bytes after the location that IND was referencing. Thus, this undocumented use of the instruction gives access to the otherwise-hidden tmpB register.

The end of undocumented instructions

Microprocessor manufacturers soon realized that undocumented instructions were a problem, since programmers find them and often use them. This creates an issue for future processors, or even revisions of the current processor: if you eliminate an undocumented instruction, previously-working code that used the instruction will break, and it will seem like the new processor is faulty.

The solution was for processors to detect undocumented instructions and prevent them from executing. By the early 1980s, processors had enough transistors (thanks to Moore's law) that they could include the circuitry to block unsupported instructions. In particular, the 80186/80188 and the 80286 generated a trap of type 6 when an unused opcode was executed, blocking use of the instruction.9 This trap is also known as #UD (Undefined instruction trap).10

Conclusions

The 8086, like many early microprocessors, has undocumented instructions but no traps to stop them from executing.11 For the 8086, these fall into several categories. Many undocumented instructions simply mirror existing instructions. Some instructions are implemented but not documented for one reason or another, such as SALC and POP CS. Other instructions can be used outside their normal range, such as AAM and AAD. Some instructions are intended to work only with a memory address, so specifying a register can have strange effects such as revealing the values of the hidden IND and OPR registers.

Keep in mind that my analysis is based on transistor-level simulation and examining the microcode; I haven't verified the behavior on a physical 8086 processor. Please let me know if you see any errors in my analysis or undocumented instructions that I have overlooked. Also note that the behavior could change between different versions of the 8086; in particular, some versions by different manufacturers (such as the NEC V20 and V30) are known to be different.

I plan to write more about the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. The 6502 processor, for instance, has illegal instructions with various effects, including causing the processor to hang. The article How MOS 6502 illegal opcodes really work describes in detail how the instruction decoding results in various illegal opcodes. Some of these opcodes put the internal bus into a floating state, so the behavior is electrically unpredictable. ↩

  2. The 8086 used up almost all the single-byte opcodes, which made it difficult to extend the instruction set. Most of the new instructions for the 386 or later are multi-byte opcodes, either using 0F as a prefix or reusing the earlier REP prefix (F3). Thus, the x86 instruction set is less efficient than it could be, since many single-byte opcodes were "wasted" on hardly-used instructions such as BCD arithmetic, forcing newer instructions to be multi-byte. ↩

  3. For details on the "magic instruction" hidden in the 8086 microcode, see NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright Editors page 49. I haven't found anything stating that SALC was the hidden instruction, but this is the only undocumented instruction that makes sense as something deliberately put into the microcode. The court case is complicated since NEC had a licensing agreement with Intel, so I'm skipping lots of details. See NEC v. Intel: Breaking new ground in the law of copyright for more. ↩

  4. The microcode listings are based on Andrew Jenner's disassembly. I have made some modifications to (hopefully) make it easier to understand. ↩

  5. Specifying the instruction through the ModR/M reg field may seem a bit random, but there's a reason for this. A typical instruction such as ADD has two arguments specified by the ModR/M byte. But other instructions such as shift instructions or NOT only take one argument. For these instructions, the ModR/M reg field would be wasted if it specified a register. Thus, using the reg field to specify instructions that only use one argument makes the instruction set more efficient. ↩

  6. Note that "normal" ALU operations are specified by bits 5-3 of the instruction; in order these are ADD, OR, ADC, SBB, AND, SUB, XOR, and CMP. These are exactly the same ALU operations that the "Immed" group performs, specified by bits 5-3 of the second byte. This illustrates how the same operation selection mechanism (the X register) is used in both cases. Bit 6 of the instruction switches between the set of arithmetic/logic instructions and the set of shift/rotate instructions. ↩

  7. As far as I can tell, SETMO isn't used by the microcode. Thus, I think that SETMO wasn't deliberately implemented in the ALU, but is a consequence of how the ALU's control logic is implemented. That is, all the even entries are left shifts and the odd entries are right shifts, so operation 6 activates the left-shift circuitry. But it doesn't match a specific left shift operation, so the ALU doesn't get configured for a "real" left shift. In other words, the behavior of this instruction is due to how the ALU handles a case that it wasn't specifically designed to handle.

    This function is implemented in the ALU somewhat similar to a shift left. However, instead of passing each input bit to the left, the bit from the right is passed to the left. That is, the input to bit 0 is shifted left to all of the bits of the result. By setting this bit to 1, all bits of the result are set, yielding the minus 1 result. ↩

  8. This footnote provides some references for the "Immed" opcodes. The 8086 datasheet has an opcode map showing opcodes 80 through 83 as valid. However, in the listings of individual instructions it only shows 80 and 81 for logical instructions (i.e. bit 1 must be 0), while it shows 80-83 for arithmetic instructions. The modern Intel 64 and IA-32 Architectures Software Developer's Manual is also contradictory. Looking at the instruction reference for AND (Vol 2A 3-78), for instance, shows opcodes 80, 81, and 83, explicitly labeling 83 as sign-extended. But the opcode map (Table A-2 Vol 2D A-7) shows 80-83 as defined except for 82 in 64-bit mode. The instruction bit diagram (Table B-13 Vol 2D B-7) shows 80-83 valid for the arithmetic and logical instructions. ↩

  9. The 80286 was more thorough about detecting undefined opcodes than the 80186, even taking into account the differences in instruction set. The 80186 generates a trap when 0F, 63-67, F1, or FFFF is executed. The 80286 generates invalid opcode exception number 6 (#UD) on any undefined opcode, handling the following cases:

    • The first byte of an instruction is completely invalid (e.g., 64H).
    • The first byte indicates a 2-byte opcode and the second byte is invalid (e.g., 0F followed by 0FFH).
    • An invalid register is used with an otherwise valid opcode (e.g., MOV CS,AX).
    • An invalid opcode extension is given in the REG field of the ModR/M byte (e.g., 0F6H /1).
    • A register operand is given in an instruction that requires a memory operand (e.g., LGDT AX).
     ↩
  10. In modern x86 processors, most undocumented instructions cause faults. However, there are still a few undocumented instructions that don't fault. These may be for internal use or corner cases of documented instructions. For details, see Breaking the x86 Instruction Set, a video from Black Hat 2017. ↩

  11. Several sources have discussed undocumented 8086 opcodes before. The article Undocumented 8086 Opcodes describes undocumented opcodes in detail. Wikipedia has a list of undocumented x86 instructions. The book Undocumented PC discusses undocumented instructions in the 8086 and later processors. This StackExchange Retrocomputing post describes undocumented instructions. These Hacker News comments discuss some undocumented instructions. There are other sources with more myth than fact, claiming that the 8086 treats undocumented instructions as NOPs, for instance. ↩

Reverse-engineering the 8086 processor's address and data pin circuits

8 July 2023 at 16:14

The Intel 8086 microprocessor (1978) started the x86 architecture that continues to this day. In this blog post, I'm focusing on a small part of the chip: the address and data pins that connect the chip to external memory and I/O devices. In many processors, this circuitry is straightforward, but it is complicated in the 8086 for two reasons. First, Intel decided to package the 8086 as a 40-pin DIP, which didn't provide enough pins for all the functionality. Instead, the 8086 multiplexes address, data, and status. In other words, a pin can have multiple roles, providing an address bit at one time and a data bit at another time.

The second complication is that the 8086 has a 20-bit address space (due to its infamous segment registers), while the data bus is 16 bits wide. As will be seen, the "extra" four address bits have more impact than you might expect. To summarize, 16 pins, called AD0-AD15, provide 16 bits of address and data. The four remaining address pins (A16-A19) are multiplexed for use as status pins, providing information about what the processor is doing for use by other parts of the system. You might expect that the 8086 would thus have two types of pin circuits, but it turns out that there are four distinct circuits, which I will discuss below.

The 8086 die under the microscope, with the main functional blocks and address pins labeled. Click this image (or any other) for a larger version.

The 8086 die under the microscope, with the main functional blocks and address pins labeled. Click this image (or any other) for a larger version.

The microscope image above shows the silicon die of the 8086. In this image, the metal layer on top of the chip is visible, while the silicon and polysilicon underneath are obscured. The square pads around the edge of the die are connected by tiny bond wires to the chip's 40 external pins. The 20 address pins are labeled: Pins AD0 through AD15 function as address and data pins. Pins A16 through A19 function as address pins and status pins.1 The circuitry that controls the pins is highlighted in red. Two internal busses are important for this discussion: the 20-bit AD bus (green) connects the AD pins to the rest of the CPU, while the 16-bit C bus (blue) communicates with the registers. These buses are connected through a circuit that can swap the byte order or shift the value. (The lines on the diagram are simplified; the real wiring twists and turns to fit the layout. Moreover, the C bus (blue) has its bits spread across the width of the register file.)

Segment addressing in the 8086

One goal of the 8086 design was to maintain backward compatibility with the earlier 8080 processor.2 This had a major impact on the 8086's memory design, resulting in the much-hated segment registers. The 8080 (like most of the 8-bit processors of the early 1970s) had a 16-bit address space, able to access 64K (65,536 bytes) of memory, which was plenty at the time. But due to the exponential growth in memory capacity described by Moore's Law, it was clear that the 8086 needed to support much more. Intel decided on a 1-megabyte address space, requiring 20 address bits. But Intel wanted to keep the 16-bit memory addresses used by the 8080.

The solution was to break memory into segments. Each segment was 64K long, so a 16-bit offset was sufficient to access memory in a segment. The segments were allocated in a 1-megabyte address space, with the result that you could access a megabyte of memory, but only in 64K chunks.3 Segment addresses were also 16 bits, but were shifted left by 4 bits (multiplied by 16) to support the 20-bit address space.

Thus, every memory access in the 8086 required a computation of the physical address. The diagram below illustrates this process: the logical address consists of the segment base address and the offset within the segment. The 16-bit segment register was shifted 4 bits and added to the 16-bit offset to yield the 20-bit physical memory address.

The segment register and the offset are added to create a 20-bit physical address.  From iAPX 86,88 User's Manual, page 2-13.

The segment register and the offset are added to create a 20-bit physical address. From iAPX 86,88 User's Manual, page 2-13.

This address computation was not performed by the regular ALU (Arithmetic/Logic Unit), but by a separate adder that was devoted to address computation. The address adder is visible in the upper-left corner of the die photo. I will discuss the address adder in more detail below.

The AD bus and the C Bus

The 8086 has multiple internal buses to move bits internally, but the relevant ones are the AD bus and the C bus. The AD bus is a 20-bit bus that connects the 20 address/data pins to the internal circuitry.4 A 16-bit bus called the C bus provides the connection between the AD bus, the address adder and some of the registers.5 The diagram below shows the connections. The AD bus can be connected to the 20 address pins through latches. The low 16 pins can also be used for data input, while the upper 4 pins can also be used for status output. The address adder performs the 16-bit addition necessary for segment arithmetic. Its output is shifted left by four bits (i.e. it has four 0 bits appended), producing the 20-bit result. The inputs to the adder are provided by registers, a constant ROM that holds small constants such as +1 or -2, or the C bus.

My reverse-engineered diagram showing how the AD bus and the C bus interact with the address pins.

My reverse-engineered diagram showing how the AD bus and the C bus interact with the address pins.

The shift/crossover circuit provides the interface between these two buses, handling the 20-bit to 16-bit conversion. The busses can be connected in three ways: direct, crossover, or shifted.6 The direct mode connects the 16 bits of the C bus to the lower 16 bits of the address/data pins. This is the standard mode for transferring data between the 8086's internal circuitry and the data pins. The crossover mode performs the same connection but swaps the bytes. This is typically used for unaligned memory accesses, where the low memory byte corresponds to the high register byte, or vice versa. The shifted mode shifts the 20-bit AD bus value four positions to the right. In this mode, the 16-bit output from the address adder goes to the 16-bit C bus. (The shift is necessary to counteract the 4-bit shift applied to the address adder's output.) Control circuitry selects the right operation for the shift/crossover circuit at the right time.7

Two of the registers are invisible to the programmer but play an important role in memory accesses. The IND (Indirect) register specifies the memory address; it holds the 16-bit memory offset in a segment. The OPR (Operand) register holds the data value.9 The IND and OPR registers are not accessed directly by the programmer; the microcode for a machine instruction moves the appropriate values to these registers prior to the write.

Overview of a write cycle

I hesitate to present a timing diagram, since I may scare off my readers, but the 8086's communication is designed around a four-step bus cycle. The diagram below shows simplified timing for a write cycle, when the 8086 writes to memory or an I/O device.8 The external bus activity is organized as four states, each one clock cycle long: T1, T2, T3, T4. These T states are very important since they control what happens on the bus. During T1, the 8086 outputs the address on the pins. During the T2, T3, and T4 states, the 8086 outputs the data word on the pins. The important part for this discussion is that the pins are multiplexed depending on the T-state: the pins provide the address during T1 and data during T2 through T4.

A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

A typical write bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

There two undocumented T states that are important to the bus cycle. The physical address is computed in the two clock cycles before T1 so the address will be available in T1. I give these "invisible" T states the names TS (start) and T0.

The address adder

The operation of the address adder is a bit tricky since the 16-bit adder must generate a 20-bit physical address. The adder has two 16-bit inputs: the B input is connected to the upper registers via the B bus, while the C input is connected to the C bus. The segment register value is transferred over the B bus to the adder during the second half of the TS state (that is, two clock cycles before the bus cycle becomes externally visible during T1). Meanwhile, the address offset is transferred over the C bus to the adder, but the adder's C input shifts the value four bits to the right, discarding the four low bits. (As will be explained later, the pin driver circuits latch these bits.) The adder's output is shifted left four bits and transferred to the AD bus during the second half of T0. This produces the upper 16 bits of the 20-bit physical memory address. This value is latched into the address output flip-flops at the start of T1, putting the computed address on the pins. To summarize, the 20-bit address is generated by storing the 4 low-order bits during T0 and then the 16 high-order sum bits during T1.

The address adder is not needed for segment arithmetic during T1 and T2. To improve performance, the 8086 uses the adder during this idle time to increment or decrement memory addresses. For instance, after popping a word from the stack, the stack pointer needs to be incremented by 2. The address adder can do this increment "for free" during T1 and T2, leaving the ALU available for other operations.10 Specifically, the adder updates the memory address in IND, incrementing it or decrementing it as appropriate. First, the IND value is transferred over the B bus to the adder during the second half of T1. Meanwhile, a constant (-3 to +2) is loaded from the Constant ROM and transferred to the adder's C input. The output from the adder is transferred to the AD bus during the second half of T2. As before, the output is shifted four bits to the left. However, the shift/crossover circuit between the AD bus and the C bus is configured to shift four bits to the right, canceling the adder's shift. The result is that the C bus gets the 16-bit sum from the adder, and this value is stored in the IND register.11 For more information on the implemenation of the address adder, see my previous blog post.

The pin driver circuit

Now I'll dive down to the hardware implementation of an output pin. When the 8086 chip communicates with the outside world, it needs to provide relatively high currents. The tiny logic transistors can't provide enough current, so the chip needs to use large output transistors. To fit the large output transistors on the die, they are constructed of multiple wide transistors in parallel.12 Moreover, the drivers use a somewhat unusual "superbuffer" circuit with two transistors: one to pull the output high, and one to pull the output low.13

The diagram below shows the transistor structure for one of the output pins (AD10), consisting of three parallel transistors between the output and +5V, and five parallel transistors between the output and ground. The die photo on the left shows the metal layer on top of the die. This shows the power and ground wiring and the connections to the transistors. The photo on the right shows the die with the metal layer removed, showing the underlying silicon and the polysilicon wiring on top. A transistor gate is formed where a polysilicon wire crosses the doped silicon region. Combined, the +5V transistors are equivalent to about 60 typical transistors, while the ground transistors are equivalent to about 100 typical transistors. Thus, these transistors provide substantially more current to the output pin.

Two views of the output transistors for a pin. The first shows the metal layer, while the second shows the polysilicon and silicon.

Two views of the output transistors for a pin. The first shows the metal layer, while the second shows the polysilicon and silicon.

Tri-state output driver

The output circuit for an address pin uses a tri-state buffer, which allows the output to be disabled by putting it into a high-impedance "tri-state" configuration. In this state, the output is not pulled high or low but is left floating. This capability allows the pin to be used for data input. It also allows external devices to device can take control of the bus, for instance, to perform DMA (direct memory access).

The pin is driven by two large MOSFETs, one to pull the output high and one to pull it low. (As described earlier, each large MOSFET is physically multiple transistors in parallel, but I'll ignore that for now.) If both MOSFETs are off, the output floats, neither on nor off.

Schematic diagram of a "typical" address output pin.

Schematic diagram of a "typical" address output pin.

The tri-state output is implemented by driving the MOSFETs with two "superbuffer"15 AND gates. If the enable input is low, both AND gates produce a low output and both output transistors are off. On the other hand, if enable is high, one AND gate will be on and one will be off. The desired output value is loaded into a flip-flop to hold it,14 and the flip-flop turns one of the output transistors on, driving the output pin high or low as appropriate. (Conveniently, the flip-flop provides the data output Q and the inverted data output Q.) Generally, the address pin outputs are enabled for T1-T4 of a write but only during T1 for a read.16

In the remainder of the discussion, I'll use the tri-state buffer symbol below, rather than showing the implementation of the buffer.

The output circuit, expressed with a tri-state buffer symbol.

The output circuit, expressed with a tri-state buffer symbol.

AD4-AD15

Pins AD4-AD15 are "typical" pins, avoiding the special behavior of the top and bottom pins, so I'll discuss them first. The behavior of these pins is that the value on the AD bus is latched by the circuit and then put on the output pin under the control of the enaable signal. The circuit has three parts: a multiplexer to select the output value, a flip-flop to hold the output value, and a tri-state driver to provide the high-current output to the pin. In more detail, the multiplexer selects either the value on the AD bus or the current output from the flip-flop. That is, the multiplexer can either load a new value into the flip-flop or hold the existing value.17 The flip-flop latches the input value on the falling edge of the clock, passing it to the output driver. If the enable line is high, the output driver puts this value on the corresponding address pin.

The output circuit for AD4-AD15 has a latch to hold the desired output value, an address or data bit.

The output circuit for AD4-AD15 has a latch to hold the desired output value, an address or data bit.

For a write, the circuit latches the address value on the bus during the second half of T0 and puts it on the pins during T1. During the second half of the T1 state, the data word is transferred from the OPR register over the C bus to the AD bus and loaded into the AD pin latches. The word is transferred from the latches to the pins during T2 and held for the remainder of the bus cycle.

AD0-AD3

The four low address bits have a more complex circuit because these address bits are latched from the bus before the address adder computes its sum, as described earlier. The memory offset (before the segment addition) will be on the C bus during the second half of TS and is loaded into the lower flip-flop. This flip-flop delays these bits for one clock cycle and then they are loaded into the upper flip-flop. Thus, these four pins pick up the offset prior to the addition, while the other pins get the result of the segment addition.

The output circuit for AD0-AD3 has a second latch to hold the low address bits before the address adder computes the sum.

The output circuit for AD0-AD3 has a second latch to hold the low address bits before the address adder computes the sum.

For data, the AD0-AD3 pins transfer data directly from the AD bus to the pin latch, bypassing the delay that was used to get the address bits. That is, the AD0-AD3 pins have two paths: the delayed path used for addresses during T0 and the direct path otherwise used for data. Thus, the multiplexer has three inputs: two for these two paths and a third loop-back input to hold the flip-flop value.

A16-A19: status outputs

The top four pins (A16-A19) are treated specially, since they are not used for data. Instead, they provide processor status during T2-T4.18 The pin latches for these pins are loaded with the address during T0 like the other pins, but loaded with status instead of data during T1. The multiplexer at the input to the latch selects the address bit during T0 and the status bit during T1, and holds the value otherwise. The schematic below shows how this is implemented for A16, A17, and A19.

The output circuit for AD16, AD17, and AD19 selects either an address output or a status output.

The output circuit for AD16, AD17, and AD19 selects either an address output or a status output.

Address pin A18 is different because it indicates the current status of the interrupt enable flag bit. This status is updated every clock cycle, unlike the other pins. To implement this, the pin has a different circuit that isn't latched, so the status can be updated continuously. The clocked transistors act as "pass transistors", passing the signal through when active. When a pass transistor is turned off, the following logic gate holds the previous value due to the capacitance of the wiring. Thus, the pass transistors provide a way of holding the value through the clock cycle. The flip-flops are implemented with pass transistors internally, so in a sense the circuit below is a flip-flop that has been "exploded" to provide a second path for the interrupt status.

The output circuit for AD18 is different from the rest so the I flag status can be updated every clock cycle.

The output circuit for AD18 is different from the rest so the I flag status can be updated every clock cycle.

Reads

A memory or I/O read also uses a 4-state bus cycle, slightly different from the write cycle. During T1, the address is provided on the pins, the same as for a write. After that, however, the output circuits are tri-stated so they float, allowing the external memory to put data on the bus. The read data on the pin is put on the AD bus at the start of the T4 state. From there, the data passes through the crossover circuit to the C bus. Normally the 16 data bits pass straight through to the C bus, but the bytes will be swapped if the memory access is unaligned. From the C bus, the data is written to the OPR register, a byte or a word as appropriate. (For an instruction prefetch, the word is written to a prefetch queue register instead.)

A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

A typical read bus cycle consists of four T states. Based on The 8086 Family Users Manual, B-16.

To support data input on the AD0-AD15 pins, they have a circuit to buffer the input data and transfer it to the AD bus. The incoming data bit is buffered by the two inverters and sampled when the clock is high. If the enable' signal is low, the data bit is transferred to the AD bus when the clock is low.19 The two MOSFETs act as a "superbuffer", providing enough current for the fairly long AD bus. I'm not sure what the capacitor accomplishes, maybe avoiding a race condition if the data pin changes just as the clock goes low.20

Schematic of the input circuit for the data pins.

Schematic of the input circuit for the data pins.

This circuit has a second role, precharging the AD bus high when the clock is low, if there's no data. Precharging a bus is fairly common in the 8086 (and other NMOS processors) because NMOS transistors are better at pulling a line low than pulling it high. Thus, it's often faster to precharge a line high before it's needed and then pull it low for a 0.21

Since pins A16-A19 are not used for data, they operate the same for reads as for writes: providing address bits and then status.

The pin circuit on the die

The diagram below shows how the pin circuitry appears on the die. The metal wiring has been removed to show the silicon and polysilicon. The top half of the image is the input circuitry, reading a data bit from the pin and feeding it to the AD bus. The lower half of the image is the output circuitry, reading an address or data bit from the AD bus and amplifying it for output via the pad. The light gray regions are doped, conductive silicon. The thin tan lines are polysilicon, which forms transistor gates where it crosses doped silicon.

The input/output circuitry for an address/data pin. The metal layer has been removed to show the underlying silicon and polysilicon. Some crystals have formed where the bond pad was.

The input/output circuitry for an address/data pin. The metal layer has been removed to show the underlying silicon and polysilicon. Some crystals have formed where the bond pad was.

A historical look at pins and timing

The number of pins on Intel chips has grown exponentially, more than a factor of 100 in 50 years. In the early days, Intel management was convinced that a 16-pin package was large enough for any integrated circuit. As a result, the Intel 4004 processor (1971) was crammed into a 16-pin package. Intel chip designer Federico Faggin22 describes 16-pin packages as a completely silly requirement that was throwing away performance, but the "God-given 16 pins" was like a religion at Intel. When Intel was forced to use 18 pins by the 1103 memory chip, it "was like the sky had dropped from heaven" and he had "never seen so many long faces at Intel." Although the 8008 processor (1972) was able to use 18 pins, this low pin count still harmed performance by forcing pins to be used for multiple purposes.

The Intel 8080 (1974) had a larger, 40-pin package that allowed it to have 16 address pins and 8 data pins. Intel stuck with this size for the 8086, even though competitors used larger packages with more pins.23 As processors became more complex, the 40-pin package became infeasible and the pin count rapidly expanded; The 80286 processor (1982) had a 68-pin package, while the i386 (1985) had 132 pins; the i386 needed many more pins because it had a 32-bit data bus and a 24- or 32-bit address bus. The i486 (1989) went to 196 pins while the original Pentium had 273 pins. Nowadays, a modern Core I9 processor uses the FCLGA1700 socket with a whopping 1700 contacts.

Looking at the history of Intel's bus timing, the 8086's complicated memory timing goes back to the Intel 8008 processor (1972). Instruction execution in the 8008 went through a specific sequence of timing states; each clock cycle was assigned a particular state number. Memory accesses took three cycles: the address was sent to memory during states T1 and T2, half of the address at a time since there were only 8 address pins. During state T3, a data byte was either transmitted to memory or read from memory. Instruction execution took place during T4 and T5. State signals from the 8008 chip indicated which state it was in.

The 8080 used an even more complicated timing system. An instruction consisted of one to five "machine cycles", numbered M1 through M5, where each machine cycle corresponded to a memory or I/O access. Each machine cycle consisted of three to five states, T1 through T5, similar to the 8008 states. The 8080 had 10 different types of machine cycle such as instruction fetch, memory read, memory write, stack read or write, or I/O read or write. The status bits indicated the type of machine cycle. The 8086 kept the T1 through T4 memory cycle. Because the 8086 decoupled instruction prefetching from execution, it no longer had explicit M machine cycles. Instead, it used status bits to indicate 8 types of bus activity such as instruction fetch, read data, or write I/O.

Conclusions

Well, address pins is another subject that I thought would be straightforward to explain but turned out to be surprisingly complicated. Many of the 8086's design decisions combine in the address pins: segmented addressing, backward compatibility, and the small 40-pin package. Moreover, because memory accesses are critical to performance, Intel put a lot of effort into this circuitry. Thus, the pin circuitry is tuned for particular purposes, especially pin A18 which is different from all the rest.

There is a lot more to say about memory accesses and how the 8086's Bus Interface Unit performs them. The process is very complicated, with interacting state machines for memory operation and instruction prefetches, as well as handling unaligned memory accesses. I plan to write more, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. In the discussion, I'll often call all the address pins "AD" pins for simplicity, even though pins 16-19 are not used for data. ↩

  2. The 8086's compatibility with the 8080 was somewhat limited since the 8086 had a different instruction set. However, Intel provided a conversion program called CONV86 that could convert 8080/8085 assembly code into 8086 assembly code that would usually work after minor editing. The 8086 was designed to make this process straightforward, with a mapping from the 8080's registers onto the 8086's registers, along with a mostly-compatible instruction set. (There were a few 8080 instructions that would be expanded into multiple 8086 instructions.) The conversion worked for straightforward code, but didn't work well with tricky, self-modifying code, for instance. ↩

  3. To support the 8086's segment architecture, programmers needed to deal with "near" and "far" pointers. A near pointer consisted of a 16-bit offset and could access 64K in a segment. A far pointer consisted of a 16-bit offset along with a 16-bit segment address. By modifying the segment register on each access, the full megabyte of memory could be accessed. The drawbacks were that far pointers were twice as big and were slower. ↩

  4. The 8086 patent provides a detailed architectural diagram of the 8086. I've extracted part of the diagram below. In most cases the diagram is accurate, but its description of the C bus doesn't match the real chip. There are some curious differences between the patent diagram and the actual implementation of the 8086, suggesting that the data pins were reorganized between the patent and the completion of the 8086. The diagram shows the address adder (called the Upper Adder) connected to the C bus, which is connected to the address/data pins. In particular, the patent shows the data pins multiplexed with the high address pins, while the low address pins A3-A0 are multiplexed with three status signals. The actual implementation of the 8086 is the other way around, with the data pins multiplexed with the low address pins while the high address pins A19-A16 are multiplexed with the status signals. Moreover, the patent doesn't show anything corresponding to what I call the AD bus; I made up that name. The moral is that while patents can be very informative, they can also be misleading.

    A diagram from patent US4449184 showing the connections to the address pins. This diagram does not match the actual chip. The diagram also shows the old segment register names: RC, RD, RS, and RA became CS, DS, SS, and ES.

    A diagram from patent US4449184 showing the connections to the address pins. This diagram does not match the actual chip. The diagram also shows the old segment register names: RC, RD, RS, and RA became CS, DS, SS, and ES.

     ↩

  5. The C bus is connected to the PC, OPR, and IND registers, as well as the prefetch queue, but is not connected to the segment registers. Two other buses (the ALU bus and the B bus) provide access to the segment registers. ↩

  6. Swapping the bytes on the data pins is required in a few cases. The 8086 has a 16-bit data bus, so transfers are usually a 16-bit word, copied directly between memory and a register. However, the 8086 also allows 8-bit operations, in which case either the top half or bottom half of the word is accessed. Loading an 8-bit value from the top half of a memory word into the bottom half of a register uses the crossover circuit. Another case is performing a 16-bit access to an "unaligned" address, that is, an odd address so the word crosses the normal word boundaries. From the programmer's perspective, an unaligned access is permitted (unlike many RISC processors), but the hardware converts this access into two 8-bit accesses, so the bus itself never handles an unaligned access.

    The 8086 has the ability to access a single memory byte out of a word, either for a byte operation or for an unaligned word operation. This behavior has some important consequences on the address pins. In particular, the low address pin AD0 doesn't behave like the rest of the address pins due to the special handling of odd addresses. Instead, this pin indicates which half of the word to transfer. The AD0 line is low (0) when the lower portion of the bus transfers a byte. Another pin, BHE (Bus High Enable) has a similar role for the upper half of the bus: it is low (0) if a byte is transferred over D15-D8. (Keep in mind that the 8086 is little-endian, so the low byte of the word is first in memory, at the even address.)

    The following table summarizes how BHE and A0 work together to select a byte or word. When accessing a byte at an odd address, A0 is odd as you might expect.

    Access typeBHEA0
    Word00
    Low byte10
    High byte01
     ↩
  7. The cbus-adbus-shift signal is activated during T2, when a memory index is being updated, either the instruction pointer or the IND register. The address adder is used to update the register and the shift undoes the 4-bit left shift applied to the adder's output. The shift is also used for the CORR micro-instruction, which corrects the instruction pointer to account for prefetching. The CORR micro-instruction generates a "fake" short bus cycle in which the constant ROM and the address adder are used during T0. I discuss the CORR micro-instruction in more detail in this post. ↩

  8. I've made the timing diagram somewhat idealized so actions line up with the clock. In the real datasheet, all the signals are skewed by various amounts so the timing is more complicated. Moreover, if the memory device is slow, it can insert "wait" states between T3 and T4. (Cheap memory was slower and would need wait states.) Moreover, actions don't exactly line up with the clock. I'm also omitting various control signals. The datasheet has pages of timing constraints on exactly when signals can change. ↩

  9. Instruction prefetches don't use the IND and OPR registers. Instead, the address is specified by the Instruction Pointer (or Program Counter), and the data is stored directly into one of the instruction prefetch registers. ↩

  10. A single memory operation takes six clock cycles: two preparatory cycles to compute the address before the four visible cycles. However, if multiple memory operations are performed, the operations are overlapped to achieve a degree of pipelining. Specifically, the address calculation for the next memory operation takes place during the last two clock cycles of the current memory operation, saving two clock cycles. That is, for consecutive bus cycles, T3 and T4 overlap with TS and T0 of the next cycle. In other words, during T3 and T4 of one bus cycle, the memory address gets computed for the next bus cycle. This pipelining improves performance, compared to taking 6 clock cycles for each bus cycle. ↩

  11. The POP operation is an example of how the address adder updates a memory pointer. In this case, the stack address is moved from the Stack Pointer to the IND register in order to perform the memory read. As part of the read operation, the IND register is incremented by 2. The address is then moved from the IND register to the Stack Pointer. Thus, the address adder not only performs the segment arithmetic, but also computes the new value for the SP register.

    Note that the increment/decrement of the IND register happens after the memory operation. For stack operations, the SP must be decremented before a PUSH and incremented after a POP. The adder cannot perform a predecrement, so the PUSH instruction uses the ALU (Arithmetic/Logic Unit) to perform the decrement. ↩

  12. The current from an MOS transistor is proportional to the width of the gate divided by the length (the W/L ratio). Since the minimum gate length is set by the manufacturing process, increasing the width of the gate (and thus the overall size of the transistor) is how the transistor's current is increased. ↩

  13. Using one transistor to pull the output high and one to pull the output low is normal for CMOS gates, but it is unusual for NMOS chips like the 8086. A normal NMOS gate only has active transistor to pull the output low and uses a depletion-mode transistor to provide a weak pull-up current, similar to a pull-up resistor. I discuss superbuffers in more detail here. ↩

  14. The flip-flop is controlled by the inverted clock signal, so the output will change when the clock goes low. Meanwhile, the enable signal is dynamically latched by a MOSFET, also controlled by the inverted clock. (When the clock goes high, the previous value will be retained by the gate capacitance of the inverter.) ↩

  15. The superbuffer AND gates are constructed on the same principle as the regular superbuffer, except with two inputs. Two transistors in series pull the output high if both inputs are high. Two transistors in parallel pull the output low if either input is low. The low-side transistors are driven by inverted signals. I haven't drawn these signals on the schematic to simplify it.

    The superbuffer AND gates use large transistors, but not as large as the output transistors, providing an intermediate amplification stage between the small internal signals and the large external signals. Because of the high capacitance of the large output transistors, they need to be driven with larger signals. There's a lot of theory behind how transistor sizes should be scaled for maximum performance, described in the book Logical Effort. Roughly speaking, for best performance when scaling up a signal, each stage should be about 3 to 4 times as large as the previous one, so a fairly large number of stages are used (page 21). The 8086 simplifies this with two stages, presumably giving up a bit of performance in exchange for keeping the drivers smaller and simpler. ↩

  16. The enable circuitry has some complications. For instance, I think the address pins will be enabled if a cycle was going to be T1 for a prefetch but then got preempted by a memory operation. The bus control logic is fairly complicated. ↩

  17. The multiplexer is implemented with pass transistors, rather than gates. One of the pass transistors is turned on to pass that value through to the multiplexer's output. The flip-flop is implemented with two pass transistors and two inverters in alternating order. The first pass transistor is activated by the clock and the second by the complemented clock. When a pass transistor is off, its output is held by the gate capacitance of the inverter, somewhat like dynamic RAM. This is one reason that the 8086 has a minimum clock speed: if the clock is too slow, these capacitively-held values will drain away. ↩

  18. The status outputs on the address pins are defined as follows: A16/S3, A17/S4: these two status lines indicate which relocation register is being used for the memory access, i.e. the stack segment, code segment, data segment, or alternate segment. Theoretically, a system could use a different memory bank for each segment and increase the total memory capacity to 4 megabytes.
    A18/S5: indicates the status of the interrupt enable bit. In order to provide the most up-to-date value, this pin has a different circuit. It is updated at the beginning of each clock cycle, so it can change during a bus cycle. The motivation for this is presumably so peripherals can determine immediately if the interrupt enable status changes.
    A19/S6: the documentation calls this a status output, even though it always outputs a status of 0. ↩

  19. For a read, the enable signal is activated at the end of T3 and the beginning of T4 to transfer the data value to the AD bus. The signal is gated by the READY pin, so the read doesn't happen until the external device is ready. The 8086 will insert Tw wait states in that case. ↩

  20. The datasheet says that a data value must be held steady for 10 nanoseconds (TCLDX) after the clock goes low at the start of T4. ↩

  21. The design of the AD bus is a bit unusual since the adder will put a value on the AD bus when the clock is high, while the data pin will put a value on the AD bus when the clock is low (while otherwise precharging it when the clock is low). Usually the bus is precharged during one clock phase and all users of the bus pull it low (for a 0) during the other phase. ↩

  22. Federico Faggin's oral history is here. The relevant part is on pages 55 and 56. ↩

  23. The Texas Instruments TMS9900 (1976) used a 64-pin package for instance, as did the Motorola 68000 (1979). ↩

The Group Decode ROM: The 8086 processor's first step of instruction decoding

13 May 2023 at 22:42

A key component of any processor is instruction decoding: analyzing a numeric opcode and figuring out what actions need to be taken. The Intel 8086 processor (1978) has a complex instruction set, making instruction decoding a challenge. The first step in decoding an 8086 instruction is something called the Group Decode ROM, which categorizes instructions into about 35 types that control how the instruction is decoded and executed. For instance, the Group Decode ROM determines if an instruction is executed in hardware or in microcode. It also indicates how the instruction is structured: if the instruction has a bit specifying a byte or word operation, if the instruction has a byte that specifies the addressing mode, and so forth.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The diagram above shows the position of the Group Decode ROM on the silicon die, as well as other key functional blocks. The 8086 chip is partitioned into a Bus Interface Unit that communicates with external components such as memory, and the Execution Unit that executes instructions. Machine instructions are fetched from memory by the Bus Interface Unit and stored in the prefetch queue registers, which hold 6 bytes of instructions. To execute an instruction, the queue bus transfers an instruction byte from the prefetch queue to the instruction register, under control of a state machine called the Loader. Next, the Group Decode ROM categorizes the instruction according to its structure. In most cases, the machine instruction is implemented in low-level microcode. The instruction byte is transferred to the Microcode Address Register, where the Microcode Address Decoder selects the appropriate microcode routine that implements the instruction. The microcode provides the micro-instructions that control the Arithmetic/Logic Unit (ALU), registers, and other components to execute the instruction.

In this blog post, I will focus on a small part of this process: how the Group Decode ROM decodes instructions. Be warned that this post gets down into the weeds, so you might want to start with one of my higher-level posts, such as how the 8086's microcode engine works.

Microcode

Most instructions in the 8086 are implemented in microcode. Most people think of machine instructions as the basic steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the CPU's control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.

Microcode is only used if the Group Decode ROM indicates that the instruction is implemented in microcode. In that case, the microcode address register is loaded with the instruction and the address decoder selects the appropriate microcode routine. However, there's a complication. If the second byte of the instruction is a Mod R/M byte, the Group Decode ROM indicates this and causes a memory addressing micro-subroutine to be called.

Some simple instructions are implemented entirely in hardware and don't use microcode. These are known as 1-byte logic instructions (1BL) and are also indicated by the Group Decode ROM.

The Group Decode ROM's structure

The Group Decode ROM takes an 8-bit instruction as input, along with an interrupt signal. It produces 15 outputs that control how the instruction is handled. In this section I'll discuss the physical implementation of the Group Decode ROM; the various outputs are discussed in a later section.

Although the Group Decode ROM is called a ROM, its implementation is really a PLA (Programmable Logic Array), two levels of highly-structured logic gates.1 The idea of a PLA is to create two levels of NOR gates, each in a grid. This structure has the advantages that it implements the logic densely and is easy to modify. Although physically two levels of NOR gates, a PLA can be thought of as an AND layer followed by an OR layer. The AND layer matches particular bit patterns and then the OR layer combines multiple values from the first layer to produce arbitrary outputs.

The Group Decode ROM. This photo shows the metal layer on top of the die.

The Group Decode ROM. This photo shows the metal layer on top of the die.

Since the output values are highly structured, a PLA implementation is considerably more efficient than a ROM, since in a sense it combines multiple entries. In the case of the Group Decode ROM, using a ROM structure would require 256 columns (one for each 8-bit instruction pattern), while the PLA implementation requires just 36 columns, about 1/7 the size.

The diagram below shows how one column of the Group Decode ROM is wired in the "AND" plane. In this die photo, I removed the metal layer with acid to reveal the polysilicon and silicon underneath. The vertical lines show where the metal line for ground and the column output had been. The basic idea is that each column implements a NOR gate, with a subset of the input lines selected as inputs to the gate. The pull-up resistor at the top pulls the column line high by default. But if any of the selected inputs are high, the corresponding transistor turns on, connecting the column line to ground and pulling it low. Thus, this implements a NOR gate. However, it is more useful to think of it as an AND of the complemented inputs (via De Morgan's Law): if all the inputs are "correct", the output is high. In this way, each column matches a particular bit pattern.

Closeup of a column in the Group Decode ROM.

Closeup of a column in the Group Decode ROM.

The structure of the ROM is implemented through the silicon doping pattern, which is visible above. A transistor is formed where a polysilicon wire crosses a doped silicon region: the polysilicon forms the gate, turning the transistor on or off. At each intersection point, a transistor can be created or not, depending on the doping pattern. If a particular transistor is created, then the corresponding input must be 0 to produce a high output.

At the top of the diagram above, the column outputs are switched from the metal layer to polysilicon wires and become the inputs to the upper "OR" plane. This plane is implemented in a similar fashion as a grid of NOR gates. The plane is rotated 90 degrees, with the inputs vertical and each row forming an output.

Intermediate decoding in the Group Decode ROM

The first plane of the Group Decode ROM categorizes instructions into 36 types based on the instruction bit pattern.2 The table below shows the 256 instruction values, colored according to their categorization.3 For instance, the first blue block consists of the 32 ALU instructions corresponding to the bit pattern 00XXX0XX, where X indicates that the bit can be 0 or 1. These instructions are all decoded and executed in a similar way. Almost all instructions have a single category, that is, they activate a single column line in the Group Decode ROM. However, a few instructions activate two lines and have two colors below.

Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.

Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.

Note that the instructions do not have arbitrary numeric opcodes, but are assigned in a way that makes decoding simpler. Because these blocks correspond to bit patterns, there is little flexibility. One of the challenges of instruction set design for early microprocessors was to assign numeric values to the opcodes in a way that made decoding straightforward. It's a bit like a jigsaw puzzle, fitting the instructions into the 256 available values, while making them easy to decode.

Outputs from the Group Decode ROM

The Group Decode ROM has 15 outputs, one for each row of the upper half. In this section, I'll briefly discuss these outputs and their roles in the 8086. For an interactive exploration of these signals, see this page, which shows the outputs that are triggered by each instruction.

Out 0 indicates an IN or OUT instruction. This signal controls the M/IO (S2) status line, which distinguishes between a memory read/write and an I/O read/write. Apart from this, memory and I/O accesses are basically the same.

Out 1 indicates (inverted) that the instruction has a Mod R/M byte and performs a read/modify/write on its argument. This signal is used by the Translation ROM when dispatching an address handler (details). (This signal distinguishes between, say, ADD [AX],BX and MOV [AX],BX. The former both reads and writes [AX], while the latter only writes to it.)

Out 2 indicates a "group 3/4/5" opcode, an instruction where the second byte specifies the particular instruction, and thus decoding needs to wait for the second byte. This controls the loading of the microcode address register.

Out 3 indicates an instruction prefix (segment, LOCK, or REP). This causes the next byte to be decoded as a new instruction, while blocking interrupt handling.

Out 4 indicates (inverted) a two-byte ROM instruction (2BR), i.e. an instruction is handled by the microcode ROM, but requires the second byte for decoding. This is an instruction with a Mod R/M byte. This signal controls the loader indicating that it needs to fetch the second byte. This signal is almost the same as output 1 with a few differences.

Out 5 specifies the top bit for an ALU operation. The 8086 uses a 5-bit field to specify an ALU operation. If not specified explicitly by the microcode, the field uses bits 5 through 3 of the opcode. (These bits distinguish, say, an ADD instruction from AND or SUB.) This control line sets the top bit of the ALU field for instructions such as DAA, DAS, AAA, AAS, INC, and DE that fall into a different set from the "regular" ALU instructions.

Out 6 indicates an instruction that sets or clears a condition code directly: CLC, STC, CLI, STI, CLD, or STD (but not CMC). This signal is used by the flag circuitry to update the condition code.

Out 7 indicates an instruction that uses the AL or AX register, depending on the instruction's size bit. (For instance MOVSB vs MOVSW.) This signal is used by the register selection circuitry, the M register specifically.

Out 8 indicates a MOV instruction that uses a segment register. This signal is used by the register selection circuitry, the N register specifically.

Out 9 indicates the instruction has a d bit, where bit 1 of the instruction swaps the source and destination. This signal is used by the register selection circuitry, swapping the roles of the M and N registers according to the d bit.

Out 10 indicates a one-byte logic (1BL) instruction, a one-byte instruction that is implemented in logic, not microcode. These instructions are the prefixes, HLT, and the condition-code instructions. This signal controls the loader, causing it to move to the next instruction.

Out 11 indicates instructions where bit 0 is the byte/word indicator. This signal controls the register handling and the ALU functionality.

Out 12 indicates an instruction that operates only on a byte: DAA, DAS, AAA, AAS, AAM, AAD, and XLAT. This signal operates in conjunction with the previous output to select a byte versus word.

Out 13 forces the instruction to use a byte argument if instruction bit 1 is set, overriding the regular byte/word pattern. Specifically, it forces the L8 (length 8 bits) condition for the JMP direct-within-segment and the ALU instructions that are immediate with sign extension (details).

Out 14 allows a carry update. This prevents the carry from being updated by the INC and DEC operations. This signal is used by the flag circuitry.

Columns

Most of the Group Decode ROM's column signals are used to derive the outputs listed above. However, some column outputs are also used as control signals directly. These are listed below.

Column 10 indicates an immediate MOV instruction. These instructions use instruction bit 3 (rather than bit 1) to select byte versus word, because the three low bits specify the register. This signal affects the L8 condition described earlier and also causes the M register selection to be converted from a word register to a byte register if necessary.

Column 12 indicates an instruction with bits 5-3 specifying the ALU instruction. This signal causes the X register to be loaded with the bits in the instruction that specify the ALU operation. (To be precise, this signal prevents the X register from being reloaded from the second instruction byte.)

Column 13 indicates the CMC (Complement Carry) instruction. This signal is used by the flags circuitry to complement the carry flag (details).

Column 14 indicates the HLT (Halt) instruction. This signal stops instruction processing by blocking the instruction queue.

Column 31 indicates a REP prefix. This signal causes the REPZ/NZ latch to be loaded with instruction bit 0 to indicate if the prefix is REPNZ or REPZ. It also sets the REP latch.

Column 32 indicates a segment prefix. This signal loads the segment latches with the desired segment type.

Column 33 indicates a LOCK prefix. It sets the LOCK latch, locking the bus.

Column 34 indicates a CLI instruction. This signal immediately blocks interrupt handling to avoid an interrupt between the CLI instruction and when the interrupt flag bit is cleared.

Timing

One important aspect of the Group Decode ROM is that its outputs are not instantaneous. It takes a clock cycle to get the outputs from the Group Decode ROM. In particular, when instruction decoding starts, the timing signal FC (First Clock) is activated to indicate the first clock cycle. However, the Group Decode ROM's outputs are not available until the Second Clock SC.

One consequence of this is that even the simplest instruction (such as a flag operation) takes two clock cycles, as does a prefix. The problem is that even though the instruction could be performed in one clock cycle, it takes two clock cycles for the Group Decode ROM to determine that the instruction only needs one cycle. This illustrates how a complex instruction format impacts performance.

The FC and SC timing signals are generated by a state machine called the Loader. These signals may seem trivial, but there are a few complications. First, the prefetch queue may run empty, in which case the FC and/or SC signal is delayed until the prefetch queue has a byte available. Second, to increase performance, the 8086 can start decoding an instruction during the last clock cycle of the previous instruction. Thus, if the microcode indicates that there is one cycle left, the Loader can proceed with the next instruction. Likewise, for a one-byte instruction implemented in hardware (one-byte logic or 1BL), the loader proceeds as soon as possible.

The diagram below shows the timing of an ADD instruction. Each line is half of a clock cycle. Execution is pipelined: the instruction is fetched during the first clock cycle (First Clock). During Second Clock, the Group Decode ROM produces its output. The microcode address register also generates the micro-address for the instruction's microcode. The microcode ROM supplies a micro-instruction during the third clock cycle and execution of the micro-instruction takes place during the fourth clock cycle.

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character Β΅ is short for "micro".

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character Β΅ is short for "micro".

The Group Decode ROM's outputs during Second Clock control the decoding. Most importantly, the ADD imm instruction used microcode; it is not a one-byte logic instruction (1BL). Moreover, it does not have a Mod R/M byte, so it does not need two bytes for decoding (2BR). For a 1BL instruction, microcode execution would be blocked and the next instruction would be immediately fetched. On the other hand, for a 2BR instruction, the loader would tell the prefetch queue that it was done with the second byte during the second half of Second Clock. Microcode execution would be blocked during the third cycle and the fourth cycle would execute a microcode subroutine to determine the memory address.

For more details, see my article on the 8086 pipeline.

Interrupts

The Group Decode ROM takes the 8 bits of the instruction as inputs, but it has an additional input indicating that an interrupt is being handled. This signal blocks most of the Group Decode ROM outputs. This prevents the current instruction's outputs from interfering with interrupt handling. I wrote about the 8086's interrupt handling in detail here, so I won't go into more detail in this post.

Conclusions

The Group Decode ROM indicates one of the key differences between CISC processors (Complex Instruction Set Computer) such as the 8086 and the RISC processors (Reduced Instruction Set Computer) that became popular a few years later. A RISC instruction set is designed to make instruction decoding very easy, with a small number of uniform instruction forms. On the other hand, the 8086's CISC instruction set was designed for compactness and high code density. As a result, instructions are squeezed into the available opcode space. Although there is a lot of structure to the 8086 opcodes, this structure is full of special cases and any patterns only apply to a subset of the instructions. The Group Decode ROM brings some order to this chaotic jumble of instructions, and the number of outputs from the Group Decode ROM is a measure of the instruction set's complexity.

The 8086's instruction set was extended over the decades to become the x86 instruction set in use today. During that time, more layers of complexity were added to the instruction set. Now, an x86 instruction can be up to 15 bytes long with multiple prefixes. Some prefixes change the register encoding or indicate a completely different instruction set such as VEX (Vector Extensions) or SSE (Streaming SIMD Extensions). Thus, x86 instruction decoding is very difficult, especially when trying to decode multiple instructions in parallel. This has an impact in modern systems, where x86 processors typically have 4 complex instruction decoders while Apple's ARM processors have 8 simpler decoders; this is said to give Apple a performance benefit. Thus, architectural decisions from 45 years ago are still impacting the performance of modern processors.

I've written numerous posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space. Thanks to Arjan Holscher for suggesting this topic.

Notes and references

  1. You might wonder what the difference is between a ROM and a PLA. Both of them produce arbitrary outputs for a set of inputs. Moreover, you can replace a PLA with a ROM or vice versa. Typically a ROM has all the input combinations decoded, so it has a separate row for each input value, i.e. 2N rows. So you can think of a ROM as a fully-decoded PLA.

    Some ROMs are partially decoded, allowing identical rows to be combined and reducing the size of the ROM. This technique is used in the 8086 microcode, for instance. A partially-decoded ROM is fairly similar to a PLA, but the technical distinction is that a ROM has only one output row active at a time, while a PLA can have multiple output rows active and the results are OR'd together. (This definition is from The Architecture of Microprocessors p117.)

    The Group Decode ROM, however, has a few cases where multiple rows are active at the same time (for instance the segment register POP instructions). Thus, the Group Decode ROM is technically a PLA and not a ROM. This distinction isn't particularly important, but you might find it interesting. ↩

  2. The Group Decode ROM has 38 columns, but two columns (11 and 35) are unused. Presumably, these were provided as spares in case a bug fix or modification required additional decoding. ↩

  3. Like the 8008 and 8080, the 8086's instruction set was designed around a 3-bit octal structure. Thus, the 8086 instruction set makes much more sense if viewed in octal instead of hexadecimal. The table below shows the instructions with an octal organization. Each 8Γ—8 block uses the two low octal digits, while the four large blocks are positioned according to the top octal digit (labeled). As you can see, the instruction set has a lot of structure that is obscured in the usual hexadecimal table.

    The 8086 instruction set, put in a table according to the octal opcode value.

    The 8086 instruction set, put in a table according to the octal opcode value.

    For details on the octal structure of the 8086 instruction set, see The 80x86 is an Octal Machine. ↩

❌
❌