

CS161 – Design and Architecture of Computer Systems Multi-Cycle CPU Design

UNIVERSITY OF CALIFORNIA, RIVERSIDE

## Single Cycle Implementation



- > Calculate cycle time assuming negligible delays except:
  - memory (2ns), ALU and adders (2ns), register file access (1ns)





| Inst.<br>Type |                      | Functional Units Used |     |                   |                |  |  |  |  |  |
|---------------|----------------------|-----------------------|-----|-------------------|----------------|--|--|--|--|--|
| R-type        | Instruction<br>fetch | Register<br>read      | ALU | Register<br>write |                |  |  |  |  |  |
| Load          | Instruction<br>fetch | Register<br>read      | ALU | Memory<br>access  | Register write |  |  |  |  |  |
| Store         | Instruction<br>fetch | Register<br>read      | ALU | Memory<br>access  |                |  |  |  |  |  |
| Branch        | Instruction<br>fetch | Register<br>read      | ALU |                   |                |  |  |  |  |  |
| Jump          | Instruction<br>fetch |                       |     |                   |                |  |  |  |  |  |

#### Single Cycle – How long is the cycle?

| Inst.<br>Type | Inst.<br>Mem. | Reg. File<br>(read) | ALU<br>(s) | Data<br>Mem. | Reg. File<br>(write) | Total | Inst. % |
|---------------|---------------|---------------------|------------|--------------|----------------------|-------|---------|
| R-type        | 2             | 1                   | 2          | 0            | 1                    | 6 ns  | 44      |
| Load          | 2             | 1                   | 2          | 2            | 1                    | 8 ns  | 24      |
| Store         | 2             | 1                   | 2          | 2            | 0                    | 7 ns  | 12      |
| Branch        | 2             | 1                   | 2          | 0            | 0                    | 5 ns  | 18      |
| Jump          | 2             | 0                   | 0          | 0            | 0                    | 2 ns  | 2       |

The cycle time must accommodate the longest operation: lw. Cycle time = 8 ns but the CPI = 1.

If we can accommodate variable number of cycles for each instruction and a cycle time of 1ns. CPI = 6\*44% + 8\*24% + 7\*12% + 5\*18% + 2\*2% = 6.3

#### Where we are headed



- > Single Cycle Problems:
  - > what if we had a more complicated instruction like floating point?
  - waste of area
- > One Solution:
  - use a "smaller" cycle time
  - have different instructions take different numbers of cycles
  - > a "multicycle" datapath:



## **Multicycle Approach**



- > We will be reusing functional units
  - > ALU used to compute address and to increment PC
  - Memory used for instruction and data
- Our control signals will not be determined solely by instruction
  - e.g., what should the ALU do for a "subtract" instruction?
- > We'll use a finite state machine for control

#### **Review: finite state machines**



- > Finite state machines:
  - > a set of states and
  - > next state function (determined by current state and the input)
  - > output function (determined by current state and possibly input)



> We'll use a Moore machine (output based only on current state)

#### **Multicycle Approach**



- > Break up the instructions into steps, each step takes a cycle
  - balance the amount of work to be done
  - restrict each cycle to use only one major functional unit
- > At the end of a cycle
  - store values for use in later cycles (easiest thing to do)
  - introduce additional "internal" registers



#### **Muticyle Datapath**





### **Five Execution Steps**



- Instruction Fetch (F)
- Instruction Decode and Register Fetch (D)
- Execution, Memory Address Computation, or Branch Completion (EX)
- Memory Access or R-type instruction completion (M)
- Write-back step (W)

#### INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

## **Step 1: Instruction Fetch**



- □ Use PC to get instruction and put it in the Instruction Register.
- Increment the PC by 4 and put the result back in the PC.
- Can be described concisely using RTL "Register-Transfer Language"

IR = Memory[PC]; PC = PC + 4;

• Can we figure out the values of the control signals?

#### **Datapath of Multicycle Implementation**



#### Step 2: Instruction Decode and Register Fetch

- Read registers rs and rt in case we need them
- Compute the branch address in case the instruction is a branch
- > RTL:

```
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);</pre>
```

 We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

## Single Cycle Implementation



JLR

#### **Datapath of Multicycle Implementation**



## Step 3 (instruction dependent) UCR

- > ALU is performing one of three functions, based on instruction type
- > Memory Reference:

ALUOut = A + sign-extend(IR[15-0]);

> R-type:

ALUOut = A op B;

> Branch:

if (A==B) PC = ALUOut;

#### **Datapath of Multicycle Implementation**



#### Step 4 (R-type or memory-access)



Loads and stores access memory

MDR = Memory[ALUOut];
 or
Memory[ALUOut] = B;

R-type instructions finish

Reg[IR[15-11]] = ALUOut;

The write actually takes place at the end of the cycle on the edge

#### **Datapath of Multicycle Implementation**



#### **Step 5 Write-back step**



 $\Box$  Reg[IR[20-16]] = MDR;

What about all the other instructions?





| Step name                                                                                                         | Action for R-type<br>instructions | Action for memory-reference<br>instructions                    | Action for branches            | Action for<br>jumps                 |  |  |
|-------------------------------------------------------------------------------------------------------------------|-----------------------------------|----------------------------------------------------------------|--------------------------------|-------------------------------------|--|--|
| Instruction fetch                                                                                                 | IR = Memory[PC]<br>PC = PC + 4    |                                                                |                                |                                     |  |  |
| InstructionA = Reg [IR[25-21]]decode/register fetchB = Reg [IR[20-16]]ALUOut = PC + (sign-extend (IR[15-0]) << 2) |                                   |                                                                |                                |                                     |  |  |
| Execution, address<br>computation, branch/<br>jump completion                                                     | ALUOut = A op B                   | ALUOut = A + sign-extend<br>(IR[15-0])                         | if (A ==B) then<br>PC = ALUOut | PC = PC [31-28] II<br>(IR[25-0]<<2) |  |  |
| Memory access or R-type completion                                                                                | Reg [IR[15-11]] =<br>ALUOut       | Load: MDR = Memory[ALUOut]<br>or<br>Store: Memory [ALUOut] = B |                                |                                     |  |  |
| Memory read completion                                                                                            |                                   | Load: Reg[IR[20-16]] = MDR                                     |                                |                                     |  |  |

#### **Simple Questions**



• How many cycles will it take to execute this code?



- What is going on during the 8th cycle of execution?
- In what cycle does the actual addition of \$t2 and \$t3 takes place?

## Implementing the Control



- Value of control signals is dependent upon:
  - what instruction is being executed
  - which step is being performed
- Use the information we've accumulated to specify a finite state machine
  - specify the finite state machine graphically, or
  - use microprogramming
- Implementation can be derived from specification

#### **Graphical Specification of FSM**





#### **Detailed FSM**





#### **Finite State Machine for Control**



#### □ Implementation:







Inputs





## **PLA Implementation**

UCR

- Programmable Logic Array
   If I picked a horizontal
  - or vertical line could you explain it?



## **ROM Implementation**



- ROM = "Read Only Memory"
  - values of memory locations are fixed ahead of time
- A ROM can be used to implement a truth table
  - **if the address is m-bits, we can address 2<sup>m</sup> entries in the ROM.**
  - > our outputs are the bits of data that the address points to.



| 0      | 0      | 0 | 0<br>1<br>1<br>0<br>0<br>0<br>0 | 0 | 1 | 1 |
|--------|--------|---|---------------------------------|---|---|---|
| 0      | 0      | 1 | 1                               | 1 | 0 | 0 |
| 0      | 1      | 0 | 1                               | 1 | 0 | 0 |
| 0      | 1      | 1 | 1                               | 0 | 0 | 0 |
| 1      | 0      | 0 | 0                               | 0 | 0 | 0 |
| 1<br>1 | 0<br>1 | 1 | 0                               | 0 | 0 | 1 |
| 1      | 1      | 0 | 0                               | 1 | 1 | 0 |
| 1      | 1      | 1 | 0                               | 1 | 1 | 1 |

m is the "heigth", and n is the "width"

## **ROM Implementation**



- How many inputs are there?
  - 6 bits for opcode, 4 bits for state = 10 address lines
  - (i.e.,  $2^{10} = 1024$  different addresses)
- How many outputs are there?
   16 datapath-control outputs, 4 state bits = 20 outputs
- ROM is 2<sup>10</sup> x 20 = 20K bits
- Rather wasteful, since for lots of the entries, the outputs are the same
  - i.e., opcode is often ignored

#### **ROM vs PLA**

UCR

- > Break up the table into two parts
  - 4 state bits tell you the 16 outputs,  $2^4 \times 16$  bits of ROM
  - 10 bits tell you the 4 next state bits,  $2^{10} \times 4$  bits of ROM
  - Total: 4.3K bits of ROM
- > PLA is much smaller
  - can share product terms
  - only need entries that produce an active output
  - can take into account don't cares
- Size is (#inputs × #product-terms) + (#outputs × #product-terms)
   For this example = (10x17)+(20x17) = 460 PLA cells
- > PLA cells usually about the size of a ROM cell (slightly bigger)

## **Another Implementation Style**



Complex instructions: the "next state" is often current state + 1



#### **Details**

| Dispatch ROM 1 |             |       |  |  |  |  |  |
|----------------|-------------|-------|--|--|--|--|--|
| Ор             | Opcode name | Value |  |  |  |  |  |
| 000000         | R-format    | 0110  |  |  |  |  |  |
| 000010         | jmp         | 1001  |  |  |  |  |  |
| 000100         | beq         | 1000  |  |  |  |  |  |
| 100011         | lw          | 0010  |  |  |  |  |  |
| 101011         | SW          | 0010  |  |  |  |  |  |

|         |        | Dispatch ROM 2  |          |            |             |        |                |        |
|---------|--------|-----------------|----------|------------|-------------|--------|----------------|--------|
| 4 1     | Ор     | Opcode name     | Valu     |            |             | -      |                |        |
|         | 100011 | lw              | 001      |            |             |        |                |        |
|         | 101011 | SW              | 010      |            |             |        |                |        |
|         |        |                 |          | PLA        | or ROM      |        |                |        |
|         |        |                 |          |            | Î           |        |                |        |
|         |        | 1               |          |            |             |        |                |        |
|         |        |                 | Г        |            |             |        |                |        |
|         |        |                 | L        | St         | tate        |        |                |        |
|         |        | Adder           |          |            |             |        |                | 1      |
|         |        |                 |          |            | lux         |        |                | AddrCt |
|         |        |                 | (        | 3 2        | 1 0         |        |                |        |
|         |        |                 |          | †          |             |        |                |        |
|         |        |                 | -        |            | 0           |        |                |        |
|         |        |                 |          |            |             | L      | -              |        |
|         |        |                 | Dispatch | ROM 2      | Dispatch    | ROM 1  |                |        |
|         |        |                 | 1        |            | 1           | Addroo | s select logic |        |
|         |        |                 |          |            |             | Addres |                |        |
|         |        |                 | _        | dO         |             |        |                |        |
| l aati  | on     | Value of AddrCt |          | Instructio | on register |        |                |        |
| ol acti | 011    |                 |          | орсо       | de field    | Į      |                |        |
|         |        | 3               |          |            |             |        |                |        |
|         |        | 1               |          |            |             |        |                |        |
|         |        | 2               |          |            |             |        |                |        |
|         |        | 3               |          |            |             |        |                |        |
|         |        | 0               |          |            |             |        |                |        |
|         |        | 0               |          |            |             |        |                |        |
|         |        | 3               |          |            |             |        |                |        |
|         |        | 0               |          |            |             |        |                |        |
|         |        |                 |          |            |             |        |                |        |

| State number | Address-control action    | Value of AddrCtl |
|--------------|---------------------------|------------------|
| 0            | Use incremented state     | 3                |
| 1            | Use dispatch ROM 1        | 1                |
| 2            | Use dispatch ROM 2        | 2                |
| 3            | Use incremented state     | 3                |
| 4            | Replace state number by 0 | 0                |
| 5            | Replace state number by 0 | 0                |
| 6            | Use incremented state     | 3                |
| 7            | Replace state number by 0 | 0                |
| 8            | Replace state number by 0 | 0                |
| 9            | Replace state number by 0 | 0                |

#### Microprogramming





## Microprogramming

- How all CISC ISAs were built.
- µ-store: RAM-memory with N bits/word
  - N = number of control lines in CPU.
  - Execution of each instruction starts IF and ID.
  - Instruction Opcode points to a location in µ-store
  - Output consecutive control words until DONE.
- µ-store is updatable
- µ-controller is = a mini-CPU running the main CPU
- M-programming worked when CPU = multiple boards
  - µ-controller on one board: much faster



## Microprogramming



- A specification methodology
  - > appropriate if hundreds of opcodes, modes, cycles, etc.
  - signals specified symbolically using microinstructions

| Label    | ALU<br>control | SRC1 | SRC2    | Register<br>control | Memory    | PCWrite<br>control | Sequencing |
|----------|----------------|------|---------|---------------------|-----------|--------------------|------------|
| Fetch    | Add            | PC   | 4       |                     | Read PC   | ALU                | Seq        |
|          | Add            | PC   | Extshft | Read                |           |                    | Dispatch 1 |
| Mem1     | Add            | А    | Extend  |                     |           |                    | Dispatch 2 |
| LW2      |                |      |         |                     | Read ALU  |                    | Seq        |
|          |                |      |         | Write MDR           |           |                    | Fetch      |
| SW2      |                |      |         |                     | Write ALU |                    | Fetch      |
| Rformat1 | Func code      | А    | В       |                     |           |                    | Seq        |
|          |                |      |         | Write ALU           |           |                    | Fetch      |
| BEQ1     | Subt           | А    | В       |                     |           | ALUOut-cond        | Fetch      |
| JUMP1    |                |      |         |                     |           | Jump address       | Fetch      |

- □ Will two implementations of the same architecture have the same microcode?
- What would a microassembler do?

#### Microinstruction format



| Field name          | Value        | Signals active                           | Comment                                                                                                                      |
|---------------------|--------------|------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|                     | Add          | ALUOp = 00                               | Cause the ALU to add.                                                                                                        |
| ALU control         | Subt         | ALUOp = 01                               | Cause the ALU to subtract; this implements the compare for branches.                                                         |
|                     | Func code    | ALUOp = 10                               | Use the instruction's function code to determine ALU control.                                                                |
| SRC1                | PC           | ALUSrcA = 0                              | Use the PC as the first ALU input.                                                                                           |
|                     | A            | ALUSrcA = 1                              | Register A is the first ALU input.                                                                                           |
|                     | В            | ALUSrcB = 00                             | Register B is the second ALU input.                                                                                          |
| SRC2                | 4            | ALUSrcB = 01                             | Use 4 as the second ALU input.                                                                                               |
|                     | Extend       | ALUSrcB = 10                             | Use output of the sign extension unit as the second ALU input.                                                               |
|                     | Extshft      | ALUSrcB = 11                             | Use the output of the shift-by-two unit as the second ALU input.                                                             |
|                     | Read         |                                          | Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B. |
|                     | Write ALU    | RegWrite,                                | Write a register using the rd field of the IR as the register number and                                                     |
| Register<br>control |              | RegDst = 1,<br>MemtoReg = 0              | the contents of the ALUOut as the data.                                                                                      |
|                     | Write MDR    | RegWrite,<br>RegDst = 0,<br>MemtoReg = 1 | Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.                |
|                     | Read PC      | MemRead,<br>lorD = 0                     | Read memory using the PC as address; write result into IR (and the MDR).                                                     |
| Memory              | Read ALU     | MemRead,<br>lorD = 1                     | Read memory using the ALUOut as address; write result into MDR.                                                              |
|                     | Write ALU    | MemWrite,<br>lorD = 1                    | Write memory using the ALUOut as address, contents of B as the data.                                                         |
|                     | ALU          | PCSource = 00<br>PCWrite                 | Write the output of the ALU into the PC.                                                                                     |
| PC write control    | ALUOut-cond  | PCSource = 01,<br>PCWriteCond            | If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.                              |
|                     | jump address | PCSource = 10,<br>PCWrite                | Write the PC with the jump address from the instruction.                                                                     |
|                     | Seq          | AddrCtl = 11                             | Choose the next microinstruction sequentially.                                                                               |
| Sequencing          | Fetch        | AddrCtl = 00                             | Go to the first microinstruction to begin a new instruction.                                                                 |
|                     | Dispatch 1   | AddrCtl = 01                             | Dispatch using the ROM 1.                                                                                                    |
|                     | Dispatch 2   | AddrCtl = 10                             | Dispatch using the ROM 2.                                                                                                    |

# Encoded

UCR

- > No encoding:
  - > 1 bit for each datapath operation
  - faster, requires more memory (logic)
  - used for Vax 780 an astonishing 400K of memory!
- Lots of encoding:
  - send the microinstructions through logic to get control signals
  - uses less memory, slower
- > Historical context of CISC:
  - > Too much logic to put on a single chip with everything else
  - > Use a ROM (or even RAM) to hold the microcode
  - > It's easy to add new instructions

#### Microcode: Trade-offs



- Distinction between specification and implementation is sometimes blurred
- Specification Advantages:
  - > Easy to design and write
  - > Design architecture and microcode in parallel
- Implementation (off-chip ROM) Advantages
  - > Easy to change since values are in memory
  - > Can emulate other architectures
  - > Can make use of internal registers
- Implementation Disadvantages, SLOWER now that:
  - > Control is implemented on same chip as processor
  - > ROM is no longer faster than RAM
  - > No need to go back and make changes

#### **The Big Picture**



