6.004 Tutorial Problems
L20 – Control Hazards in Pipelined Processors
Problem 1. ★

The loop on the right has been executing for a while on our standard 5-stage pipelined RISC-V processor with branch annulment and full bypassing.

(A) **Fill in the pipeline diagram** for cycles 300-309 assuming that at cycle 300 the instruction at L1 is fetched. Also, assume that the branch to L2 is taken, as well as the final branch back to L1. **Indicate which bypass/forwarding paths are active in each cycle by drawing a vertical arrow in the pipeline diagram** from pipeline stage X in a column to the RF stage in the same column if an operand would be bypassed from stage X back to the RF stage that cycle. Note that there may be more than one vertical arrow in a column.

(B) **Fill in pipeline diagram including bypass arrows in pipeline diagram above**

(C) **Indicate which branches are taken by providing the cycle in which the taken branch instruction enters the IF stage.**

**Cycle number(s) or NONE:** __302, 307____

(D) **During which cycle(s), if any, do we have stalled instructions?**

**Cycle number(s) or NONE:** __307, 308____
Now consider a modified processor, P2, which has extra hardware in the decode stage (DEC) to resolve simple branches one cycle earlier: the decode stage includes both a circuit to check whether a register is equal to zero, and an extra adder to compute the branch target for taken branches. This processor can thus compute nextPC for beqz and bnez in DEC instead of EXE.

(E) Redo part A using processor P2 assuming the same path is taken through the code.

(F) Compare the number of cycles per loop iteration using the original processor and the modified processor.

Cycles per loop in original processor: ___12___________

Cycles per loop in processor P2: ___10___________
Problem 2.  ★

You’ve discovered a secret room in the basement of the Stata center full of discarded 5-stage pipelined RISC-V processors. Unfortunately, many have certain defects. You discover that they fall into four categories:

C1: Completely functional 5-stage RISC-V processor with working bypass paths, annulment, and other components.
C2: RISC-V processor with a bad register file: all data read from the register file is zero.
C3: RISC-V processor without bypass paths: all source operands come from the register file.
C4: RISC-V processor without annulment of instructions following branches.

To help sort the processors into the above classes, you write the following small test program:

```assembly
. = 0x0
// Start at 0x0, with ZERO in all registers...
addi x10, x0, 4
jal x12, X
slli x12, x12, 1
X: addi x12, x12, -4
add x13, x12, x10
jr x13
```

Your plan is to single-step through the program using each processor, carefully noting the address the final `jr` loads into the PC. Your goal is to determine which of the above four classes a chip falls into by this `jr` address.

For each class of RISC-V processor described above, specify the value that will be loaded into the PC by the final `jr` instruction.

Pipeline diagram showing first 7 cycles of test program executing on C1:

<table>
<thead>
<tr>
<th>cycle</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>addi</td>
<td>jal</td>
<td>slli</td>
<td>addi</td>
<td>add</td>
<td>jr</td>
<td></td>
</tr>
<tr>
<td>DEC</td>
<td>addi</td>
<td>jal</td>
<td>NOP</td>
<td>addi</td>
<td>add</td>
<td>jr</td>
<td></td>
</tr>
<tr>
<td>EXE</td>
<td>addi</td>
<td>jal</td>
<td>NOP</td>
<td>addi</td>
<td>add</td>
<td></td>
<td></td>
</tr>
<tr>
<td>MEM</td>
<td>addi</td>
<td>jal</td>
<td>NOP</td>
<td>addi</td>
<td>add</td>
<td></td>
<td></td>
</tr>
<tr>
<td>WB</td>
<td>addi</td>
<td>jal</td>
<td>NOP</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

C1: `jr` goes to address: ____8_____

C2: `jr` goes to address: ____4_____

C3: `jr` goes to address: ____0_____

C4: `jr` goes to address: ____16_____

<table>
<thead>
<tr>
<th>. = 0x0</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>// Start at 0x0, with ZERO</td>
<td>C1</td>
<td>C2</td>
<td>C3</td>
<td>C4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>// in all registers...</td>
<td>x10 = 4</td>
<td>x10 = 4</td>
<td>x10 = 4</td>
<td>x10 = 4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>addi x10, x0, 4</td>
<td>x12 = 8</td>
<td>x12 = 8</td>
<td>x12 = 8</td>
<td>x12 = 8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>jal x12, X</td>
<td>x12 = 4</td>
<td>x12 = 4</td>
<td>x12 = -4</td>
<td>x12 = 12</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>slli x12, x12, 1</td>
<td>x13 = 8</td>
<td>x13 = 4</td>
<td>x13 = 4</td>
<td>x12 = 16</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>X: addi x12, x12, -4</td>
<td>want x12 from bypass</td>
<td>want x10 from RF</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>add x13, x12, x10</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Problem 3. ★

(A) How many cycles does it take to run each iteration of the following loop on a standard 5-stage pipelined RISC-V processor?

\[
\text{loop: } \text{lw } x10, 0x100(x0) \\
\quad \text{beqz } x10, \text{loop} \\
\quad \text{add } x12, x10, x11 \\
\quad \text{sub } x13, x12, x1
\]

Number of cycles per loop iteration: \( \boxed{6} \)

(B) Assuming a non-standard 5-stage pipelined RISC-V processor where the instructions following a taken branch are not annulled, which of the following statements would be true?

1. The add instruction would be executed each time through the loop.
2. The loop would take 5 cycles to execute.
3. The value of the register \( x10 \) that is tested by the beqz instruction comes from a bypass path.
4. The value of register \( x10 \) that is accessed by the add instruction comes from the register file.

(C) Consider a modified processor, P2, which has extra hardware for the special case of checking if a register is equal to zero or not in the decode stage. What would be the number of cycles per loop iteration in this case?

Number of cycles per loop iteration on processor P2: \( \boxed{5} \)
(D) Now consider a third processor, P3, whose instruction and data memories are pipelined and take 2 clock cycles to respond. Assume that P3 also has the extra hardware for checking if a register is equal to zero or not in the decode stage. What would be the number of cycles per loop iteration using P3?

**NOP Number of cycles per loop iteration on processor P3: 7**

<table>
<thead>
<tr>
<th>cycle</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>11</th>
<th>12</th>
<th>13</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>lw</td>
<td>beqz</td>
<td>add</td>
<td>sub</td>
<td>sub</td>
<td>sub</td>
<td>sub</td>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>lw</td>
<td>beqz</td>
<td>add</td>
<td>add</td>
<td>add</td>
<td>add</td>
<td>NOP</td>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DEC</td>
<td>lw</td>
<td>beqz</td>
<td>beqz</td>
<td>beqz</td>
<td>NOP</td>
<td>NOP</td>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>EXE</td>
<td>lw</td>
<td>NOP</td>
<td>NOP</td>
<td>NOP</td>
<td>beqz</td>
<td>NOP</td>
<td>NOP</td>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>MEM</td>
<td>lw</td>
<td>NOP</td>
<td>NOP</td>
<td>NOP</td>
<td>beqz</td>
<td>NOP</td>
<td>NOP</td>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>WB</td>
<td>lw</td>
<td>NOP</td>
<td>NOP</td>
<td>NOP</td>
<td>beqz</td>
<td>NOP</td>
<td>NOP</td>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

7 cycles
Problem 4.

You’ve been given a 5-stage pipelined RISC-V processor. Unfortunately, the processor you’ve been given is defective: it has no bypass paths, annulment of instructions in branch delay slots, or pipeline stalls.

```
loop:
    lw x10, 0x0(x10)
AA:
    sll x14, x10, x11
BB:
    bnez x10, loop
CC:
    add x13, x10, x13
```

You undertake to convert some existing code, designed to run on an unpipelined RISC-V, to run on your defective pipelined processor. The scrap of code on above is a sample of the program to be converted. It doesn’t make much sense to you – it doesn’t to us either – but you are to add the minimum number of NOP instructions at the various tagged points in this code to make it give the same results on your defective pipelined RISC-V as it gives on a normal, unpipelined RISC-V.

Note that the code scrap begins and ends with sequences of NOPs; thus you don’t need to worry about pipeline hazards involving interactions with instructions outside of the region shown.

(A) Specify the minimal number of NOP instructions (defined as add x0, x0, x0) to be added at each of the labeled points in the above program.

```
NOPs at Loop: __0____
NOPs at AA: __3____
NOPs at BB: __0____
NOPs at CC: __2____
```

(B) On a fully functional 5-stage pipeline (with working bypass, annul, and stall logic), the above code will run fine with no added NOPs. How many clock cycles of execution time are required by the fully functional 5-stage pipelined RISC-V for each iteration through the loop?

Clocks per loop iteration: __7____