6.004 Worksheet
L25 – Cache Coherence

**MSI State Transition Diagram**

- **PrWr / --**
- **PrRd / --**
- **PrRd / BusRdX**
- **BusRd / BusWB**
- **BusRdX / BusWB**
- **BusRdX / --**

**Actions**
- Processor Read (PrRd)
- Processor Write (PrWr)
- Bus Read (BusRd)
- Bus Read Exclusive (BusRdX)
- Bus Writeback (BusWB)

**MESI State Transition Diagram**

- **PrWr / --**
- **PrRd / --**
- **PrRd / BusRd**
- **BusRd / BusWB**
- **BusRdX / BusWB**
- **BusRdX / --**
- **PrRd / BusRd if no other sharers**
- **PrRd / BusRd if other sharers**
**Problem 1.**

A multicore processor has multiple threads each running on a different core. The threads share memory (but has its own stack).

We introduce a new instruction `swap` that supports an atomic read-modify-write operation on a memory location, where the processor’s cache coherence protocol guarantees that no other `swap` operation can access the same memory location at the same time.

\[
\text{swap } rs1, \text{ literal, rd} \\
\text{PC } \leftarrow \text{PC } + \text{4} \\
\text{EA } \leftarrow \text{Reg}[rs1] + \text{literal} \\
\text{TMP } \leftarrow \text{Mem}[\text{EA}] \quad \text{// atomic with following line} \\
\text{Mem}[\text{EA}] \leftarrow \text{Reg}[\text{rd}] \quad \text{// atomic with preceding line} \\
\text{Reg}[\text{rd}] \leftarrow \text{TMP}
\]

The `swap` instruction can implement locks as a binary semaphore.

\[
\text{li } x1, 0 \quad \text{// WAIT operation on lock} \\
\text{loop: swap } x0, \text{ lock, x1} \\
\text{beqz x1, loop} \\
\text{... critical section ...} \\
\text{li } x1, 1 \quad \text{// SIGNAL operation on lock} \\
\text{sw x1, lock, x0}
\]

Using the code segments shown above, write code for the more general `Signal(s)` and `Wait(s)` operations on a semaphore, where `s` is a location in the shared memory. `Signal(s)` and `Wait(s)` need to work even if different cores are calling `Signal` or `Wait` at the same time.

<table>
<thead>
<tr>
<th>Code for Wait(s):</th>
<th>Code for Signal(s):</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
</tr>
</tbody>
</table>
**Problem 2.**

Snoopy coherence protocols rely on broadcast communication to detect sharing and updates. These are conventionally implemented using bus networks that allow for one message to be sent at a time to all nodes on the network.

(A) Ben Bitdiddle is implementing a bus-based snoopy coherence protocol. One fifth of instructions access memory, and one quarter of these miss in the core’s local cache (either because the line is invalid or doesn’t have necessary permissions). Assuming each memory operation consists of a request and acknowledgement, the network traffic per core is therefore:

\[
\frac{1}{5} \cdot \frac{1}{4} \cdot 1 = \frac{1}{10} \text{ messages/instruction}
\]

Assume all bus messages take a single cycle. Considering only the message carrying capacity of the shared bus, how many cores can the bus support?

(B) Ben needs to build a larger system than the bus network will allow, so he changes the system to use a unidirectional ring network. In this design, the core issuing the memory operation sends the request around the ring, and each node along the way either forwards the request or replaces it with its response. Assuming a single-cycle per hop in the network, at how many cores will this design saturate?
Problem 3.

Ben Bitdiddle is designing a snoopy-based, write-invalidate MSI protocol for write-back caches. Suppose processors P1 and P2 are have private, snoopy caches. Both caches are initially empty. Consider the following sequence of accesses:

I0 P2: read A  
I1 P1: write A  
I2 P2: read A  
I3 P1: write A  
I4 P2: read A  
I5 P2: read B  
I6 P2: read A

(A) Assume blocks A and B do not conflict in the cache. Using the **MSI protocol**, fill in the following table showing the cache line states for A and B after each access.

<table>
<thead>
<tr>
<th>Access</th>
<th>Shared bus transaction</th>
<th>Processor P1’s cache</th>
<th>Processor P2’s cache</th>
</tr>
</thead>
<tbody>
<tr>
<td>Initial state</td>
<td></td>
<td>A: I</td>
<td>B: I</td>
</tr>
<tr>
<td>After P2 reads A</td>
<td></td>
<td>A: I</td>
<td>B: I</td>
</tr>
<tr>
<td>After P1 writes A</td>
<td></td>
<td>A: B</td>
<td>A: B</td>
</tr>
<tr>
<td>After P2 reads A</td>
<td></td>
<td>A: B</td>
<td>A: B</td>
</tr>
<tr>
<td>After P1 writes A</td>
<td></td>
<td>A: B</td>
<td>A: B</td>
</tr>
<tr>
<td>After P2 reads A</td>
<td></td>
<td>A: B</td>
<td>A: B</td>
</tr>
<tr>
<td>After P2 reads B</td>
<td></td>
<td>A: B</td>
<td>A: B</td>
</tr>
<tr>
<td>After P2 reads A</td>
<td></td>
<td>A: B</td>
<td>A: B</td>
</tr>
</tbody>
</table>

(B) If Ben switches to a **MESI protocol**, which bus transactions and cache states would be different?

(C) Briefly describe a scenario where the additional E state in the MESI protocol would eliminate a shared bus trans
Problem 4.

We want to study the tradeoffs between the standard directory-based MSI and MESI coherence protocols. The state transition diagrams for the two protocols are shown on the front page of this handout.

(A) Consider the four-core system below. Each core has a private cache which are kept coherent with a directory protocol. Each core runs a thread that issues a load followed by a store to a single address. Each thread accesses a different address (core 1’s thread accesses A, core 2’s thread accesses B, etc.). These thread-private addresses are on different cache lines. The number in parenthesis indicates the global order of the accesses. Each access completes before the next one begins.

For this sequence of 8 accesses, provide in the table below the total number of each type of bus requests for the MSI and MESI protocols. Assume all caches are initially empty, i.e., the initial states for each cache line is “I”.

<table>
<thead>
<tr>
<th>Protocol</th>
<th># of BusRd</th>
<th># of BusRdX</th>
<th># of BusWB</th>
</tr>
</thead>
<tbody>
<tr>
<td>MSI</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>MESI</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

(B) Consider a different program where each thread reads globally shared data:

For this sequence of 4 accesses, fill in the table below for the MSI and MESI protocols. Ignore coherence responses. Assume all caches are initially empty.

<table>
<thead>
<tr>
<th>Protocol</th>
<th># of BusRd</th>
<th># of BusRdX</th>
<th># of BusWB</th>
</tr>
</thead>
<tbody>
<tr>
<td>MSI</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>MESI</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>