Exercise 3

CSE 7381/5381

Computer Architecture

Note: Don't get scared by the length of the solution - it is very descriptive. Make sure you understand all the problems in HW1 and the sample exam. You will be in good shape, even if you do not read this. Use this mainly as reference material.

Exercise 3.3

The pipeline in this exercise is not specified in the detail that the DLX pipeline was earlier in the text. As a result, unless otherwise noted we assume that the function of the various stages is similar to that of the equivalent DLX stages.

Instruction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

ld f0,0(r2) F D X M W

ld f4,0(r3) F D X M W

addi r2,r2,#8 F D X M W

multd f0,f0,f4 F D X X X X X X X M W

sub r5,r4,r2 F D X M W

addd f2,f0,f2 F D S S S S S X X X X M W

bnz r5,loop F S S S S S D X M W

addi r3,r3,#8 F D X M W

ld f0,0(r2) F D S X M W

ld f4,0(r3) F S D X M

addi r2,r2,#8 F D X

Figure 3.8: Pipeline Diagram for a Fully Bypassed DLX Pipeline Executing a Scheduled Version of the FP Loop.

This exercise explores how bypassing, forwarding, and interlocking logic need to be added to prevent stalls in a hypothetical pipeline. The pipeline examined is based on the pipeline used in the VAX 8700 but is slightly simplified for our purposes.

Exercise 3.3(a)

To avoid structural hazards we must ensure that any pipe stage that may require an adder has its own. The simplest way to determine how many adders the architecture requires is to consider each pipeline stage in turn:

IF always requires an adder to increment the PC.
RF and WB do not require adders because these stages only read from and write to the register le.
ALU1 may require an adder to compute the e ective address of some memory instructions.
MEM does not require an adder because it accesses memory and relies on ALU1 to perform any needed address computations.
ALU2 may use an adder to perform ALU operations required by the instruction.

froboz:     add r3,r4,r5 ; (ALU2) r4 + r5
                 nop ; (MEM)
                add r1,r0,0x667(r9) ; (ALU1) r9 + 0x667
                nop ; (RF) nop ; (IF) PC + 4

Figure 3.9: Code Fragment That Requires the Maximum Number of Adders.

From this jaunt down the pipeline, we conclude that in the worst case three adders are required (one in each of the IF, ALU1, and ALU2 stages).

For all three adders to be in use during a single cycle, certain types of instructions must be simultaneously in the IF, ALU1, and ALU2 stages:

IF processes any instruction.
ALU1 processes a memory access instruction that requires an e ective address computation (e.g., add r1,r2,0x10(r3)).
ALU2 processes an instruction requiring an adder to compute its result (e.g., add r1,r2,r3).

Any code sequence that uses all three adders during a single cycle, such as the code presented in Figure 3.9, must have the above three features. In Figure 3.9 each instruction's comment identifies both the stage the instruction is in during the cycle when all three adders are in use and the function performed by the adder. In this code, nop instructions (\no operation") appear in the MEM, RF, and IF pipe stages where it does not matter what instructions are being processed.

Exercise 3.3(b)

To solve this exercise we begin by examining how each pipeline stage behaves with respect to worst-case register read and write port usage:

IF, ALU1, MEM, and ALU2 do not access the register le and therefore do not use register le read or write ports.
RF uses two read ports to determine the values of the Rsrc1 and Rsrc2 registers
WB uses a write port to store the value into the Rdest register.

flotz: add r3,r4,r5 ; (WB) write r3
nop ; (ALU2)
nop ; (MEM)
nop ; (ALU1)
add r0,r1,r2 ; (RF) read r1, r2

Figure 3.10: Code Fragment That Requires a Maximum Number of Register File Read and Write Ports.

In the worst case, this architecture can utilize a total of two read ports and one write port into its register fi le. If fewer than this number of read and write ports is available, a structural hazard can potentially arise.

There are many possible code sequences that can utilize all the register file read and write ports, but all share the following important features during some cycle:

The instruction in RF reads two source operands (e.g., add r0,r1,r2).
The instruction in WB writes to the register le (e.g., sub r1,r2,r3).

A solution that meets these two criteria is shown in Figure 3.10. The comments in this Figure show the pipe stage that each instruction is in when the worst case usage of register file read and write ports occurs. Similar reasoning can be applied to determine the maximum number of memory read and write ports required by the architecture:

IF requires one memory read port to fetch the next instruction.
RF, ALU1, ALU2, and WB do not access memory and therefore do not use memory read or write ports.
MEM may use either a memory read or write port depending on the instruction executed.

The machine always needs one memory read port to support IF and it can potentially require an additional read or write port to memory if the in- struction in MEM requires access to memory. A code sequence that uses the maximum number of ports has the following salient features:

Any instruction may be in IF.
The instruction in MEM must either read-from or write-to memory.

wubba: ld r0,0x10(r1) ; (MEM) mem read to load
nop ; (ALU1)
nop ; (RF)
nop ; (IF) mem read to fetch

Figure 3.11: Code Fragment that Uses the Maximum Number of Memory Read Ports.

Thus, the memory must be able to support a simultaneous read and write or two simultaneous reads to prevent structural hazards. Again, there are many code sequences that will use all of the available memory read and write ports. The code in Figure 3.11 meets these requirements. The comments in this figure show the pipe stage that each instruction is in when the worst case usage of memory read and write ports occurs.

Exercise 3.3(c)

To solve the exercise, we must consider whether a condition can arise where a later instruction in the pipeline requires a result produced by an earlier instruction that has not been \committed" to processor state. For this exercise, we only need to nd the forwarding paths required between the two ALU stages, ALU1 and ALU2. Before going any farther, we should review the function of the ALU1 and ALU2 pipeline stages.

The ALU1 stage computes the effective address of a memory access. ALU1's result provides the address of the memory access required by the instruction to the MEM stage of the pipeline.
The ALU2 stage performs any ALU operations required by an instruction. For example, for an add instruction it is the ALU2 stage that adds the values of the source registers.

With this information, we can determine the forwarding paths required between the two ALU stages of the pipeline. The result produced by ALU1 is used only in the MEM stage of the same instruction to provide the effective address of the access and is never needed by a different instruction in the machine. Therefore, there is no need for forwarding paths from ALU1 to either ALU2 or ALU1. The result of the ALU2 stage can be used in either ALU2 or ALU1 of a later instruction. As a result, we may have to forward results from ALU2 to either ALU2 or ALU1

Instruction	1	2	3	4	5	6	7	8	9
i	IF	RF	A1	M	A2	WB
i+1		IF	RF	A1	M	A2	WB
i+2			IF	RF	A1	M	A2	WB
i+3				IF	RF	A1	M	A2	WB
i+4					IF	RF	...

Figure 3.12: A Pipeline.

of a subsequent instruction if the proper conditions occur. Before discussing these cases further, let us look at Figure 3.12, which presents a diagram of this pipeline with several instructions in ight. This figure uses abbreviated names for each stages to make the fi gure fit on the page: A1, A2, and M correspond to stages ALU1, ALU2, and MEM, respectively. From Figure 3.12, it should be clear that hazards can only exist between instruction i and instructions i + 1,i+ 2, and i + 3. Instruction i writes its result in WB in the same clock as instruction i + 4 reads its operands in RF. Assuming split-phase write/read of the register file, instruction i +4 will always obtain the correct value of a result produced by instruction i. With the groundwork complete, we can consider each potential data hazard outlined above in greater depth.

In our first hazard case, instruction i produces a value in ALU2 that a following ALU instruction requires in ALU2. To get around the potential hazard, we forward the result from ALU2 to future instructions in the pipeline that require the result. Instruction i + 1 can receive the value from i if it is forwarded from the WB stage of i to the inputs of ALU2 in cycle 6. Similarly, instructions i + 2 and i + 3 can receive the value from i if it is forwarded from the WB stage of i to the source latches for the appropriate register value in cycle 7.

In the second hazardous situation, instruction i produces a value in ALU2 that a following memory operation instruction requires in ALU1. If the consuming instruction is in positions i + 1ori+ 2, the pipeline must stall because forwarding to i + 1ori+ 2 requires going back in time (as i + 1 and i + 2 require the value prior to the beginning of cycle 4, but i does not produce it until the end of cycle 5). Finally, instruction i +3 can receive the value from i if it is forwarded from the WB stage of i to the inputs of ALU1 in cycle 6.

Summarizing the results of the above discussions as per Figure 3.19 in the text leads to Figure 3.13. In this gure the destination of the result can be either ALU input or either latch used to carry the value of the source registers down the pipeline. The phrase \ALU Op" represents any instruc- tion that uses ALU2 to perform a computation, and \EA Op" represents any instruction that computes an e ective address. Finally, each line in Figure 3.13 corresponds to two forwarding paths:

1. A path to the \Source 1" input that is activated when the destination register of the source instruction is the same as the source 1 register of the destination instruction.

2. A path to the \Source 2" input that is activated when the destination register of the source instruction is the same as the source 2 register of the destination instruction.

Exercise 3.3(d)

The easiest way to determine the data-forwarding requirements is to begin by asking yourself what each stage of the pipeline can produce and consume:

The IF, RF, and WB stages do not produce values that are forwarded through bypass paths.
The ALU1 stage computes an e ective address for a given memory ac- cess instruction. Such addresses are not used by any other instruction and thus, ALU1 does not need to forward its results.

Instruction	1	2	3	4	5	6	7	8	9
i	IF	RF	A1	M	A2	WB
i+1		IF	RF	A1	M	A2	WB
i+2			IF	RF	A1	M	A2	WB
i+3				IF	RF	A1	M	A2	WB
i+4					IF	RF	A1	M	...

Figure 3.14: The Pipeline.

The MEM stage accesses a value in memory. On loads, the value could be used in ALU1 as part of an e ective address computation, in MEM as a value to be stored to memory, or in ALU2 as an operand to an ALU operation of a later instruction. Note that stores do not produce a value.
The ALU2 stage computes the result of an instruction's operation. This value could be used in ALU1 as part of an e ective address computation, in MEM as a value to be stored to memory, or in ALU2 as an operand to an ALU operation of a later instruction.

Because Exercise 3.3(c) examines the forwarding paths between the two ALU stages we ignore these cases. This leaves us with four potential data hazards

1. Memory load to effective address computation (MEM to ALU1).

2. Memory load to ALU operation (MEM to ALU2).

3. Memory load to memory store (MEM to MEM).

4. ALU operation to memory store (ALU2 to MEM).

Now that we have identified where the data forwarding needs to occur, we can examine the specifics for each of these three cases. Before going on, let us look at Figure 3.14 which presents a diagram of the pipeline with several instructions in ight. This figure uses abbreviated names for each stages to squeeze the figure on the page: A1, A2, and M correspond to stages ALU1, ALU2, and MEM, respectively.

From Figure 3.14, it should be clear that hazards can only exist between instruction i and instructions i + 1,i+ 2, and i + 3. Instruction i writes its result in WB in the same clock as instruction i + 4 reads its operands in RF. Assuming split-phase write/read of the register file, instruction i +4 will always obtain the correct value of a result produced by instruction i.

With the groundwork complete, we can consider each potential data hazard outlined above in greater depth.

In the first case, a data hazard between MEM and ALU1, instruction i produces a result in MEM which is required by ALU1 of a later load instruction. If that instruction is in position i + 1 the pipeline must stall because forwarding to i+1 would require going back in time (as i+1 requires the value at the beginning of cycle 4 but i does not produce it until the end of cycle 4). Instruction i +2 can receive the value from i if it is forwarded from the MEM stage of i to the inputs of ALU1 in cycle 5. Likewise, instruction i + 3 can receive the value from i if it is forwarded from the WB stage of i to the inputs of ALU1 in cycle 6. For the second case, a data hazard between MEM and ALU2, instruction i produces a result in MEM which is required by ALU2 of a later ALU instruction. Instruction i + 1 can receive the value from MEM of i if the value is forwarded from ALU2 of i to the inputs of ALU2 in cycle 6. Similarly, instructions i +2 and i +3 can receive the value from MEM of i if the value is forwarded from the ALU2 stage of i to the source latches for the appropriate register at MEM or ALU1 in cycle 6. Note that we are forwarding from the portion of the pipeline register that passes the result of the load down the pipeline to WB.

For the third case, a data hazard between MEM and MEM, instruction i produces a result in MEM which is required by MEM of a later store instruction. Instructions i + 1ori+ 2 can receive the value from i if it is forwarded from the MEM stage of i to the source latches at MEM or ALU1 for the appropriate register value in cycle 5. Finally, Instruction i + 3 can receive the value from i if it is forwarded from the ALU2 stage of i to the source latch for the appropriate register value in cycle 6.

Finally, in the fourth case, a data hazard between ALU2 and MEM, instruction i produces a result in ALU2 which is required by MEM of a later store instruction. If that instruction is in position i + 1 the pipeline must stall because forwarding to i + 1would require going back in time (as i + 1 requires the value at the beginning og cycle 5 but i does not produce it until the end of cycle 5). Instruction i + 2 can receive the value from i if it is forwarded from the ALU2 stage of i to the inputs of MEM in cycle 6. Likewise, instruction i + 3 can receive the value from i if it is forwarded from the ALU2 stage of i to the source latch for the appropriate register at ALU1 in cycle 6.

Summarizing the results of the above discussions as per Figure 3.19 in the text leads to Figure 3.15. In this fi gure the destination of the result can be either ALU input or either latch used to carry the value of the source registers down the pipeline. The phrase \ALU Op" represents any instruc- tion that uses ALU2 to perform a computation and \EA Op" represents any instruction that computes an effective address. Finally, each line in Figure 3.15 corresponds to two forwarding paths:

1. One to the \Source 1" input activated when the destination register of the source instruction is the same as the source 1 register of the destination instruction.

2. One to the \Source 2" input activated when the destination register of the source instruction is the same as the source 2 register of the destination instruction.

These paths are not explicitly shown to clarify the gure. Also, because there may be several cycles between the times at which avalue is produced and consumed, there are multiple points in the pipeline where forwarding can occur. This observation implies that there are several ways to implement the necessary forwarding|this solution presents one possible implementation.

Exercise 3.3(e)

In Exercises 3.3(c) and 3.3(d) we have explored the data-forwarding paths required by the pipeline to prevent stalls. Unfortunately, there are several cases (identified in these exercises) where a value must be consumed before it is produced to eliminate a hazard. Forwarding can not remove such hazards. Instead, in these cases it is necessary to stall the pipeline with an interlock until the hazard clears. This exercise asks us to identify the interlocks in the pipeline and the number of stall cycles they introduce.

Rather than repeat the discussion from Exercises 3.3(c) and 3.3(d), we state the data hazards that require interlocks:

1. Load instruction i and an instruction i + 1 that computes an effective address (ALU1 requires the result from MEM).

2. ALU instruction i and instruction i + 1 that computes an effective address (ALU1 requires the result from ALU2).

3. ALU instruction i and store instruction i +1 (MEM requires the result from ALU2).

4. ALU instruction i and instruction i + 2 that computes an effective address (ALU1 requires the result from ALU2).

The labeling of instructions (i, i +1, i+2, etc.) indicates the cycle in which they enter the pipeline. These results can be derived by finding each hazard and then determining whether they can be solved with forwarding. Hazards can be found by considering the potential consumers of results produced by the various stages of the pipeline.

At this point, we present the interlock hardware and then brie y discuss howwe came up with all this information. For this processor, the interlocking hardware is described in Figure 3.16 as per Figure 3.18 of the text. The phrase \ALU Op" represents any operation that uses ALU2 to perform a computation that includes instructions such as add or compare, and \EA Op" includes any instruction that requires the computation of an effective address. The interlock is only applied if a source register of Opcode #1 matches the destination register of Opcode #2.

To understand how we arrive at Figure 3.16, let us examine the first case in greater depth. In this case, we are interested in stalling an instruction i+1 that computes an effective address, provided instruction i is a load. Now, the opcode in the IF/RF pipeline latch represents the instruction currently in RF. As this is the stage where we like to stall instructions if necessary, we check for the case where instruction i + 1 is an \EA Op" here. When instruction i + 1 is in the IF/RF pipeline register, instruction i must be in the RF/ALU1 pipeline register as it is a cycle ahead. As a result, we check for an ALU operation in the RF/ALU1 register to determine whether this particular interlock must be enforced. In addition to checking the opcodes, a check must be made to see whether the instruction issuing into the pipe (instruction i+1 in this case) uses a register written by the earlier instruction in the pipe (instruction i in this case). The remainder of the cases are determined in a similar fashion.

Exercise 3.3(f)

A branch requires a compare between two registers which takes place in the ALU2 stage of the pipeline. In this machine there is only a single control hazard following branch instructions, which requires a four cycle stall. An example of such a control hazard is shown in Figure 3.17 The fetch to the successor of the branch instruction, Branch +1, initially occurs in cycle 2 but actually does not occur until cycle 6 due to the control hazard. This exercise points out how bad control hazards can become if the pipeline is fairly deep. Essentially, this control hazard prevents four instructions from issuing into the pipeline! As a result, it is very important to resolve branches as early as possible in a pipelined processor. The number of stall cycles can be reduced by one when the fetch in cycle two retrieves the correct information if the fetch is not re-issued in cycle six.

Instruction 1 2 3 4 5 6 7 8

Branch IF RF ALU1 MEM ALU2 WB

Branch+1 IF stall stall stall IF RF ALU1

Branch+2 IF RF

Figure 3.17: A Control Hazard.

Exercise 3.10

These exercises examine a machine with a three-stage pipeline and consider under what conditions adding an additional stage to the pipeline improves performance. The three-stage pipeline consists of these three stages: Instruction Fetch, Operand Decode, and Execution or Memory Access (abbreviated IF, OD, and EM, respectively). The four-stage pipeline is built by adding a Write Back stage (abbreviated WB) to the end of the three-stage pipeline. Before presenting the solution to each exercise, we first develop an equation that is used to solve Exercises 3.10(a) and 3.10(b).

Because time is the final measure of performance, we begin by considering the equation for CPU Time from Chapter 1 of the text (as an aside, it is possible to solve these exercises using pipeline speedup; however, such solutions are more involved as you must keep CPI and clock cycle terms consistent):

CPU Time = CPI X (Clock Cycle Time) X (Instruction Count)

= CPI X Clk X IC

= (CPI ideal + CPI stalls ) X Clk X IC (3.6)

where CPI ideal is one and CPI stalls represents the CPI due to pipeline stalls.

CPIstalls is given by

CPIstalls = SUM(Penalty s X Frequency s) (3.7)

which sums over all types of stalls the product of the frequency of the stall and the stall cost (i.e., the number of penalty cycles the system must remain idle to clear the stall) in cycles.

Using Equation 3.6, we can arrive at a condition that must hold whenever the four stage pipe is a "win" performance-wise:

CPU Time3 >= CPU Time4

IC3 X (1 + CPIstall ,3 ) X Clk3 >= IC4 X (1 + CPIstall,;4 ) X Clk4

(1 + CPIstall,3 ) T >= (1 + CPIstall ,4 ) (T-d) (3.8)

where the subscripts on the various terms indicate which pipeline depth the term is associated with and the CPI for stalls is given by Equation 3.7. Because the addition of a stage does not change the number of instructions executed by the machine, the instruction count terms IC3 and IC4 are equal and can be canceled. Finally, from the exercise statement, Clk3 and Clk4have been replaced by T and T-d, respectively. In the following exercises, pipeline designs are compared by finding values for the unknowns in Equation 3.8 and the reducing the resulting expression.

Exercise 3.10(a)

This exercise asks us to consider the data hazard outlined in the exercise statement and arrive at a lower bound on the clock cycle reduction, d, which

makes moving to a four-stage pipeline profitable in terms of performance. To solve the exercise, we first find values for the CPI due to stalls in both

the three-stage and four-stage pipelines, CPIstall ,3 and CPIstall,4 , respectively. From the exercise statement, we learn that stalls can potentially

occur between instructions i and i + 1 and between instructions i and i + 2, as summarized in Figure 3.26. A hazard can not occur between instruction

i and both instructions i +1 and i +2 as the exercise states that \each result has exactly one use." The frequencies are given in the exercise statement,

and the penalties can be determined by examining how the pipeline behaves. Because a data hazard can not occur between instructions i and i +2 in the

three-stage pipeline, the value of the penalty in this case is zero. 2 As an aside, neither pipeline implements split-phase reading/writing of the regis-

ter file. Such support would require that the three-stage pipeline be able to execute and write a result in the first half of the cycle, which is not likely

to be possible for reasonable clock cycles. Because the four-stage pipeline is based upon the three-stage pipeline, we also assume that it also does not

implement split-phase reading/writing. Using the information from Figure 3.26 along with Equation 3.7 developed above for stalls leads to a CPIstall,3 of

CPIstall ,3 = 1 ( p^-1)+0(p^-2)=1/p

To find an expression for the lower bound on the clock cycle reduction, d, required to make moving from a three to a four stage pipeline profitable, we can solve for d in the above equations and end up with

d>= T/(p+1)

Exercise 3.10(b)

For this exercise, we are interested in finding the frequency of conditional branches that could exist in a program before it would run slower on the fourstage pipe than on the three stage pipe. As was the case in Exercise 3.10, wewill apply Equation 3.8 but with a few extra twists. The first twist comesin the inclusion of the effects of branch hazards. To include branch hazards, we simply add a few branch-related terms to the CPIstall terms in Equation 3.8. Figure 3.27 summarizes the relevantinformation. In this figure, the Taken/Not-Taken frequencies are given inthe exercise statement, and the stall cycles can be found by examining whatgoes on in the pipeline. Since an explicit value for the branch frequency isnot given (indeed, this is what we are after), we have called it, for lack ofa better name, b.

The second twist lies in a change to the number of stall cycles caused by data hazards on the four stage pipeline. The exercise states that a bypasspath has been provided in the four stage pipeline to eliminate only datahazards between instructions i and i + 2. Consider what this bypass pathdoes: it passes the value being written into the register file from WB ofinstruction i to EM of instruction i+2. This same path can be used to reducethe number of stall cycles cause by a data hazard between instructions i andi + 1 in the four stage pipeline. Figure 3.28 shows how the pipeline behavesduring a hazard between instructions i and i +1 both with and with out thebypass path between instructions i and i + 2. Without the bypass path the pipeline stalls two cycles to allow instructioni time to complete WB before instruction i + 1 reads its operands in OD.With the bypass path from WB and EM, it is possible to forward the resultof instruction i from WB in cycle 4 to the EM of instruction i + 1 in cycle 5after stalling only one cycle! Therefore, with respect to data hazards the fourstage pipe with bypass has the same number of stall cycles (see Figure 3.27)and thus effectively looks to be three stages. Figure 3.29 summarizes the behavior of the pipelines with respect to datahazards. The addition of the bypass path described above has made bothpipelines look the same with respect to their stall behavior. As an aside,neither pipeline implements split phase reading/writing of the register file.

Such support would require that the three-stage pipeline be able to execute and write a result in the first half of the cycle which is not likely to bepossible for reasonable clock cycles. Because the four-stage pipeline is basedupon the three stage pipeline, we also assume that it also does not implementsplit phase reading/writing.

CPI stall,3 = (1+1.6bp)/p CPI stall,4=(1+2.6bp)/p b<= .14(P+1/p)

Exercise 3.13

In this exercise, we are asked to give the forwarding logic for the FP and integer instructions for the DLX pipeline shown in Figure 3.44 assuming that there is one combined integer/FP register le. We assume that both the integer and FP pipelines can \feed" each other and thus, for example, make no attempt to decide whether a FP instruction that is feeding its result to an integer instruction makes sense or not|we simply allow it.

If multiple entries are given in a row, then that implies all possible combinations of checks are required.
Results are always forwarded to the ID/(EX,A1,M1) pipeline register based on the opcode of the destination instruction. This may or may

not require more forwarding logic depending on how these pipeline registers are implemented (because there are now multiple possible destination pipeline registers instead of only ID/EX).

Source Instruction		Destination Instruction
Pipeline Register	Opcode	Opcode	Pipeline Register
MEM/WB	FP Load	FP Mult	ID/M1
MEM/WB	FP Load	FP Add	ID/A1
A4/MEM MEM/WB	FP Add	FP Add	ID/A1
A4/MEM MEM/WB	FP Add	FP Mult	ID/M1
M7/MEM MEM/WB	FP Mult	FP Mult	ID/M1
M7/MEM MEM/WB	FP Mult	FP Add	ID/A1

Figure 3.32: DLX FP Forwarding Logic.

The source of a forwarded result comes from either the MEM/WB, EX/MEM, A4/MEM, or M7/MEM pipeline register.
Each row represents several forwarding paths depending on the opcode of the destination instruction. If the destination is a branch, store, or ALU-immediate, then it uses the top ALU input. If it is register- register, then it could use either ALU input (as shown explicitly in Figure 3.19 in the text).
Based on the opcode of the source instruction, individual rows also expand into multiple row entries based on whether the instruction were an ALU-immediate or register-register ALU instructions because the target register bit fields for these instruction types are different.

Figure 3.33 presents the forwarding logic for this exercise. In this fi gure it is understood that the result is forwarded if the destination register of the source instruction is the same as a source register used by the destination instruction. By combining the integer and FP register in to one register set, this requires that FP pipelines check the integer outputs and that the integer pipelines check the FP pipeline outputs. This means more forwarding paths and more checks in order to implement forwarding

Source Instruction Destination Instruction

Pipeline Opcode Opcode

EX/MEM Integer ALU (not loads) Any

MEM/WB Integer ALU Any

MEM/WB Loads Any

A4/MEM FP Add Any

MEM/WB FP Add Any

M7/MEM FP Multiply Any

MEM/WB FP Multiply Any

Figure 3.33: Forwarding Logic for a Version of DLX with a Combined FP and Integer Register Set.

Exercise 4.1

There are seven dependences in the C loop presented in the exercise: 1. True dependence from S1 to S2 on a.

2. Loop-carried true dependence from S4 to S1 on b.

3. Loop-carried true dependence from S4 to S3 on b.

4. Loop-carried true dependence from S4 to S4 on b.

5. Loop-carried output dependence from S1 to S3 on a.

6. Loop-carried antidependence from S1 to S3 on a.

7. Loop-carried antidependence from S2 to S3 on a.

For a loop to be parallel, each iteration must be independent of all others, which is not the case in the code used for this exercise.

Because dependences 1, 2, 3, and 4 are \true" dependences, they can not be removed through renaming. In addition, as dependences 2, 3, and 4 are loopcarried, they imply that iterations of the loop are not independent. These factors together imply the loop can not be made parallel as the loop is written. By \rewriting" the loop it may be possible to fi nd a loop that is functionally equivalent to the original loop that can be made parallel. Exercise 4.2 provides an example of such a situation on a different loop.

Exercise 4.4

In this exercise, we are asked to unroll the loop and schedule it for a pipelined version of DLX. We assume that the loop was originally executed a non-zero, even number of times, otherwise, more sophisticated transformations would be required. We have also scheduled the branch-delay slot.

When unrolling a loop with no loop-carried dependences, one can follow some basic guidelines. First, copy all the statements in the original loop and put them after the original loop statements. Second, rename all the registers in the copied instructions so that they are distinct from the original statements (this can be done by adding a fixed value to each register number, assuming there are enough registers available). Third, interleave all of the statements by putting the i'th instruction in the group of copied instructions right after the i'th instruction in the original sequence. These steps yield a schedule without violating data dependences.

One can then remove loopoverhead instructions and rearrange other instructions as necessary to cover pipeline latencies. Instructions that use or update index calculations will have to be updated based on reordering of instructions or elimination of intermediate index updates.

Doing these steps and reordering instructions to cover any remaining latencies, yields the code shown in Figure 4.2. In this code, the comments indicate the amount of latency and the instruction from which the latency is measured (e.g., \>1 from LD F0,0(R1)" indicates the instruction must follow the specifi ed load by more than one cycle).

loop: LD F0, 0(R1)

LD F6, -8(R1)

MULTD F0, F0, F2 ; >1 from LD F0

MULTD F6, F6, F2 ; >1 from LD F6

LD F4, 0(R2)

LD F8, -8(R2)

ADDD F0, F0, F4 ; >3 from MULTD F0

ADDD F6, F6, F8 ; >3 from MULTD F6

SUBI R1, R1, 16

SUBI R2, R2, 16

SD 8(R2), F0 ; >2 from ADDD F0

BNEZ R1, loop

SD 0(R2), F6 ; >2 from ADDD F6; fills branch delay

Figure 4.2: Code Unrolled Once and Scheduled.

Exercise 4.7

There are many code sequences that stall a scoreboard-based system yet do not stall a Tomasulo-based system. Such code sequences contain two instructions that have the following features: first, the two instructions attempt to read their source registers in the same cycle, and second, the two instructions utilize the same group of functional units in the scoreboard system. For example, consider the DLX FP code fragment shown in Figure 4.6. In this code fragment, the addd issues and computes the value of f0. During the period the addd is computing the value of f0, two multd instructions issue. The multiplies can both issue because both the scoreboard and Tomasulo architectures developed in Section 4.2 of the text have two FPmultipliers (see Figures 4.2 and 4.8) and the latency of the addd specified by the exercise is greater than the time to issue both multiply instructions. Although the multiplies issue, they depend on the value of f0 produced by the addd

qux: addd f0, f2, f4 ; f0 =f2 + f4

multd f6, f0, f10 ; sinks f0 from addd

multd f8, f0, f10 ; sinks f0 from addd

Figure 4.6: Code Fragment which Stalls a Scoreboard Machine but not a Tomasulo Machine.

and therefore stall after issuing until the addd completes. When the addd completes, things get interesting. In the scoreboard system the two floating point multipliers share their input buses (see Figure 4.3 in the text). Therefore, the multipliers must serialize their accesses to f0 because only one multiply can read its operands from the register file at a time. This implies that the second multiply stalls another cycle (in addition to the cycles it has stalled waiting for the addd to complete) because it must wait for the first multiply to finish reading its operands before it can proceed.

In the case of Tomasulo's algorithm, when the addd completes its result is placed on the CDB where any reservation station that holds an instruction sinking the value can retrieve it (see Figure 4.8 in the text). This allows the two multiplies to read their operands in parallel.

Exercise 4.11

This exercise is solved by comparing the CPIs both with and without a branch-target bu er (BTB) that folds unconditional branches, CPIfold and

CPIno fold (we might also approach this exercise by fi guring out the speedup of the pipeline speedups with and without the branch-folding BTB|the resulting speedup equation, when simplified, is the identical to the equation for the speedup of the CPIs). This speedup is given by

Speedup = CPI no fold/CPIfold

= CPI no fold/CPI base + Stalls fold

From the exercise statement, CPIno fold is 1:1 (this value accounts for the stalls in this case, thus there is no stall term in Equation 4.4 for the no fold case) and CPIbase is 1:0. By defi nition, the base CPI accounts for everything

except unconditional branches. To complete the solution, we must nd the number of stall cycles that are caused by folding unconditional branches.

To nd Stallsfold we begin with the following expression:

Stalls = Sum(Frequency s X Penalty s)

which essentially sums over all stall cases related to branch folding the prod- uct of the frequency of the stall case and the penalty. To compute a value for Stallsfold , we first consider what goes on in the DLX pipeline when an unconditional branch-folding BTB is present.

Figure 4.10 illustrates how the pipeline behaves when an unconditional branch is found in the BTB. When the unconditional branch is in the BTB, then the target instruction is retrieved from the BTB and passed along to the ID stage in cycle two (note that it is implied in the discussion of branch folding in the text that the BTB can return the target instruction in time to be passed along to ID in cycle two). The net e ect of this is just as if the fetch

Instruction Clock Cycle 1 2 3 4 5 6

Branch IF

i + 1 *ID* EX MEM WB

i + 2 IF ID EX MEM . ..

i+ 3 IF ID EX . ..

Figure 4.10: Effect on the DLX Pipeline of a BTB Hit.

Instruction Clock Cycle 1 2 3 4 5 6

Branch IF ID EX MEM WB

i + 1 IF IF ID EX . . .

i+ 2 stall ID EX . . .

Figure 4.11: Effect on the DLX Pipeline of a BTB Miss.

in cycle one had fetched the target of the branch rather than the branch itself. This implies there is a one cycle negative penalty associated with BTB hits! The reason for this lies in the fact that the BTB eliminates unconditional branches from the instruction stream - the unconditional branch never gets past the IF stage of the pipe and therefore the fetch of instructions i + 2,

i+3, . . . occur one cycle earlier than they could have. Figure 4.11 illustrates what occurs when an unconditional branch instruction can not be found in the BTB (i.e., a BTB miss occurs). Because the branch instruction is not present in the BTB, the fetch issued in cycle two goes to the instruction following the branch in the code rather than the instruction at the target of the branch. By the end of cycle two, the ID stage determines the outcome of the unconditional branch and re-issues the fetch of instruction i + 1 in cycle three. This creates a one cycle penalty in the pipeline since the fetches for instructions i + 2,i+3, . . . are delayed one cycle.

Now that we know how many stall cycles can be caused by the BTB, we can compute the frequencies with which the stalls occur. From the exercise statement, 5% of all instructions executed by the DLX are unconditional branches. Of these 5%, 90% will hit the BTB and 10% will miss. Figure 4.12 summarizes the information we have up to this point.

BTB Result

Hit 5% X 90% = 4:5% = -1 penalty cycles

Miss 5% X 10% = .5% = 1 penalty cycle

Stalls fold = (4.5%X-1)+(.5%X1) = -0.04

Speedup = 1.1/(1.0+(-0.04)) = 1.15

Adding a BTB that folds unconditional branches makes the DLX pipeline about 15% faster.

Instruction	1	2	3	4	5	6	7	8
Branch	IF	RF	ALU1	MEM	ALU2	WB
Branch+1		IF	stall	stall	stall	IF	RF	ALU1
Branch+2							IF	RF