Architecture discussion starts with the ISA. For NVIDIA, PTX is a virtual ISA: it is not the real ISA, but it is close (for example, it allows the use of an unlimited number of registers). NVCC compiles PTX down to the true ISA, called SASS, when it produces the binary. Different SASS binaries, corresponding to different compute capabilities, are included in the executable. When the program executes, it picks the right version of SASS for the GPU platform, or just-in-time compiles the PTX if the platform is newer than any of the included binaries. Look at the slides for examples of PTX. It is a RISC instruction set. Notice the use of predication for conditional branches.

GPGPU Architecture:
-------------------
The front end of a SIMT core (in green on the slides) implements the fetch, decode, and schedule-for-execution components of the GPU architecture -- it multiplexes the different warps on top of the available SIMD data path.

Fetch Decode Stage:
------------------------------
The pipeline starts with the fetch/decode stage. Instructions are scheduled from an instruction buffer (I-Buffer), which has a fixed number of instruction slots for every warp. So, we can only bring more instructions into the I-Buffer for a given warp if that warp has empty slots there. Each I-Buffer slot has a valid bit (v) which, if set, indicates that the slot holds a valid instruction. It also has a ready bit (r), which indicates whether the instruction is waiting on a dependency tracked by the scoreboard (explained soon). For a warp to have an available slot in the I-Buffer, and therefore to be a candidate to fetch more instructions, it should have at least one slot empty (v bit is 0). All warps with availability in the I-Buffer are candidates for fetching instructions. The v bits are communicated to the fetch arbitration logic (see the arrow marked "to fetch" going left from the I-Buffer, which is the same arrow marked Valid[1:N] going into the fetch logic). The first scheduler then looks at the warps that have room, based on the valid bits, and selects one of them to fetch instructions for from the I-Cache. If the instructions are in the I-Cache, they are fetched, decoded, and placed in the I-Buffer. If not, we issue a fetch of the instructions from memory and continue operation (we do not wait for them). The warp will be scheduled again later, at which point we will have an I-Cache hit.

Instruction Issue:
------------------
A warp is eligible to issue an instruction if it has a valid and ready (according to the scoreboard) instruction in the I-Buffer. Remember that this is an in-order pipeline, so we are talking about the first instruction in program order being valid and ready; we do not issue instructions out of order. Here the second scheduler (the warp scheduler) comes into play to decide which warp to run next. Policies such as greedy-then-oldest (GTO) and odd/even scheduling are used here. We will talk about scheduling in more detail later. A small toy model of these two arbitration steps (fetch and issue) follows.
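To make the two-scheduler structure concrete, here is a minimal software sketch of the front end. The sizes, field names, linear scans, and the slot-0-as-head simplification are all illustrative assumptions, not the exact hardware design.

    #include <cstdint>

    constexpr int kWarps = 48;        // warps resident on one SIMT core (assumed)
    constexpr int kSlotsPerWarp = 2;  // I-Buffer slots per warp (assumed)

    struct IBufferSlot {
        bool     v = false;   // valid bit: slot holds a decoded instruction
        bool     r = false;   // ready bit: set by the scoreboard
        uint64_t pc = 0;      // which instruction the slot holds
    };

    IBufferSlot ibuf[kWarps][kSlotsPerWarp];

    // First scheduler (fetch arbitration): pick a warp with a free slot
    // (some v bit is 0). Real designs typically round-robin; a linear
    // scan is used here for brevity.
    int pick_warp_to_fetch() {
        for (int w = 0; w < kWarps; ++w)
            for (int s = 0; s < kSlotsPerWarp; ++s)
                if (!ibuf[w][s].v) return w;
        return -1;  // every warp's slots are full; nothing to fetch
    }

    // Second scheduler (issue), greedy-then-oldest flavor: keep issuing
    // from the last warp while its head instruction is valid and ready,
    // otherwise fall back to the oldest such warp. Slot 0 stands in for
    // the head of each warp's in-order stream, and warp index stands in
    // for age -- both simplifications.
    int pick_warp_to_issue(int last) {
        if (last >= 0 && ibuf[last][0].v && ibuf[last][0].r)
            return last;
        for (int w = 0; w < kWarps; ++w)
            if (ibuf[w][0].v && ibuf[w][0].r) return w;
        return -1;  // no warp can issue this cycle
    }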
Scoreboard:
-----------
GPUs work in order and issue one or two instructions at a time (the odd/even scheduler can issue two instructions in a cycle, but from different warps). Modern GPUs have multiple warp schedulers and can issue even more (again, from different warps). Scoreboarding keeps track of dependencies between instructions from the same warp. A warp issues one instruction in each cycle in which the scheduler selects it to execute. However, since instruction execution can take multiple cycles, two instructions from the same warp can be active at the same time. The scoreboard makes sure we do not allow an instruction to start executing if it depends on a previous instruction that is still executing. It is the hardware scoreboard that sets the ready bit in the I-Buffer (slides 22 to 24).

To simplify, a scoreboard can simply track the state of each register of each warp. When a warp issues an instruction whose destination is a register, we mark that register to say that the value in it is not yet available. This way, if a later instruction needs this value, it knows to wait until the value is generated by the instruction that set the bit. This simple approach takes care of both Read After Write (RAW) and Write After Write (WAW) dependencies. To illustrate, here is an example:

    add r3, r2, r1   // r3 = r2 + r1
    sub r5, r3, r4   // RAW on r3
    add r5, r2, r1   // WAW on r5

After the first instruction issues, we mark r3 as unavailable. When the sub instruction arrives, it cannot issue since r3 is not ready. After the first instruction completes, sub can now read the new value of r3 and issue, marking its destination register r5 as unavailable. The third instruction cannot issue until then, since it also writes to r5 (WAW). (What do we do about Write After Read, WAR, dependencies?) They are not a problem in this case because we issue in order: the read has already picked up its register values by the time we issue the write that follows it.

Implementation of the scoreboard: it is very expensive to implement the scoreboard as described above. We would need a bit for every register of every warp, and all of them would have to be updated as we operate. So, we take a shortcut: we track only a limited number of dependencies for every warp -- say, six. If we run out of slots, we delay the issue of otherwise-ready instructions that need a dependency slot. Six was chosen to balance overhead against the likelihood of blocking instructions just because there is not enough space in the scoreboard. We also need a scoreboard table that identifies which register each of the six entries is tracking. See the example on the slides for how the implementation works; a minimal software sketch of it follows.
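Here is a toy model of that limited-entry scoreboard for one warp. The six-entry size comes from the lecture's example; the structure and function names are illustrative assumptions.

    #include <cstdint>

    constexpr int kEntries = 6;  // dependency slots per warp, as in the lecture

    // Scoreboard state for one warp.
    struct Scoreboard {
        uint8_t reg[kEntries]  = {};  // which register each entry tracks
        bool    busy[kEntries] = {};  // entry in use: a write to reg[] is pending

        // An instruction may issue only if none of its source or destination
        // registers has a pending write. Checking sources catches RAW;
        // checking the destination catches WAW.
        bool can_issue(const uint8_t* regs, int n) const {
            for (int e = 0; e < kEntries; ++e)
                if (busy[e])
                    for (int i = 0; i < n; ++i)
                        if (reg[e] == regs[i]) return false;
            return true;
        }

        // On issue, reserve an entry for the destination register. If all
        // six entries are in use, the warp must stall even though no true
        // dependency exists -- the overhead/blocking trade-off above.
        bool reserve(uint8_t dst) {
            for (int e = 0; e < kEntries; ++e)
                if (!busy[e]) { busy[e] = true; reg[e] = dst; return true; }
            return false;
        }

        // On writeback, free the entry so dependent instructions can issue.
        void release(uint8_t dst) {
            for (int e = 0; e < kEntries; ++e)
                if (busy[e] && reg[e] == dst) busy[e] = false;
        }
    };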
Supporting Control Divergence
------------------------------
A critical part of the SIMT model (of the SIMD model, too!) is how to enable threads to carry out conditional statements based on local data, which can be different for different threads. This is the behavior that leads to thread divergence. The support for this behavior is provided by the SIMT stack. This stack sends the target PC to the fetch unit (to control which instruction is fetched next) and the active mask to the issue unit (to control which lanes of the warp are active).

When we hit a conditional branch, different threads may evaluate the condition differently (taken vs. not taken). Thread divergence is supported by keeping track of a mask of the active threads for every control flow path. The mask is a bit vector with a 1 for every thread that is active on the corresponding path. While that path is being executed, only the threads with a 1 in the corresponding mask bit execute its instructions.

To minimize the impact of control divergence, threads must reconverge at the earliest possible point once the divergence is done (even though it is possible to reconverge later, or even at the end of the program). This earliest point is called the immediate post-dominator in control flow graph/compiler terminology. The active mask at the post-dominator covers the active masks of all the control flow paths that meet there; in other words, it has a 1 for every thread that is active on any of the divergent paths that reconverge at that point.

To support nested divergence and reconvergence, a SIMT stack is used; the implementation is explained on slide 25. When we hit a divergent point, we push onto the stack: (1) the current active mask and the next PC at the reconvergence point; and (2) the active mask, PC, and reconvergence PC for each side of the branch. We start executing one side with its specified active mask until we reach the reconvergence PC. At that point, that path is done and is popped off the stack. We set the active mask and PC from the new top of the stack and execute until that path, too, hits the reconvergence PC. Once we are done with all the paths, the stack holds only the original active mask and the PC at the reconvergence point -- we are converged!

In practice, this is implemented using predication. We went through an example. Predication support in the instruction set works as follows. If an instruction is preceded by @P, its execution is predicated on the value of the predicate register P (i.e., it only occurs for threads whose corresponding bit in P is true). P is a register with one bit corresponding to every thread. Typically, P is set by a previous setp instruction, which sets its value at every thread based on the specified comparison result. For example, the setp.neq.s32 instruction compares RD0 with the immediate 0 and sets P1 where they are not equal. If an instruction ends with a *OP operation, this is an unconditional operation on the active mask/active-mask stack: *Push pushes the current mask onto the stack, *Comp complements the mask bits, and *Pop pops a mask off the stack. This is a simplified example, and I do not expect you to know the details of the implementation -- it is enough to understand the conceptual implementation on slide 25.
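To make that conceptual implementation concrete, here is a toy model of the SIMT stack for a single warp, written directly from the description above. The struct layout, function names, and the handling of the reconvergence entry's own rpc field are illustrative assumptions, not a specific GPU's hardware.

    #include <cstdint>
    #include <vector>

    // One SIMT stack entry, following the description of slide 25.
    struct SimtEntry {
        uint32_t active_mask;  // one bit per thread of a 32-thread warp
        uint32_t pc;           // where these threads (re)start executing
        uint32_t rpc;          // reconvergence PC (immediate post-dominator)
    };

    std::vector<SimtEntry> simt_stack;  // the top entry drives fetch

    // On a divergent branch: push (1) the reconvergence entry, then
    // (2) one entry per side of the branch. The top of stack runs first.
    void diverge(uint32_t cur_mask, uint32_t taken_mask,
                 uint32_t taken_pc, uint32_t fallthrough_pc, uint32_t rpc) {
        simt_stack.push_back({cur_mask, rpc, 0});  // (1) converged continuation
        simt_stack.push_back({cur_mask & ~taken_mask, fallthrough_pc, rpc});
        simt_stack.push_back({taken_mask, taken_pc, rpc});
    }

    // Called when the running entry's pc reaches its rpc: that path is
    // done, so pop it. The next path -- or, after the last path, the
    // converged continuation with the full mask -- takes over.
    void reconverge() { simt_stack.pop_back(); }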
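On the software side, this machinery is triggered by ordinary data-dependent branches. The kernel below is a minimal example; the comments about what the compiler emits (predicated instructions versus a branch handled through the SIMT stack) describe typical, not guaranteed, behavior.

    __global__ void divergent(int* data, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        // Data-dependent branch: lanes of the same warp can disagree.
        // The compiler may emit setp to compute a per-thread predicate
        // and predicate these short bodies directly (@P1 ...), or emit
        // a real branch whose divergence the SIMT stack handles.
        if (data[tid] % 2 == 0)
            data[tid] *= 2;    // active mask: lanes holding even values
        else
            data[tid] += 1;    // active mask: lanes holding odd values

        // Reconvergence point: the immediate post-dominator of the
        // branch. All lanes that entered the if/else are active again.
        data[tid] -= 1;
    }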
Register File:
--------------
Next, we discussed the register file. Once instructions are ready to issue, they must get their operands from the register file. The register file is very large, and we would need to make it at least 4-ported if we want to be able to read three operands (as required by multiply-and-add) and write one in a single cycle. Each port is also very wide, since it needs to deliver 32 register values at a time. This is very expensive for a register file of this size. So, to simplify the design, we implement the register file as a multi-banked structure, with each bank being single-ported. If we are lucky and the four operands go to different banks, we get the effect of a multi-ported register file. However, if the register operands map to the same bank, the accesses have to be serialized.

The serialization has two implications. First, even when an instruction is ready to issue, its operands may not be available immediately, so instructions have to stay in place until their operands have been read by the operand collector logic. This is a concept similar to reservation stations in Tomasulo's algorithm for scheduling instruction execution in out-of-order processors. Second, access serialization can slow down execution. If we allow accesses from different warps to proceed together, and carefully schedule them so that accesses to different banks happen at the same time, performance is significantly improved. This scheduling is also done by the operand collector. Slide 35 shows an example.

AMD Southern Islands:
---------------------
Finally, we talked about an optimization that AMD implements in their pipeline. The main idea is that device code sometimes uses scalar variables -- values that are the same for every thread. If these scalar operations are executed by all the lanes of a warp, very low utilization results. So, AMD includes a scalar ALU as part of the SIMT core, with its own instructions, to improve the efficiency of handling scalars.

This concludes the basic architecture overview. In the next note, we will start discussing research directions.