We discussed a number of active research directions in GPGPU architecture. The first direction is supporting control divergence. We started by looking at how applications perform when we reduce the warp size to 4, without changing the number of available ALUs. As you might expect, some applications perform better: these are control-divergent applications that now experience less divergence with the smaller warps. Handling control divergence can significantly improve performance. Surprisingly, some applications perform worse, due to the loss of memory coalescing. This is also an important observation -- whatever we do should not come at the expense of exacerbating other bottlenecks. The first solution we looked at is Dynamic Warp Scheduling (DWS) from Micro 2007. At a high level, the idea is to exploit the GPU lanes that sit idle under control divergence by letting threads from other warps (in the same kernel and with the same program counter) use these idle lanes. One limitation is that a thread cannot change its GPU lane: on slide 14, threads 2 and 3 (in lanes 2 and 3 respectively) can be scheduled together with threads 5 and 8 from another warp, because threads 6 and 7 of that warp -- which occupy the same lanes as threads 2 and 3 -- are inactive for this control flow branch. With DWS, we manage to increase ALU utilization and reduce the number of instructions needed to complete the branch-divergent part of the code. However, three unexpected problems arise. First, since the scheduler favors the largest group of threads with the same PC, starvation can occur for a small group of threads that is always delayed in favor of a larger group. Eventually, this harms performance, for example when we hit a barrier and have to wait for the starved threads to catch up, increasing run time. Second, DWS can lead to loss of memory coalescing even in code segments that do not have branch divergence. See slide 16 for an example. Finally, DWS can break implicit synchrony within a warp, since the threads of a warp are no longer guaranteed to execute in lock-step (i.e., the same instruction at the same time). See slide 17 for an example of code that gets broken by DWS. Thread block compaction addresses (or rather mitigates) these issues by forcing a reconvergence after a branch-divergent code segment. Starvation is reduced, since the barrier at the end of the divergent segment forces all threads to catch up. Since convergent code is executed without DWS/compaction, memory coalescing is preserved. Finally, the threads remain in lock-step in convergent code segments. We discussed other solutions briefly (up to slide 30).

The second research direction is warp scheduling, primarily to improve memory locality but also for other purposes (e.g., criticality). The main insight is that the scheduler, by determining which warps execute, influences the memory access pattern presented to the memory system. Having more warps to schedule from improves core utilization -- when one or more warps block waiting for memory or synchronization, others are available to keep the GPU working. However, given the severely limited cache space (here we assume L1, but this applies to L2 and lower levels as well), too many active warps can increase cache misses and reduce performance. The baseline schedulers are loose round robin (RR among the available warps) and greedy-then-oldest (keep scheduling from one warp until it stalls due to a cache miss or other event, and then switch to the oldest remaining warp); a small sketch of both policies follows.
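To make the two baseline policies concrete, here is a minimal C++ sketch of the selection logic only (my own illustration, not taken from the slides); the Warp structure and its fields are assumptions made for this sketch.

```cpp
#include <cstdint>
#include <vector>

struct Warp {
    int      id;
    uint64_t age;     // lower value = older (issued earlier)
    bool     ready;   // has an issuable instruction this cycle
};

// Loose round robin: rotate through the warps, issuing from the next ready one.
int pick_lrr(const std::vector<Warp>& warps, int last_issued) {
    int n = (int)warps.size();
    for (int i = 1; i <= n; ++i) {
        int cand = (last_issued + i) % n;
        if (warps[cand].ready) return cand;
    }
    return -1;  // nothing ready this cycle
}

// Greedy-then-oldest: keep issuing from the current warp until it stalls,
// then fall back to the oldest ready warp.
int pick_gto(const std::vector<Warp>& warps, int current) {
    if (current >= 0 && warps[current].ready) return current;  // stay greedy
    int best = -1;
    for (int i = 0; i < (int)warps.size(); ++i)
        if (warps[i].ready && (best < 0 || warps[i].age < warps[best].age))
            best = i;
    return best;
}
```

The behavioral difference is that LRR spreads issue slots across all ready warps, while GTO keeps reissuing from a few old warps, which is what tends to preserve their data in the cache.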
GTO favors scheduling from just enough warps to hide memory latency. There are also two-level variants of both policies, where the scheduler issues only from a small first-level set of warps; occasionally, warps from the second level are promoted to the first level and start getting scheduled. CCWS (slide 40) explicitly takes the size of the cache into account. It works by estimating the amount of cache space each warp needs, and scheduling warps such that their collective need does not exceed the cache capacity. This is a principle behind several of the schedulers we looked at. CCWS estimates the cache needs by keeping a small cache of recent victims from each warp. If a warp experiences a miss and the address matches a tag in the victim cache, the miss was caused by a cache line that was evicted recently, while the warp still needed it; the system then increases its estimate of the cache space needed by that warp (a small sketch of this bookkeeping appears below). At any given time, we schedule warps such that their collective cache requirement does not exceed the cache space. We saw two variations on this general idea. The first, static wavefront limiting, profiles the execution to estimate the cache need of each warp and provides that estimate to the scheduler. Limitations of this approach (and of profiling in general) include, beyond the need to profile in the first place, the inability to account for input-dependent behavior or to track warps whose usage varies over time. The second variant is Divergence Aware Warp Scheduling (DAWS). DAWS has the compiler estimate divergence and then maps that estimate to a cache-size requirement for each warp. DAWS focuses on loops, and classifies memory accesses within loops as divergent or non-divergent. For divergent loads, it estimates the control divergence and the size of the active mask, from which it estimates the number of cache lines needed for that load (one cache line for every active thread). Thus, it replaces the reactive component of CCWS, or the profile-driven estimate of static wavefront limiting, with a more intelligent and proactive estimate from the compiler. We also quickly overviewed some other schedulers. Priority-based cache allocation (slide 50) observes that limiting the number of warps by the L1 capacity leaves most of the L2 unused. So it extends CCWS with another level of warps that are scheduled but not cached in L1. The benefit of having these extra warps active is that they increase utilization when the other warps stall, and they keep more warps moving, reducing the impact of barriers (where we would otherwise have to wait for warps that have not been scheduled to catch up). On slide 51, we talked about criticality-aware scheduling, where we estimate the critical path of the computation and give more resources (scheduling slots and cache space) to the warps on the critical path.

From here we very briefly discussed two other research directions: (1) coherent CPU-GPU memory abstractions (which we deferred); and (2) synchronization support and transactional memory. The final topic we discussed was power efficiency for GPGPUs. We did this using the Warped Gates paper presentation from Micro 2013.
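Before moving to the power discussion, here is the sketch referenced above: a heavily simplified, illustrative C++ rendering of CCWS-style footprint bookkeeping. The structure names, the victim-buffer depth, and the L1 size are assumptions for the sketch, not the paper's actual structures or parameters.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

struct WarpLocality {
    std::deque<uint64_t> victim_tags;   // tags of recently evicted lines for this warp
    int est_lines_needed = 1;           // estimated cache footprint, in lines
};

constexpr int kVictimTags = 8;          // assumed victim-buffer depth per warp
constexpr int kL1Lines    = 512;        // assumed L1 capacity, in cache lines

void on_eviction(WarpLocality& w, uint64_t tag) {
    w.victim_tags.push_back(tag);
    if ((int)w.victim_tags.size() > kVictimTags) w.victim_tags.pop_front();
}

void on_miss(WarpLocality& w, uint64_t tag) {
    // A miss whose tag is still in the victim buffer means the warp lost a line
    // it was about to reuse: raise its estimated footprint.
    if (std::find(w.victim_tags.begin(), w.victim_tags.end(), tag) != w.victim_tags.end())
        ++w.est_lines_needed;
}

// Allow warps (oldest first) until their combined estimate exceeds the L1 size.
int num_schedulable(const std::vector<WarpLocality>& warps) {
    int total = 0, allowed = 0;
    for (const auto& w : warps) {
        if (total + w.est_lines_needed > kL1Lines) break;
        total += w.est_lines_needed;
        ++allowed;
    }
    return allowed;
}
```

In these terms, static wavefront limiting and DAWS effectively replace the reactive on_miss updates with a profile-derived or compiler-derived value for est_lines_needed.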
We started with a quick introduction to the basics of energy consumption on chips: (1) static or leakage power, which is consumed all the time simply because structures such as execution units or caches are kept powered so they are ready to do work when we need them; and (2) dynamic power, which is incurred when we change the state of gates, latches, and flip-flops: this power is consumed when structures are actually used, for example an execution unit performing an operation or a register file being accessed. Techniques to control leakage power are different from those for dynamic power. For leakage power, many of the architectural techniques center on turning off parts of the chip that we are not using. Two of the standard techniques are power gating (cutting off their power supply) and clock gating (preventing the clock signal from reaching them). With respect to dynamic power, techniques include lowering the voltage or frequency, or trying to consolidate operations so that they are done in fewer steps (e.g., memory coalescing also helps the dynamic power of the memory system and the caches).

This paper's goal is to reduce the leakage power of the execution units on the GPU. The authors observe that these execution units are a major component of the GPU (68% of the area) and consume a substantial amount of the chip power. The main tool they want to use is power gating the execution units (turning them off) when they are idle. However, there is a catch: there is a cost to turning off a structure and turning it on again (slide 4). As a result, if we turn off a structure but need to turn it on again quickly, we end up losing energy due to this extra overhead. This leads to the idea of a break-even idle time: the minimum idle time needed for the leakage savings to make up for the cost of turning the execution units off and back on. According to the authors, this is around 9-24 cycles in their design. From a practical perspective, we also need an algorithm to detect idleness. They simply declare an idle period after a few consecutive idle cycles (e.g., they assume a 5-cycle idle-detect threshold in the baseline design). The paper profiles different GPU benchmarks (slide 6) and finds: (1) Blue region: many idle periods are short, below the detection threshold -- a missed opportunity; (2) Red region: 10% of the idle periods are longer than the idle-detect threshold but shorter than the break-even time -- gating these loses power; and (3) Green region: we save energy on only a modest number of idle periods. The rest of the paper tries to come up with ways to shrink the red region (avoid power gating when it is not useful) and to grow the green region (increase the opportunity for useful power gating). The first technique (GATES) was to change the scheduler to group together instructions headed to the same execution units. This way, integer instructions, for example, happen in a burst, followed by a long idle period during which we can turn off the integer units. Using this idea, slide 11 shows that we substantially increase the green region (3x), but we also increase the bad red region by 2x! The reason is that we are now coalescing some of the shorter idle periods so that they are longer, but not long enough to exceed the break-even time. The next idea they introduce is blackout: refuse to issue instructions to an execution unit that has been turned off until its idle time has exceeded the break-even threshold; a simple sketch of how idle detection and blackout fit together follows.
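As an aside, here is a minimal C++ sketch of how an idle-detect plus blackout controller for one execution unit could be organized. This is my own illustration of the mechanism described above, not the paper's design; the cycle counts are assumptions (a 5-cycle idle detect and a break-even value inside the 9-24 cycle range quoted above), and wake-up latency is not modeled.

```cpp
struct GateController {
    enum State { ON, OFF };
    State state       = ON;
    int   idle_cycles = 0;   // consecutive cycles with no instruction issued
    int   off_cycles  = 0;   // cycles spent power gated

    static constexpr int kIdleDetect = 5;   // assumed idle-detect threshold
    static constexpr int kBreakEven  = 14;  // assumed break-even idle time

    // Blackout: once the unit is gated, refuse new work until break-even is reached.
    bool can_issue() const {
        return state == ON || off_cycles >= kBreakEven;
    }

    // Advance one cycle; issued is true if an instruction used this unit this cycle.
    void tick(bool issued) {
        if (state == ON) {
            idle_cycles = issued ? 0 : idle_cycles + 1;
            if (idle_cycles >= kIdleDetect) {   // sustained idleness: power gate
                state = OFF;
                off_cycles = 0;
            }
        } else {
            ++off_cycles;
            if (issued) {                       // woken up (only possible past break-even)
                state = ON;
                idle_cycles = 0;
            }
        }
    }
};
```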
Blackout pushes the red region into the green region, but it potentially increases execution time, since some instructions have to wait for the execution unit to be turned back on. The authors argue that this is not always a loss, since other warps may have ready instructions that can use the active execution units in the meantime. The third idea (coordinated blackout) appears to introduce two new elements: (1) the ability to turn off half of the execution units at a time; and (2) coordination between the GATES scheduler and the blackout controller to help fill in the blackout gaps. Despite these improvements, coordinated blackout still leads to around a 10% loss in performance (slide 17). In fairness, we should use a metric that combines performance and power to judge these ideas, since gaining energy efficiency usually requires sacrificing some performance. Interestingly, the loss is much worse for some applications than others (in fact, a few applications such as sgemm benefit). To address this issue, the authors observe that the idle-detect period plays a big role in the performance of some applications, and they introduce their final idea of adapting it (if there are too many useless wakeups, they increase it), leading to their final scheme, which they call Warped Gates; a rough sketch of this adaptive idea appears below. The leakage energy saving is about 1.5x, the performance loss is around 5%, and the complexity is low (around a 1% increase in chip area).
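To illustrate the adaptive piece, here is a rough C++ sketch of one way an adaptive idle-detect threshold could work: if too many gating events end in a wakeup before break-even, raise the threshold so we gate less aggressively; if such wakeups become rare, lower it again to recover more idle periods. The window size, thresholds, and adjustment policy are all assumptions for the sketch, not the paper's actual algorithm.

```cpp
struct AdaptiveIdleDetect {
    int idle_detect_threshold = 5;         // current idle-detect threshold (cycles)
    int early_wakeups         = 0;         // gated, but needed again before break-even
    int total_gatings         = 0;

    static constexpr int kWindow    = 64;  // assumed sampling window, in gating events
    static constexpr int kMinThresh = 5;
    static constexpr int kMaxThresh = 32;

    // Called once per power-gating event, after the unit wakes back up.
    void record_gating_event(bool woke_before_break_even) {
        ++total_gatings;
        if (woke_before_break_even) ++early_wakeups;
        if (total_gatings < kWindow) return;

        if (early_wakeups > kWindow / 4) {
            // Too many useless wakeups: gate less aggressively.
            idle_detect_threshold *= 2;
            if (idle_detect_threshold > kMaxThresh) idle_detect_threshold = kMaxThresh;
        } else if (early_wakeups < kWindow / 16) {
            // Gating is almost always paying off: recover more idle periods.
            idle_detect_threshold /= 2;
            if (idle_detect_threshold < kMinThresh) idle_detect_threshold = kMinThresh;
        }
        total_gatings = early_wakeups = 0;
    }
};
```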