We discussed a number of active research directions in GPGPU architecture. The first direction is supporting control divergence. We started by looking at how applications perform when we reduce the warp size to 4, without changing the number of available ALUs. As you might expect, some applications perform better: these are control-divergent applications that now experience less divergence with the smaller warps. Handling control divergence can significantly improve performance. Surprisingly, some applications perform worse, due to the loss of memory coalescing. This is also an important observation -- whatever we do should not come at the expense of exacerbating other bottlenecks. The first solution we looked at is Dynamic Warp Scheduling (DWS) from Micro 2007. At a high level, the idea is to exploit the GPU lanes that sit idle under control divergence by letting threads from other warps (in the same kernel and with the same program counter) use these idle lanes. One limitation is that a thread cannot change its GPU lane: on slide 14, threads 2 and 3 (in lanes 2 and 3 respectively) can be scheduled together with threads 5 and 8 from another warp, because threads 6 and 7 of that warp -- which occupy the same lanes as threads 2 and 3 -- are inactive for this control flow branch. With DWS, we manage to increase ALU utilization and reduce the number of instructions needed to complete the branch-divergent part of the code. However, three unexpected problems arise. First, since the scheduler favors the largest group of threads with the same PC, starvation can occur for a small group of threads that is always delayed in favor of a larger group. Eventually, this harms performance, for example when we hit a barrier and have to wait for the starved threads to catch up, increasing run time. Second, DWS can lead to loss of memory coalescing even in code segments that do not have branch divergence. See slide 16 for an example. Finally, DWS can break implicit synchrony within a warp, since the threads of a warp are no longer guaranteed to execute in lock-step (i.e., the same instruction at the same time). See slide 17 for an example of code that gets broken by DWS. Thread block compaction addresses (or rather mitigates) these issues by forcing a reconvergence after a branch-divergent code segment. Starvation is reduced, since the barrier at the end of the divergent segment forces all threads to catch up. Since convergent code is executed without DWS/compaction, memory coalescing is preserved. Finally, the threads remain in lock-step in convergent code segments. We discussed other solutions briefly (up to slide 30).

The second research direction is warp scheduling, primarily to improve memory locality but also for other purposes (e.g., criticality). The main insight is that the scheduler, by determining which warps execute, influences the memory access pattern presented to the memory system. Having more warps to schedule from improves core utilization -- when one or more warps block waiting for memory or synchronization, others are available to keep the GPU working. However, given the severely limited cache space (here we assume L1, but this applies to L2 and lower levels as well), too many active warps can increase cache misses and reduce performance. The baseline schedulers are loose round robin (RR among the available warps) and greedy-then-oldest (keep scheduling from one warp until it stalls due to a cache miss or other event, and then switch to the oldest remaining warp); a small sketch of both policies follows.
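To make the two baseline policies concrete, here is a minimal C++ sketch of the selection logic only (my own illustration, not taken from the slides); the Warp structure and its fields are assumptions made for this sketch.

```cpp
#include <cstdint>
#include <vector>

struct Warp {
    int      id;
    uint64_t age;     // lower value = older (issued earlier)
    bool     ready;   // has an issuable instruction this cycle
};

// Loose round robin: rotate through the warps, issuing from the next ready one.
int pick_lrr(const std::vector<Warp>& warps, int last_issued) {
    int n = (int)warps.size();
    for (int i = 1; i <= n; ++i) {
        int cand = (last_issued + i) % n;
        if (warps[cand].ready) return cand;
    }
    return -1;  // nothing ready this cycle
}

// Greedy-then-oldest: keep issuing from the current warp until it stalls,
// then fall back to the oldest ready warp.
int pick_gto(const std::vector<Warp>& warps, int current) {
    if (current >= 0 && warps[current].ready) return current;  // stay greedy
    int best = -1;
    for (int i = 0; i < (int)warps.size(); ++i)
        if (warps[i].ready && (best < 0 || warps[i].age < warps[best].age))
            best = i;
    return best;
}
```

The behavioral difference is that LRR spreads issue slots across all ready warps, while GTO keeps reissuing from a few old warps, which is what tends to preserve their data in the cache.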
GTO favors scheduling from just enough warps to hide memory latency. There are also two-level variants of both policies, where the scheduler issues only from a small first-level set of warps; occasionally, warps from the second level are promoted to the first level and start getting scheduled. CCWS (slide 40) explicitly takes the size of the cache into account. It works by estimating the amount of cache space each warp needs, and scheduling warps such that their collective need does not exceed the cache capacity. This is a principle behind several of the schedulers we looked at. CCWS estimates the cache needs by keeping a small cache of recent victims from each warp. If a warp experiences a miss and the address matches a tag in the victim cache, the miss was caused by a cache line that was evicted recently, while the warp still needed it; the system then increases its estimate of the cache space needed by that warp (a small sketch of this bookkeeping appears below). At any given time, we schedule warps such that their collective cache requirement does not exceed the cache space. We saw two variations on this general idea. The first, static wavefront limiting, profiles the execution to estimate the cache need of each warp and provides that estimate to the scheduler. Limitations of this approach (and of profiling in general) include, beyond the need to profile in the first place, the inability to account for input-dependent behavior or to track warps whose usage varies over time. The second variant is Divergence Aware Warp Scheduling (DAWS). DAWS has the compiler estimate divergence and then maps that estimate to a cache-size requirement for each warp. DAWS focuses on loops, and classifies memory accesses within loops as divergent or non-divergent. For divergent loads, it estimates the control divergence and the size of the active mask, from which it estimates the number of cache lines needed for that load (one cache line for every active thread). Thus, it replaces the reactive component of CCWS, or the profile-driven estimate of static wavefront limiting, with a more intelligent and proactive estimate from the compiler. We also quickly overviewed some other schedulers. Priority-based cache allocation (slide 50) observes that limiting the number of warps by the L1 capacity leaves most of the L2 unused. So it extends CCWS with another level of warps that are scheduled but not cached in L1. The benefit of having these extra warps active is that they increase utilization when the other warps stall, and they keep more warps moving, reducing the impact of barriers (where we would otherwise have to wait for warps that have not been scheduled to catch up). On slide 51, we talked about criticality-aware scheduling, where we estimate the critical path of the computation and give more resources (scheduling slots and cache space) to the warps on the critical path.

From here we very briefly discussed two other research directions: (1) coherent CPU-GPU memory abstractions (which we deferred); and (2) synchronization support and transactional memory. The final topic we discussed was power efficiency for GPGPUs. We did this using the Warped Gates paper presentation from Micro 2013.
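Before moving to the power discussion, here is the sketch referenced above: a heavily simplified, illustrative C++ rendering of CCWS-style footprint bookkeeping. The structure names, the victim-buffer depth, and the L1 size are assumptions for the sketch, not the paper's actual structures or parameters.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

struct WarpLocality {
    std::deque<uint64_t> victim_tags;   // tags of recently evicted lines for this warp
    int est_lines_needed = 1;           // estimated cache footprint, in lines
};

constexpr int kVictimTags = 8;          // assumed victim-buffer depth per warp
constexpr int kL1Lines    = 512;        // assumed L1 capacity, in cache lines

void on_eviction(WarpLocality& w, uint64_t tag) {
    w.victim_tags.push_back(tag);
    if ((int)w.victim_tags.size() > kVictimTags) w.victim_tags.pop_front();
}

void on_miss(WarpLocality& w, uint64_t tag) {
    // A miss whose tag is still in the victim buffer means the warp lost a line
    // it was about to reuse: raise its estimated footprint.
    if (std::find(w.victim_tags.begin(), w.victim_tags.end(), tag) != w.victim_tags.end())
        ++w.est_lines_needed;
}

// Allow warps (oldest first) until their combined estimate exceeds the L1 size.
int num_schedulable(const std::vector<WarpLocality>& warps) {
    int total = 0, allowed = 0;
    for (const auto& w : warps) {
        if (total + w.est_lines_needed > kL1Lines) break;
        total += w.est_lines_needed;
        ++allowed;
    }
    return allowed;
}
```

In these terms, static wavefront limiting and DAWS effectively replace the reactive on_miss updates with a profile-derived or compiler-derived value for est_lines_needed.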
We started with a quick introduction to the basics of energy consumption on chips: (1) static or leakage power, which is consumed all the time simply because structures such as execution units or caches are kept powered so they are ready to do work when we need them; and (2) dynamic power, which is incurred when we change the state of gates, latches, and flip-flops: this power is consumed when structures are actually used, for example an execution unit performing an operation or a register file being accessed. Techniques to control leakage power are different from those for dynamic power. For leakage power, many of the architectural techniques center on turning off parts of the chip that we are not using. Two of the standard techniques are power gating (cutting off their power supply) and clock gating (preventing the clock signal from reaching them). With respect to dynamic power, techniques include lowering the voltage or frequency, or trying to consolidate operations so that they are done in fewer steps (e.g., memory coalescing also helps the dynamic power of the memory system and the caches).

This paper's goal is to reduce the leakage power of the execution units on the GPU. The authors observe that these execution units are a major component of the GPU (68% of the area) and consume a substantial amount of the chip power. The main tool they want to use is power gating the execution units (turning them off) when they are idle. However, there is a catch: there is a cost to turning off a structure and turning it on again (slide 4). As a result, if we turn off a structure but need to turn it on again quickly, we end up losing energy due to this extra overhead. This leads to the idea of a break-even idle time: the minimum idle time needed for the leakage savings to make up for the cost of turning the execution units off and back on. According to the authors, this is around 9-24 cycles in their design. From a practical perspective, we also need an algorithm to detect idleness. They simply declare an idle period after a few consecutive idle cycles (e.g., they assume a 5-cycle idle-detect threshold in the baseline design). The paper profiles different GPU benchmarks (slide 6) and finds: (1) Blue region: many idle periods are short, below the detection threshold -- a missed opportunity; (2) Red region: 10% of the idle periods are longer than the idle-detect threshold but shorter than the break-even time -- gating these loses power; and (3) Green region: we save energy on only a modest number of idle periods. The rest of the paper tries to come up with ways to shrink the red region (avoid power gating when it is not useful) and to grow the green region (increase the opportunity for useful power gating). The first technique (GATES) was to change the scheduler to group together instructions headed to the same execution units. This way, integer instructions, for example, happen in a burst, followed by a long idle period during which we can turn off the integer units. Using this idea, slide 11 shows that we substantially increase the green region (3x), but we also increase the bad red region by 2x! The reason is that we are now coalescing some of the shorter idle periods so that they are longer, but not long enough to exceed the break-even time. The next idea they introduce is blackout: refuse to issue instructions to an execution unit that has been turned off until its idle time has exceeded the break-even threshold; a simple sketch of how idle detection and blackout fit together follows.
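As an aside, here is a minimal C++ sketch of how an idle-detect plus blackout controller for one execution unit could be organized. This is my own illustration of the mechanism described above, not the paper's design; the cycle counts are assumptions (a 5-cycle idle detect and a break-even value inside the 9-24 cycle range quoted above), and wake-up latency is not modeled.

```cpp
struct GateController {
    enum State { ON, OFF };
    State state       = ON;
    int   idle_cycles = 0;   // consecutive cycles with no instruction issued
    int   off_cycles  = 0;   // cycles spent power gated

    static constexpr int kIdleDetect = 5;   // assumed idle-detect threshold
    static constexpr int kBreakEven  = 14;  // assumed break-even idle time

    // Blackout: once the unit is gated, refuse new work until break-even is reached.
    bool can_issue() const {
        return state == ON || off_cycles >= kBreakEven;
    }

    // Advance one cycle; issued is true if an instruction used this unit this cycle.
    void tick(bool issued) {
        if (state == ON) {
            idle_cycles = issued ? 0 : idle_cycles + 1;
            if (idle_cycles >= kIdleDetect) {   // sustained idleness: power gate
                state = OFF;
                off_cycles = 0;
            }
        } else {
            ++off_cycles;
            if (issued) {                       // woken up (only possible past break-even)
                state = ON;
                idle_cycles = 0;
            }
        }
    }
};
```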
Blackout pushes the red region into the green region, but it potentially increases execution time, since some instructions have to wait for the execution unit to be turned back on. The authors argue that this is not always a loss, since other warps may have ready instructions that can use the active execution units in the meantime. The third idea (coordinated blackout) appears to introduce two new elements: (1) the ability to turn off half of the execution units at a time; and (2) coordination between the GATES scheduler and the blackout controller to help fill in the blackout gaps. Despite these improvements, coordinated blackout still leads to around a 10% loss in performance (slide 17). In fairness, we should use a metric that combines performance and power to judge these ideas, since gaining energy efficiency usually requires sacrificing some performance. Interestingly, the loss is much worse for some applications than others (in fact, a few applications such as sgemm benefit). To address this issue, the authors observe that the idle-detect period plays a big role in the performance of some applications, and they introduce their final idea of adapting it (if there are too many useless wakeups, they increase it), leading to their final scheme, which they call Warped Gates; a rough sketch of this adaptive idea appears below. The leakage energy saving is about 1.5x, the performance loss is around 5%, and the complexity is low (around a 1% increase in chip area).
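To illustrate the adaptive piece, here is a rough C++ sketch of one way an adaptive idle-detect threshold could work: if too many gating events end in a wakeup before break-even, raise the threshold so we gate less aggressively; if such wakeups become rare, lower it again to recover more idle periods. The window size, thresholds, and adjustment policy are all assumptions for the sketch, not the paper's actual algorithm.

```cpp
struct AdaptiveIdleDetect {
    int idle_detect_threshold = 5;         // current idle-detect threshold (cycles)
    int early_wakeups         = 0;         // gated, but needed again before break-even
    int total_gatings         = 0;

    static constexpr int kWindow    = 64;  // assumed sampling window, in gating events
    static constexpr int kMinThresh = 5;
    static constexpr int kMaxThresh = 32;

    // Called once per power-gating event, after the unit wakes back up.
    void record_gating_event(bool woke_before_break_even) {
        ++total_gatings;
        if (woke_before_break_even) ++early_wakeups;
        if (total_gatings < kWindow) return;

        if (early_wakeups > kWindow / 4) {
            // Too many useless wakeups: gate less aggressively.
            idle_detect_threshold *= 2;
            if (idle_detect_threshold > kMaxThresh) idle_detect_threshold = kMaxThresh;
        } else if (early_wakeups < kWindow / 16) {
            // Gating is almost always paying off: recover more idle periods.
            idle_detect_threshold /= 2;
            if (idle_detect_threshold < kMinThresh) idle_detect_threshold = kMinThresh;
        }
        total_gatings = early_wakeups = 0;
    }
};
```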