Possible options for GPGPUSim Projects

Credit: Hodjat Asghari-Esfeden came up with the project ideas.

Effect of warp schedulers on collector unit occupancy: The default Fermi-like configuration in GPGPU-Sim supports two different types of warp schedulers:

GTO: Greedy Then Oldest

LRR: Loose Round Robin

The default configuration has 16 collector units per SM. However, the number of active collector units may differ depending on which warp scheduler is used. The goal of this project is to profile the number of active collector units under each warp scheduler.

Changing the Thread Block Scheduler: Currently, thread blocks are assigned to SMs in round-robin order. We want to change the thread block scheduler so that it assigns as many thread blocks as possible to the same SM before moving on to the next SM. Report the performance impact, as well as the number of active SMs (SMs that receive at least one thread block), in both cases.

Idleness in computational cores: This project investigates how well the computational cores are utilized during the execution of different applications. In current Nvidia GPU architectures, each back-end pipeline has a dedicated set of execution units (for example, Fermi has two SP functional units, one SFU, and one Mem unit). We want to measure how heavily each of these units is utilized during kernel execution.

Find the importance of different pipeline stages to overall performance: Currently, GPGPU-Sim models the GPU pipeline in six stages: fetch, decode, issue, operand collection, execute, and write-back. We will look at these stages and measure the instruction execution breakdown, i.e., what fraction of an instruction's total execution time is spent being fetched, decoded, issued, and so on. The goal of this project is to quantify the contribution of each stage to instruction execution time.

Impact of Instruction Buffer size on overall performance: The current version of GPGPU-Sim (version 3.2.x) implements the instruction buffer as a structure that reserves two entries per warp. For the Fermi configuration, which allows up to 48 in-flight warps, the instruction buffer is a table with 48 rows (one row per warp), each row holding two entries (one entry per instruction). Instructions are loaded from the instruction cache once a row becomes empty. The goal of this project is to measure the impact of instruction buffer size (the number of instructions stored per warp) on performance. You are asked to widen the instruction buffer so that it allocates more slots per warp (more than two instructions, say four). Report the performance impact.

Your report should detail your idea, present your design (if any) and implementation details, summarize your results, and present conclusions. The project will be graded based on a one-on-one demo for the instructor, as well as the proposal and report. The project is due in the middle of the final week, when the grading interviews will start. If you need to be graded earlier (for example, due to travel plans), I can work with your schedule.