Warp Processors
Warp Processing Press Release (Oct 2007)
Warp processors dynamically optimize their software to improve
execution time and energy consumption. By performing optimizations at
runtime, Warp processors have the advantage of eliminating the tool flow
restrictions and extra designer effort associated with traditional
compile-time optimizations. In addition, Warp processors greatly
improve upon previous dynamic optimization approaches, such as BOA and
Dynamo. Those approaches rely on dynamic software optimizations and
generally achieve speedups ranging from 1.1 to 1.3. By performing
hardware/software partitioning at runtime, warp processors
currently achieve speedups averaging 7.4 and energy savings of up to
94%. In the near future, we expect warp processors to achieve speedups
much greater than an order of magnitude.
The functionality of a warp processor is illustrated in the figure shown above.
Initially, software executes on the microprocessor. As the software
executes, a profiler monitors the software to determine regions of code
responsible for a large percentage of execution time (which we call critical
regions). Once the profiler has identified critical regions, the
warp processor will partition the critical regions to hardware. Hardware
for the critical regions is synthesized in the DPM (dynamic partitioning
module). The DPM then programs the configurable logic to implement the
synthesized hardware. The DPM also updates the software binary so that the
hardware will be used during execution. Finally, the partitioned application
begins executing much faster while consuming less energy.
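As a rough illustration of the profiling step described above, the following Python sketch selects critical regions from per-region execution counts. The region names, the cycle counts, and the 90% coverage threshold are hypothetical; the selection heuristics actually used by the warp processor are not specified here.

```python
def select_critical_regions(cycles_by_region, coverage=0.9):
    """Return the hottest regions that together account for at least
    `coverage` of total execution time (hypothetical threshold)."""
    total = sum(cycles_by_region.values())
    selected, covered = [], 0
    # Visit regions from most to least time-consuming.
    for region, cycles in sorted(cycles_by_region.items(),
                                 key=lambda item: item[1], reverse=True):
        if covered >= coverage * total:
            break
        selected.append(region)
        covered += cycles
    return selected


# Hypothetical profile: cycles spent in each region of a program.
profile = {"fir_loop": 7000, "crc_loop": 2200, "init": 500, "io": 300}
critical = select_critical_regions(profile)
print(critical)
```

In this made-up profile, two small loops cover more than 90% of the cycles, which is the kind of skew that makes partitioning only a few regions worthwhile.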
Warp Tools
Hardware/software partitioning tools typically execute on powerful
workstations with gigabytes of memory and extremely fast processors.
Warp processors execute these same tools in an on-chip environment.
To make on-chip execution possible, Warp processors have specialized
tools that target the regions most commonly implemented in hardware,
generally small, frequent loops. These specialized tools are designed
to be very lean, requiring orders of magnitude less memory and
execution time.
The tool flow implemented by Warp processors is shown in the adjacent
figure. Initially, partitioning selects the critical regions identified by
the on-chip profiler that are appropriate for hardware.
Next, decompilation recovers high-level constructs (loops, arrays,
etc.) to create a representation of the code that is more suitable for
synthesis. The decompiled representation is then passed to the
register-transfer synthesis tool that creates a standard hardware binary.
Next, JIT FPGA compilation converts the standard binary into a binary for the
specialized WCLA (Warp Configurable Logic Architecture). During JIT FPGA
compilation, logic synthesis performs logic optimizations to reduce the number
of gates required by the hardware. Technology mapping then maps the
gate-level netlist onto the configurable logic in the WCLA.
Place and route determines
all connections in the WCLA and then outputs a bitstream that programs
the WCLA. The binary updater modifies the software binary by
replacing the original software loops with hardware initialization and
communication code.
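The binary update step can be pictured with a toy model. The sketch below treats a program as a list of instruction strings and replaces a loop with stub instructions. The mnemonics (`HW_INIT`, `HW_START`, `HW_WAIT`), the NOP padding, and the list-based program layout are all invented for illustration and do not describe the real binary format or the actual updater.

```python
def patch_loop(program, loop_start, loop_end, stub):
    """Replace program[loop_start..loop_end] with `stub`, padding with
    NOPs so the positions of the following instructions do not move."""
    loop_len = loop_end - loop_start + 1
    if len(stub) > loop_len:
        raise ValueError("stub does not fit in the space of the loop")
    padding = ["NOP"] * (loop_len - len(stub))
    return program[:loop_start] + stub + padding + program[loop_end + 1:]


# Hypothetical program with a three-instruction loop at indices 1..3.
program = [
    "MOV r0, #0",
    "LOOP: ADD r0, r0, r1",
    "SUBS r2, r2, #1",
    "BNE LOOP",
    "STR r0, [r3]",
]
patched = patch_loop(program, 1, 3, ["HW_INIT", "HW_START", "HW_WAIT"])
print(patched)
```

Keeping the patched program the same length is a simplification chosen here so that no other branch targets need to be recomputed.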
Warp Architecture
The Warp processor architecture consists of several components: a main
microprocessor, an on-chip profiler, a dynamic partitioning module (DPM), and
a specialized warp configurable logic architecture (WCLA).
The main microprocessor executes the software partition.
The profiler monitors instruction fetches to
determine the most frequently executed regions of the software.
The DPM is responsible for
executing all CAD tools described earlier. The DPM consists of an
additional microprocessor and a small amount of memory for executing
the CAD tools. The WCLA (Warp Configurable Logic Architecture) is a
specialized configurable logic fabric that allows for very
efficient place and route operations.
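One simple way to find frequently executed regions from an instruction-fetch stream is to count jumps to a lower address, since such a jump usually marks the end of a loop iteration. The Python sketch below models that idea in software. The address trace and the backward-jump heuristic are assumptions made for illustration, not a description of the actual profiler hardware.

```python
from collections import Counter


def count_loop_iterations(fetch_addresses):
    """Count jumps to a lower address, keyed by (target, source) pair."""
    loops = Counter()
    prev = None
    for addr in fetch_addresses:
        # A fetch at a lower address than the previous one is treated
        # as a backward branch closing a loop iteration.
        if prev is not None and addr < prev:
            loops[(addr, prev)] += 1
        prev = addr
    return loops


# Hypothetical trace: a two-instruction loop body executed four times.
trace = [0x100] + [0x104, 0x108] * 4 + [0x10C]
hot = count_loop_iterations(trace).most_common(1)
print(hot)
```

A hardware profiler would keep only a small, fixed number of counters rather than an unbounded table; the `Counter` here is a convenience of the software model.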
Results
The charts shown below illustrate the speedup
achievable by Warp processors. The first chart shows the experimental setup.
For these experiments, the main microprocessor consisted of an ARM7 running at
100 MHz. The DPM used an additional ARM7 to execute the CAD tools, which
executed in under two seconds.
The second chart shows the speedups of
the single most frequent region when implemented in hardware compared to the
software-only execution of the same region. The final chart shows overall
application speedup, averaging 7.4, after warp processing has implemented multiple
critical regions in hardware. The energy savings for the same experiments
ranged from 38% to 94%.
The benchmarks used in the experiments were selected from PowerStone,
EEMBC, NetBench, and our own benchmark suite.
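The gap between single-region speedup and overall application speedup follows Amdahl's law: the part of the program left in software limits the total gain. The sketch below computes overall speedup for one accelerated region. The 90% time fraction and 20x region speedup are hypothetical inputs, not measurements from these experiments.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: `fraction` of the original execution time is sped up
    by `kernel_speedup`; the remainder runs at its original speed."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Hypothetical region: 90% of execution time, 20x faster in hardware.
example = overall_speedup(0.9, 20)
print(round(example, 2))
```

With these inputs the overall speedup is about 6.9, and no region speedup, however large, could push it past 10, because 10% of the original time remains in software.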
Experimental Setup
Single Kernel Speedup
Overall Speedup
Thread Warping Speedup
The following shows results of executing highly parallelizable benchmarks using warp processing (with each entire thread mapped to FPGAs) versus execution on 4-, 8-, 16-, 32-, and 64-microprocessor (uP) systems. Even compared to 64-processor systems with optimistic communication assumptions, warp processing of threads still achieves huge speedups.
People
Miscellaneous Presentations
-
Self-Improving Computing Chips -- Warp Processing, UC Riverside CS&E Colloquium, Oct 2007.
PPT
-
Thread Warping -- Int. Conf. on Hw/Sw Codesign and System Synthesis (Austria), Oct. 2007.
PPT
-
Warp Processor: A Dynamically Reconfigurable Coprocessor -- Talk at Intel's System Design Symposium (San Jose), Nov. 2005
PPT
-
Warp Processors -- Talk at ASU, April 2004
PPT
-
Warp Processors -- Talk at IBM Research, Yorktown Heights, Apr 2004
PPT
Publications
-
Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators
PDF
G. Stitt and F. Vahid.
Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2007.
-
Binary Synthesis.
G. Stitt and F. Vahid.
ACM Transactions on Design Automation of Electronic Systems (TODAES),
2007 (to appear).
-
A Code Refinement Methodology for Performance-Improved Synthesis from C
PDF
G. Stitt, F. Vahid, W. Najjar
IEEE/ACM International Conference on Computer-Aided Design (ICCAD),
Nov. 2006, pp. 716-723.
-
Warp Processors
PDF
R. Lysecky, G. Stitt, F. Vahid
ACM Transactions on Design Automation of Electronic Systems (TODAES),
July 2006, pp. 659-681.
-
New Decompilation Techniques for Binary-level Co-processor Generation
PDF
G. Stitt, F. Vahid
IEEE/ACM International Conference on Computer-Aided Design (ICCAD),
Nov. 2005, pp. 547-554.
-
Hardware/Software Partitioning of Software Binaries: A
Case Study of H.264 Decode
PDF
G. Stitt, F. Vahid, G. McGregor, B. Einloth
International Conference on Hardware/Software Codesign and System Synthesis
(CODES/ISSS), Sep. 2005, pp. 285-290.
Shows that binary-level partitioning and synthesis of a real
highly optimized H.264 video decoder application is competitive with
source (C) level partitioning/synthesis. Also introduces several
simple C coding guidelines that greatly improve synthesis results.
-
Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware
PDF
A. Gordon-Ross and F. Vahid
IEEE Transactions on Computers, Special Issue-Embedded Systems,
Microarchitecture, and Compilation Techniques in Memory of B.
Ramakrishna (Bob) Rau, Oct. 2005, Vol. 54, Issue 10, pp. 1203-1215.
Describes extensive studies resulting in lean profiler hardware
that effectively finds addresses corresponding to frequent loops
in an executing software binary.
-
A Decompilation Approach to Partitioning Software for
Microprocessor/FPGA Platforms
PDF
G. Stitt and F. Vahid
Design Automation and Test in Europe (DATE), March 2005, pp. 396-397
Advanced decompilation techniques enable synthesis of hardware from
binaries by recovering nearly all high-level constructs that existed
in the source code, even across different compiler optimization
levels.
-
A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA
Compilation
PDF
R. Lysecky, F. Vahid and S. Tan
IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005
Describes an FPGA routing approach that is lean in both runtime and memory: it runs three times faster and uses over 15 times less memory than a popular router, while producing a critical path that is only 30% longer on average and about equal for very large circuits. Our approach, ROCR (Riverside On-Chip Router), can be useful for methods requiring just-in-time FPGA compilation, like our warp processing method, and for future methods using a standard FPGA binary.
-
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning
PDF
R. Lysecky and F. Vahid
Design Automation and Test in Europe (DATE), March 2005
Highlights speedup and energy results of implementing warp processing, which dynamically and transparently remaps software kernels to FPGA using on-chip synthesis tools, for software running on a Xilinx MicroBlaze soft-core processor. Results show competitive performance and energy compared to software on regular "hard core" embedded microprocessors, thus making soft-cores on FPGA even more attractive beyond just their flexibility of putting different numbers of cores and custom circuitry on a single chip.
-
Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure
PDF
G. Stitt, Z. Guo, F. Vahid, and W. Najjar
ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005.
Advanced decompilation methods can make synthesizing FPGA hardware from software binaries competitive with synthesizing directly from C-level source code, even when utilizing an advanced memory structure (smart buffer) requiring knowledge of loops and arrays. Synthesis from binaries provides numerous advantages of language independence, tool independence, portability, and support of legacy code.
-
Dynamic FPGA Routing for Just-in-Time Compilation
PDF
R. Lysecky, F. Vahid, S. Tan
IEEE/ACM Design Automation Conference (DAC), June 2004.
Describes an FPGA routing heuristic for execution on-chip, to support
Just-in-Time compilation for FPGAs.
-
A Configurable Logic Architecture for Dynamic Hardware/Software
Partitioning
PDF
R. Lysecky, F. Vahid
Design Automation and Test in Europe Conference (DATE), February 2004.
Describes a simple configurable logic (FPGA) fabric and surrounding architecture specifically intended to support dynamic hardware/software partitioning -- meaning on-chip CAD tools must be able to quickly map a netlist to the fabric.
-
Energy Savings and Speedups from Partitioning Critical Software Loops to
Hardware in Embedded Systems
PDF
G. Stitt, F. Vahid, S. Nematbakhsh
ACM Transactions on Embedded Computing Systems (TECS), January 2004.
Partitioning a program's kernels to FPGA hardware can reduce overall system energy.
-
Dynamic Hardware/Software Partitioning: A First Approach
PDF
PPT
G. Stitt, R. Lysecky, F. Vahid
Design Automation Conference (DAC), 2003, pp. 250-255.
Dynamically partitioning an executing software application onto on-chip FPGA is not only possible, but quite effective.
-
A Codesigned On-Chip Logic Minimizer
PDF
R. Lysecky, F. Vahid
IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), October 2003.
Hardware/software partitioning of an on-chip logic minimizer results in 8x speedup and 60% energy savings, improving the usefulness of on-chip logic minimization in a variety of applications.
-
On-Chip Logic Minimization
PDF
R. Lysecky, F. Vahid
Design Automation Conference (DAC), 2003.
Executing a lean form of logic minimization on-chip is feasible and has several immediate applications in networking.
-
Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware
PDF
A. Gordon-Ross, F. Vahid
ACM/IEEE Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES) 2003.
Describes extensive studies resulting in lean profiler hardware that effectively finds addresses corresponding to frequent loops in an executing software binary.
-
The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic
PDF
G. Stitt, F. Vahid
IEEE Design and Test of Computers, November/December 2002, pp. 36-43.
Partitioning critical software kernels to on-chip FPGA improves energy consumption in addition to performance.
-
Hardware/Software Partitioning of Software Binaries
PDF
G. Stitt and F. Vahid
IEEE/ACM International Conference on Computer Aided Design,
November 2002, pp. 164-170.
Performing hw/sw partitioning on software binaries can achieve results similar to a compiler-based approach without imposing restrictions on the high-level language or compiler. Binary-level partitioning also supports partitioning of library code, legacy code, and hand-optimized assembly.
-
Binary-Level Hardware/Software Partitioning of MediaBench, NetBench, and EEMBC Benchmarks
PDF
G. Stitt and F. Vahid
Technical Report UCR-CSE-03-01. January 2003.
Binary-level hw/sw partitioning achieves similar speedups compared to
a compiler-based approach for standard benchmarks from MediaBench, NetBench, and EEMBC.