1 June 2009 ACM/SIGDA E-NEWSLETTER Vol. 39, No. 6
Online archive: http://www.sigda.org/newsletter
===============================================================================
...
===============================================================================

What is Hardware/Software Partitioning?
---------------------------------------
Frank Vahid
Professor, Dept. of Computer Science and Engineering,
University of California, Riverside, http://www.cs.ucr.edu/~vahid

Hardware/software partitioning is the problem of dividing an application's
computations into a part that executes as sequential instructions on a
microprocessor (the "software") and a part that runs as parallel circuits on
some IC fabric like an ASIC or FPGA (the "hardware"), so as to achieve design
goals set for metrics like performance, power, size, and cost. The circuit
part commonly acts as a coprocessor for the microprocessor. For example, a
video compression application may be partitioned such that most of the
frame-handling computations execute on a microprocessor, while the
compute-intensive DCT (discrete cosine transform) part of the application is
offloaded to execute on a fast DCT coprocessor circuit.

Circuits can execute some computations thousands of times faster than
sequential instructions, due largely to their parallel execution. For
example, if a computation consists of 100 multiplications of independent
data items, a microprocessor must execute the multiplications one (or a few)
at a time, requiring hundreds of clock cycles, whereas a circuit could
potentially (subject to data availability) execute all 100 multiplications
in parallel using 100 multipliers, requiring just one or a few clock cycles
(a small C sketch of such a computation appears below). Energy reductions
can also result.

While designers have necessarily had to perform such partitioning since the
early days of computing, the field of "automated hardware/software
partitioning" came into being in the early 1990s, soon after the advent of
high-level synthesis (HLS) in the late 1980s. HLS automatically converted
programs into circuits, complementing earlier-developed compilers that
converted programs into sequential machine instructions. With both HLS and
compilers, the idea of writing a single program (or "executable
specification") from which both instructions and circuits could be
*automatically* generated -- thus replacing human-performed partitioning
from non-executable specifications -- became conceivable, and the field was
born [1][2][3].

Early research could be found in the International Symposium on System
Synthesis (ISSS), which grew out of the High-Level Synthesis Workshop in the
1990s, and in the International Workshop on Hardware/Software Codesign
(CODES), which also started in the early 1990s. The two forums merged in the
2000s to form CODES/ISSS, now part of Embedded Systems Week (www.esweek.org).
Extensive work can also be found in DAC (www.dac.com), ICCAD (www.iccad.com),
and DATE (www.date-conference.com).

Hardware/software partitioning encompasses not only the problem of
partitioning the computations per se, but also profiling, developing target
architectures with efficient communication, co-simulating the microprocessor
and circuit parts, co-debugging, creating specification languages suitable
for both parts, and more [4][5].
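To make the parallelism argument above concrete, the small C sketch below
shows the kind of region a partitioning tool might offload: a loop of 100
independent multiplications. The function and variable names are invented
purely for illustration and do not come from any particular tool flow; the
point is simply that the loop runs as sequential instructions in software,
while an HLS tool handed the same region could implement it with many
multipliers operating in parallel.

  #include <stdio.h>

  #define N 100

  /* Candidate region for offloading to a coprocessor circuit: N
   * multiplications of independent data items.  A microprocessor executes
   * the loop one iteration (or a few) at a time, costing on the order of
   * hundreds of cycles; a circuit generated by HLS could instead
   * instantiate up to N multipliers and, data availability permitting,
   * produce all products in roughly one to a few cycles. */
  static void multiply_block(const int a[N], const int b[N], int product[N])
  {
      for (int i = 0; i < N; i++)
          product[i] = a[i] * b[i];   /* iterations are independent */
  }

  int main(void)
  {
      int a[N], b[N], product[N];
      for (int i = 0; i < N; i++) { a[i] = i; b[i] = i + 1; }

      /* In software, this call runs as sequential instructions; after
       * hardware/software partitioning, the same call could instead
       * invoke a coprocessor built from parallel multipliers. */
      multiply_block(a, b, product);

      printf("product[99] = %d\n", product[N - 1]);   /* 99 * 100 = 9900 */
      return 0;
  }

The rest of the application -- for instance, the frame handling in the video
compression example -- would remain as ordinary software and simply invoke
the offloaded routine through whatever communication mechanism the target
architecture provides, which is the subject of the next paragraphs.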
Various target architectures provide the means for the microprocessor and
coprocessors to communicate data. Loosely coupled coprocessors may
communicate with the microprocessor via a shared memory and DMA on a
peripheral bus. More tightly coupled coprocessors may have more equal access
to a shared memory and the system bus. The coprocessor circuit can even be
integrated directly into a microprocessor's datapath with access to the
register file, in which case the circuit is viewed more as an extended part
of the datapath, and the processor is known as an extended or
application-specific instruction-set processor (ASIP), with the partitioning
occurring during compilation. At the other extreme, partitioning can
generate processing circuits that execute as peers to microprocessors rather
than as coprocessors, and/or that co-exist not just with a single
microprocessor but with many microprocessors, thus generating a custom
multiprocessing architecture as part of an automated "system synthesis"
process. Different target architectures require tools that partition at
different levels of granularity, including the instruction level (for the
ASIP scenario), the loop or subroutine level (for coprocessor scenarios),
and even the process level (for system synthesis scenarios).

Today, the advent of FPGAs has made hardware/software partitioning even more
relevant and important in the field of computing. FPGAs implement circuits
via software: a synthesis tool generates a software bitstream, just as a
compiler generates a software binary, and that bitstream is then downloaded
into the FPGA's program memory, just as a binary is downloaded into a
microprocessor's program memory. FPGAs have begun to co-exist with
microprocessors for computing purposes (not just glue-logic purposes) on
nearly every type of compute platform, including desktop computers (e.g.,
Intel's QuickAssist and AMD's Opteron systems), supercomputers (e.g., SGI's
Altix), and even mobile and handheld devices [6]. As such, hardware/software
partitioning has also been addressed heavily by the FPGA reconfigurable
computing community, via development of new compilers that target FPGAs and
microprocessor/FPGA combinations. Key hurdles to speedups via FPGAs, such as
the memory bottleneck problem, have been addressed by new techniques, such
as "smart buffers" [7] that actively fetch and maintain the data needed by
coprocessors.

In fact, in light of FPGAs implementing circuits as software, the term
"hardware/software partitioning" is today a misnomer [8].
"Instruction/circuit partitioning" might be more apt, with the output of
partitioning potentially being all software, destined for microprocessors
and FPGAs. Ultimately, partitioning may come to be considered just a step
within compilation, along with existing steps like parsing and code
generation. Furthermore, just as instruction software today is commonly
translated just-in-time (JIT) by computing platforms from one instruction
set to another (e.g., Java bytecode JIT compiled to a native instruction
set, or x86 code JIT compiled to a VLIW instruction set), one can
conceivably JIT-partition instruction software into circuit software, a
process known as "warp processing" [9]. JIT partitioning to FPGAs is more
involved than JIT compiling to microprocessors, but continued improvements
in compute platforms, coupled with a new focus on synthesis from binaries
and on fast synthesis and place-and-route techniques for FPGAs, can make
such partitioning feasible, just as JIT compilation was once considered too
time-consuming but eventually became feasible.

JIT partitioning to FPGAs introduces new problems whose solution requires a
combination of techniques from synthesis, online algorithms, reconfigurable
computing, and architecture. For example, given a dynamically determined set
of applications competing for limited FPGA space, decisions must be made as
to which application may use the FPGA for coprocessing, which of the
application's coprocessors should be mapped to the FPGA, where they should
be placed within the FPGA, which existing coprocessors they should replace,
whether dynamic reconfiguration should be used to logically increase the
FPGA size, and much more.
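As a purely illustrative sketch of the flavor of such online decisions -- not
the algorithm of any particular warp-processing system -- the C fragment
below greedily selects which candidate coprocessors to load into a limited
amount of free FPGA area, ranking candidates by estimated speedup per unit
area. The candidate names, the speedup and area numbers, and the greedy
policy itself are all assumptions made up for this example.

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical candidate coprocessor: estimated speedup if mapped to the
   * FPGA, and estimated area it would occupy (e.g., in logic blocks). */
  typedef struct {
      const char *name;
      double speedup;
      int area;
  } Candidate;

  /* Sort in descending order of speedup per unit area. */
  static int by_benefit_density(const void *pa, const void *pb)
  {
      const Candidate *a = pa, *b = pb;
      double da = a->speedup / a->area, db = b->speedup / b->area;
      return (db > da) - (db < da);
  }

  int main(void)
  {
      Candidate cand[] = {
          { "dct",         8.0, 400 },
          { "motion_est", 12.0, 900 },
          { "huffman",     2.5, 150 },
      };
      int n = (int)(sizeof cand / sizeof cand[0]);
      int free_area = 1000;   /* FPGA area currently available */

      qsort(cand, n, sizeof cand[0], by_benefit_density);

      /* Greedily map the most "profitable" coprocessors that still fit. */
      for (int i = 0; i < n; i++) {
          if (cand[i].area <= free_area) {
              free_area -= cand[i].area;
              printf("map %s (speedup %.1f, area %d)\n",
                     cand[i].name, cand[i].speedup, cand[i].area);
          }
      }
      printf("unused FPGA area: %d\n", free_area);
      return 0;
  }

A real JIT partitioner must of course go well beyond such a sketch,
accounting for placement within the FPGA, replacement of already-loaded
coprocessors, reconfiguration time, and the other issues listed above.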
[1] F. Vahid and D. Gajski. Specification Partitioning for System Design.
    Design Automation Conference, June 1992.
[2] R. Gupta and G. De Micheli. Hardware-Software Cosynthesis for Digital
    Systems. IEEE Design and Test, Sep. 1993.
[3] J. Henkel and R. Ernst. Hardware-Software Cosynthesis for
    Microcontrollers. IEEE Design and Test, Sep. 1993.
[4] W. Wolf. A Decade of Hardware/Software Codesign. IEEE Computer, April
    2003.
[5] F. Balarin, et al. Hardware-Software Co-Design of Embedded Systems: The
    POLIS Approach. Kluwer, 1997.
[6] M. LaPedus. PLDs Jockey to Set New Lows in Cost, Power Budgets. EE Times,
    June 2008,
    http://www.eetimes.com/news/semi/showArticle.jhtml?articleID=208401267
[7] Z. Guo, B. Buyukkurt, and W. Najjar. Input Data Reuse in Compiling Window
    Operations onto Reconfigurable Hardware. ACM SIGPLAN Notices, 2004.
[8] F. Vahid. It's Time to Stop Calling Circuits Hardware. IEEE Computer,
    Sep. 2007.
[9] F. Vahid, G. Stitt, and R. Lysecky. Warp Processing: Dynamic Translation
    of Binaries to FPGA Circuits. IEEE Computer, July 2008.