1 June 2009 ACM/SIGDA E-NEWSLETTER Vol. 39, No. 6
Online archive: http://www.sigda.org/newsletter
===============================================================================
...
===============================================================================

What is Hardware/Software Partitioning?
---------------------------------------
Frank Vahid
Professor, Dept. of Computer Science and Engineering,
University of California, Riverside, http://www.cs.ucr.edu/~vahid

Hardware/software partitioning is the problem of dividing an application's
computations into a part that executes as sequential instructions on a
microprocessor (the "software") and a part that runs as parallel circuits on
some IC fabric like an ASIC or FPGA (the "hardware"), so as to achieve design
goals set for metrics like performance, power, size, and cost. The circuit
part commonly acts as a coprocessor for the microprocessor. For example, a
video compression application may be partitioned such that most of the
frame-handling computations execute on a microprocessor, while the
compute-intensive DCT (discrete cosine transform) part of the application is
offloaded to execute on a fast DCT coprocessor circuit.

Circuits can execute some computations thousands of times faster than
sequential instructions, due largely to their parallel execution. For
example, if a computation consists of 100 multiplications of independent
data items, a microprocessor must execute the multiplications one (or a few)
at a time, requiring hundreds of clock cycles, whereas a circuit could
potentially (subject to data availability) execute all 100 multiplications
in parallel using 100 multipliers, requiring just one or a few clock cycles
(a small C sketch of such a computation appears below). Energy reductions
can also result.

While designers have necessarily had to perform such partitioning since the
early days of computing, the field of "automated hardware/software
partitioning" came into being in the early 1990s, soon after the advent of
high-level synthesis (HLS) in the late 1980s. HLS automatically converted
programs into circuits, complementing earlier-developed compilers that
converted programs into sequential machine instructions. With both HLS and
compilers, the idea of writing a single program (or "executable
specification") from which both instructions and circuits could be
*automatically* generated -- thus replacing human-performed partitioning
from non-executable specifications -- became conceivable, and the field was
born [1][2][3].

Early research could be found in the International Symposium on System
Synthesis (ISSS), which grew out of the High-Level Synthesis Workshop in the
1990s, and in the International Workshop on Hardware/Software Codesign
(CODES), which also started in the early 1990s. The two forums merged in the
2000s to form CODES/ISSS, now part of Embedded Systems Week (www.esweek.org).
Extensive work can also be found in DAC (www.dac.com), ICCAD (www.iccad.com),
and DATE (www.date-conference.com).

Hardware/software partitioning encompasses not only the problem of
partitioning the computations per se, but also profiling, developing target
architectures with efficient communication, co-simulating the microprocessor
and circuit parts, co-debugging, creating specification languages suitable
for both parts, and more [4][5].
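To make the parallelism argument above concrete, the small C sketch below
shows the kind of region a partitioning tool might offload: a loop of 100
independent multiplications. The function and variable names are invented
purely for illustration and do not come from any particular tool flow; the
point is simply that the loop runs as sequential instructions in software,
while an HLS tool handed the same region could implement it with many
multipliers operating in parallel.

  #include <stdio.h>

  #define N 100

  /* Candidate region for offloading to a coprocessor circuit: N
   * multiplications of independent data items.  A microprocessor executes
   * the loop one iteration (or a few) at a time, costing on the order of
   * hundreds of cycles; a circuit generated by HLS could instead
   * instantiate up to N multipliers and, data availability permitting,
   * produce all products in roughly one to a few cycles. */
  static void multiply_block(const int a[N], const int b[N], int product[N])
  {
      for (int i = 0; i < N; i++)
          product[i] = a[i] * b[i];   /* iterations are independent */
  }

  int main(void)
  {
      int a[N], b[N], product[N];
      for (int i = 0; i < N; i++) { a[i] = i; b[i] = i + 1; }

      /* In software, this call runs as sequential instructions; after
       * hardware/software partitioning, the same call could instead
       * invoke a coprocessor built from parallel multipliers. */
      multiply_block(a, b, product);

      printf("product[99] = %d\n", product[N - 1]);   /* 99 * 100 = 9900 */
      return 0;
  }

The rest of the application -- for instance, the frame handling in the video
compression example -- would remain as ordinary software and simply invoke
the offloaded routine through whatever communication mechanism the target
architecture provides, which is the subject of the next paragraphs.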
Various target architectures provide the means for the microprocessor and
coprocessors to communicate data. Loosely coupled coprocessors may
communicate with the microprocessor via a shared memory and DMA on a
peripheral bus. More tightly coupled coprocessors may have more equal access
to a shared memory and the system bus. The coprocessor circuit can even be
integrated directly into a microprocessor's datapath with access to the
register file, in which case the circuit is viewed more as an extended part
of the datapath, and the processor is known as an extended or
application-specific instruction-set processor (ASIP), with the partitioning
occurring during compilation. At the other extreme, partitioning can
generate processing circuits that execute as peers to microprocessors rather
than as coprocessors, and/or that co-exist not just with a single
microprocessor but with many microprocessors, thus generating a custom
multiprocessing architecture as part of an automated "system synthesis"
process. Different target architectures require tools that partition at
different levels of granularity, including the instruction level (for the
ASIP scenario), the loop or subroutine level (for coprocessor scenarios),
and even the process level (for system synthesis scenarios).

Today, the advent of FPGAs has made hardware/software partitioning even more
relevant and important in the field of computing. FPGAs implement circuits
via software: a synthesis tool generates a software bitstream, just as a
compiler generates a software binary, and that bitstream is then downloaded
into the FPGA's program memory, just as a binary is downloaded into a
microprocessor's program memory. FPGAs have begun to co-exist with
microprocessors for computing purposes (not just glue-logic purposes) on
nearly every type of compute platform, including desktop computers (e.g.,
Intel's QuickAssist and AMD's Opteron systems), supercomputers (e.g., SGI's
Altix), and even mobile and handheld devices [6]. As such, hardware/software
partitioning has also been addressed heavily by the FPGA reconfigurable
computing community, via development of new compilers that target FPGAs and
microprocessor/FPGA combinations. Key hurdles to speedups via FPGAs, such as
the memory bottleneck problem, have been addressed by new techniques, such
as "smart buffers" [7] that actively fetch and maintain the data needed by
coprocessors.

In fact, in light of FPGAs implementing circuits as software, the term
"hardware/software partitioning" is today a misnomer [8].
"Instruction/circuit partitioning" might be more apt, with the output of
partitioning potentially being all software, destined for microprocessors
and FPGAs. Ultimately, partitioning may come to be considered just a step
within compilation, along with existing steps like parsing and code
generation. Furthermore, just as instruction software today is commonly
translated just-in-time (JIT) by computing platforms from one instruction
set to another (e.g., Java bytecode JIT compiled to a native instruction
set, or x86 code JIT compiled to a VLIW instruction set), one can
conceivably JIT-partition instruction software into circuit software, a
process known as "warp processing" [9]. JIT partitioning to FPGAs is more
involved than JIT compiling to microprocessors, but continued improvements
in compute platforms, coupled with a new focus on synthesis from binaries
and on fast synthesis and place-and-route techniques for FPGAs, can make
such partitioning feasible, just as JIT compilation was once considered too
time-consuming but eventually became feasible.

JIT partitioning to FPGAs introduces new problems whose solution requires a
combination of techniques from synthesis, online algorithms, reconfigurable
computing, and architecture. For example, given a dynamically determined set
of applications competing for limited FPGA space, decisions must be made as
to which application may use the FPGA for coprocessing, which of the
application's coprocessors should be mapped to the FPGA, where they should
be placed within the FPGA, which existing coprocessors they should replace,
whether dynamic reconfiguration should be used to logically increase the
FPGA size, and much more.
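As a purely illustrative sketch of the flavor of such online decisions -- not
the algorithm of any particular warp-processing system -- the C fragment
below greedily selects which candidate coprocessors to load into a limited
amount of free FPGA area, ranking candidates by estimated speedup per unit
area. The candidate names, the speedup and area numbers, and the greedy
policy itself are all assumptions made up for this example.

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical candidate coprocessor: estimated speedup if mapped to the
   * FPGA, and estimated area it would occupy (e.g., in logic blocks). */
  typedef struct {
      const char *name;
      double speedup;
      int area;
  } Candidate;

  /* Sort in descending order of speedup per unit area. */
  static int by_benefit_density(const void *pa, const void *pb)
  {
      const Candidate *a = pa, *b = pb;
      double da = a->speedup / a->area, db = b->speedup / b->area;
      return (db > da) - (db < da);
  }

  int main(void)
  {
      Candidate cand[] = {
          { "dct",         8.0, 400 },
          { "motion_est", 12.0, 900 },
          { "huffman",     2.5, 150 },
      };
      int n = (int)(sizeof cand / sizeof cand[0]);
      int free_area = 1000;   /* FPGA area currently available */

      qsort(cand, n, sizeof cand[0], by_benefit_density);

      /* Greedily map the most "profitable" coprocessors that still fit. */
      for (int i = 0; i < n; i++) {
          if (cand[i].area <= free_area) {
              free_area -= cand[i].area;
              printf("map %s (speedup %.1f, area %d)\n",
                     cand[i].name, cand[i].speedup, cand[i].area);
          }
      }
      printf("unused FPGA area: %d\n", free_area);
      return 0;
  }

A real JIT partitioner must of course go well beyond such a sketch,
accounting for placement within the FPGA, replacement of already-loaded
coprocessors, reconfiguration time, and the other issues listed above.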
[1] F. Vahid and D. Gajski. Specification Partitioning for System Design.
    Design Automation Conference, June 1992.
[2] R. Gupta and G. De Micheli. Hardware-Software Cosynthesis for Digital
    Systems. IEEE Design and Test, Sep. 1993.
[3] J. Henkel and R. Ernst. Hardware-Software Cosynthesis for
    Microcontrollers. IEEE Design and Test, Sep. 1993.
[4] W. Wolf. A Decade of Hardware/Software Codesign. IEEE Computer, April
    2003.
[5] F. Balarin, et al. Hardware-Software Co-Design of Embedded Systems: The
    POLIS Approach. Kluwer, 1997.
[6] M. LaPedus. PLDs Jockey to Set New Lows in Cost, Power Budgets. EE Times,
    June 2008,
    http://www.eetimes.com/news/semi/showArticle.jhtml?articleID=208401267
[7] Z. Guo, B. Buyukkurt, and W. Najjar. Input Data Reuse in Compiling Window
    Operations onto Reconfigurable Hardware. ACM SIGPLAN Notices, 2004.
[8] F. Vahid. It's Time to Stop Calling Circuits Hardware. IEEE Computer,
    Sep. 2007.
[9] F. Vahid, G. Stitt, and R. Lysecky. Warp Processing: Dynamic Translation
    of Binaries to FPGA Circuits. IEEE Computer, July 2008.