|
|
|
|
|
The use of embedded systems is constantly
increasing. |
|
Part of this increase is due to the switch from
application-specific logic to application-specific code running on existing
processors. |
|
|
|
|
|
This change is driven by two distinct forces: |
|
The increasing cost of setting up and
maintaining a fabrication line |
|
Increased pressures to reduce the time to market
and maintain a reliable schedule |
|
This is why we are seeing the basic unit of
computation change from the logic gate to an instruction running on an
embedded processor. |
|
|
|
|
Power constraints form an important part of the
design specification for most embedded systems; that is, these systems are power critical. |
|
This has led many researchers to focus on power
estimation and low-power designs. |
|
Unfortunately, there is little available to help
embedded system designers evaluate their designs in terms of the power
metric. |
|
|
|
|
|
The embedded processors currently used in
designs usually take one of two routes: |
|
“off the shelf” processors or DSPs |
|
Embedded cores which can be incorporated into a
larger chip with other logic and memory |
|
|
|
|
For DSPs, the designer only has the processor
information the manufacturer makes available through the data books. |
|
For the embedded cores, the designer only has
logic/timing simulation models to help verify the designs. |
|
In both cases there is no lower level
information available for power analysis. |
|
|
|
|
Power consumption has been a subject of intense
study, but the previous research has adopted a “bottom-up” approach, using
detailed layouts and sophisticated power analysis tools, which are
expensive and time consuming. |
|
However, no attempts have been made to relate
the power consumption of a processor to the software that executes on it. |
|
|
|
|
It is recognized that the power consumption
of the processor varies from program to program, but there is a lack of
tools to analyze this variation. |
|
The purpose of the research presented in this
paper is to overcome these deficiencies by developing a methodology to
easily analyze the power consumption from the execution of a given program. |
|
|
|
|
|
|
The purpose was met by measuring the current
drawn by the processor as it repeatedly executes certain instructions or
short instruction sequences. |
|
This works because, although modern processors are
extremely complex systems of several interacting blocks,
that internal complexity is hidden behind a simple
interface: the instruction set. |
|
|
|
|
To model the energy consumption of such a complex system,
we calculate the power cost of each instruction. In any given program
there are also inter-instruction effects, such as the effect of the
circuit state, pipeline stalls, and cache misses. |
|
Thus the sum of the power costs of each
instruction plus the power cost of the inter-instruction effects can be an
accurate estimate for the power cost of a given program. |
|
|
|
|
The average power is equal to the product of the
average current and the supply voltage. |
|
P = I * Vcc. |
|
Since power is the rate at which energy is
consumed, the amount of energy consumed is equal to the product of Power
and execution time. E = P * T. |
|
Finally, the execution time is equal to the
product of the number of clock cycles and the clock period. T = N * t. |
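The three relations above combine into E = I * Vcc * N * t. The following is a minimal sketch of that arithmetic, not code from the paper. The 40 MHz clock matches the processor used in the study and the 313.6 mA current is the ADD base cost quoted later in these notes. The 3.3 V supply and the single-cycle execution time are assumptions made only for illustration.

```python
def energy_joules(current_a, vcc_v, cycles, clock_hz):
    """Energy of a code fragment: E = P * T, with P = I * Vcc and T = N * t."""
    power_w = current_a * vcc_v        # P = I * Vcc
    period_s = 1.0 / clock_hz          # t, the clock period
    exec_time_s = cycles * period_s    # T = N * t
    return power_w * exec_time_s       # E = P * T


VCC_V = 3.3        # ASSUMPTION: supply voltage is not stated in these notes
CLOCK_HZ = 40e6    # 40 MHz clock, so t = 25 ns
ADD_CYCLES = 1     # ASSUMPTION: cycle count chosen for illustration

add_energy = energy_joules(0.3136, VCC_V, ADD_CYCLES, CLOCK_HZ)
print(f"{add_energy * 1e9:.3f} nJ")
```

Because Vcc and t are constants for a given setup, the product of current and cycle count is proportional to energy, which is why the later slides work directly in mA and cycles.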
|
|
|
|
For this study a 40 MHz Intel 486DX2-S CPU was
used with 4MB of DRAM. |
|
Although the numbers in this report are specific
to this processor, the methodology used in the model is widely applicable. |
|
The current was measured using a standard,
dual-slope integrating digital ammeter. |
|
|
|
|
The programs being considered were put in
infinite loops and current readings were taken. |
|
The main limitation of this approach is that it
will not work for programs with longer execution times, since the ammeter
may not show a stable reading. Since the main use of this approach was in
determining the current drawn during a particular instruction, this isn’t
much of a problem. |
|
|
|
|
The base cost for an instruction is determined
by constructing a loop with several instances of the same instruction. The
average current being drawn is then measured. This current is then
multiplied by the number of cycles taken by each instance of the instruction. |
|
|
|
|
Here is an example of CPU base costs for some of
the instructions. The numbers in Column 3 are the observed average current
values. The overall base energy cost of an instruction is the product of
Columns 3 and 4 and the constants Vcc and t (the clock period). |
|
|
|
|
It is important not to oversize the loops that
are used to determine the base costs of your program. |
|
|
|
|
|
|
When sequences of instructions are considered,
certain inter-instruction effects come into play, which are not reflected
in the cost computed solely from base costs. |
|
These effects fall into three areas:
circuit state, resource constraints, and cache misses. This is only an overview;
the paper treats each in more detail. |
|
|
|
|
The switching activity in a circuit is a
function of the present inputs and the previous state of the circuit. Thus,
it can be expected that the actual energy cost of executing an instruction
in a program may be different from the instruction’s base cost. This is
because the previous instruction in the given program and in the program
used for base cost may be different. |
|
|
|
|
For example, consider this loop:
XOR BX, 1
ADD AX, DX |
|
The base costs of the XOR and ADD instructions
are 319.2 mA and 313.6 mA. The expected cost of the pair would be their average,
316.4 mA, but the measured cost is 323.3 mA. The difference arises because base costs
are determined while executing the same instruction over and over again. |
|
|
|
|
The cost of a pair of instructions is always
greater than the base cost of the pair, and the difference is termed the
circuit state overhead. |
|
On a final note, after extensive study it was
found that the circuit state overhead has a limited range, between 5.0 mA and
30.0 mA, and is most frequently around 15.0 mA. |
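The circuit state overhead for the XOR/ADD loop can be worked out directly from the currents quoted above. This is a small sketch; the function name is hypothetical, and the plain average follows the notes, which implicitly treats both instructions as taking the same number of cycles.

```python
def circuit_state_overhead(pair_current_ma, base_currents_ma):
    """Measured current of an instruction sequence minus the average of the base costs."""
    expected_ma = sum(base_currents_ma) / len(base_currents_ma)
    return pair_current_ma - expected_ma


# Values from the XOR/ADD example: base costs 319.2 mA and 313.6 mA,
# measured current for the alternating pair 323.3 mA.
overhead_ma = circuit_state_overhead(323.3, [319.2, 313.6])
print(f"{overhead_ma:.1f} mA")
```

For this pair the overhead comes to about 6.9 mA, which lies inside the 5.0 mA to 30.0 mA range reported in the study.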
|
|
|
|
Resource constraints in the CPU can lead to
stalls, e.g., pipeline stalls and write buffer stalls. |
|
These can be considered as another kind of
inter-instruction effect since they cause an increase in the number of
cycles needed to execute a sequence of instructions. |
|
|
|
|
The energy cost of each kind of stall is
determined through experiments that isolate the particular kind of stall. |
|
For example, an average cost of 250 mA for stall
cycles was determined for the prefetch buffer stall. |
|
|
|
|
It has been observed that the cost of stalls can
show some variation depending upon the instructions involved in the stall. |
|
However, in general the use of a single average
cost value for each stall type is sufficient. |
|
|
|
|
|
|
To account for the energy cost of the stalls
during program cost estimation, the number of stall cycles has to be
multiplied by the experimentally determined stall energy cost. This product
is then added to the base cost of the program. The number of stall cycles
is estimated through a traversal of the program code. |
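The stall correction can be sketched as follows. Costs are kept in mA times cycles, which is proportional to energy because Vcc and the clock period are constant. The 250 mA default is the prefetch buffer stall cost quoted earlier in these notes; the base cost and the stall count in the example are hypothetical inputs, not results from the paper.

```python
def add_stall_cost(base_cost_ma_cycles, stall_cycles, stall_current_ma=250.0):
    """Add the cost of stall cycles to a program's base cost (both in mA * cycles)."""
    return base_cost_ma_cycles + stall_cycles * stall_current_ma


# ASSUMPTION: a base cost of 2000.0 mA*cycles and 2 stall cycles,
# chosen only to illustrate the arithmetic.
total_cost = add_stall_cost(2000.0, stall_cycles=2)
print(total_cost)
```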
|
|
|
|
The last effect studied was the effect of cache
misses. |
|
For a cache miss, a certain cycle penalty has to
be added to the instruction execution time, which leads to extra cycles
being consumed, which leads to an energy penalty. |
|
|
|
|
An average penalty of 215 mA for cache miss
cycles has been obtained. This has to be multiplied by the average number
of miss penalty cycles to get the average energy penalty for one miss. The
per-miss penalty is then multiplied by the cache miss rate and added to the base
cost estimate. |
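A minimal sketch of the cache miss correction follows, again in mA times cycles. The 215 mA figure comes from the notes. The miss penalty of 4 cycles, the access count, and the miss rate are hypothetical. Turning the miss rate into a number of misses by multiplying it by the number of accesses is an assumption about how the rate is applied.

```python
def cache_miss_cost(accesses, miss_rate, penalty_cycles, miss_current_ma=215.0):
    """Total cache miss penalty in mA * cycles."""
    per_miss = miss_current_ma * penalty_cycles   # average penalty for one miss
    expected_misses = accesses * miss_rate        # ASSUMPTION: rate is per access
    return per_miss * expected_misses


# ASSUMPTION: 1000 memory accesses, 2% miss rate, 4 penalty cycles per miss.
miss_cost = cache_miss_cost(accesses=1000, miss_rate=0.02, penalty_cycles=4)
print(miss_cost)
```

The result would be added to the base cost estimate in the same way as the stall cost.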
|
|
|
|
Here is an illustration of the estimation
process. |
|
This program has three basic blocks, with the
average current and number of cycles for each instruction. |
|
|
|
|
For each block the two columns are multiplied
and the products are summed to get the base energy cost of one instance of
the basic block. |
|
The values are 1713.4, 4709.8, 2017.9
respectively. |
|
|
|
|
Multiplying the base cost of each basic block by
the number of times it is executed and adding the cost of the jump we get a
number proportional to the total energy cost of the program. |
|
|
|
|
Then we divide it by the estimated number of
clock cycles, 72, and we get an average current of 369.1 mA. Adding the
circuit state overhead offset of 15.0 mA gives 384.0 mA. The actual measured
current is 385.0 mA. |
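The estimation steps can be sketched end to end as below. This shows only the shape of the calculation under stated assumptions: the instruction currents, cycle counts, and execution counts are hypothetical and do not reproduce the slide's example, whose per-instruction table is not included in these notes. The 15.0 mA offset is the typical circuit state overhead quoted above.

```python
def block_base_cost(instructions):
    """Base cost of one basic block: sum of (current in mA) * (cycles) per instruction."""
    return sum(current_ma * cycles for current_ma, cycles in instructions)


def estimate_average_current(blocks, extra_cost, total_cycles, offset_ma=15.0):
    """blocks: list of (base cost per instance, execution count).
    extra_cost: stall, jump, or miss costs in mA * cycles.
    Returns the estimated average current in mA, including the offset."""
    total = sum(cost * count for cost, count in blocks) + extra_cost
    return total / total_cycles + offset_ma


# ASSUMPTION: a single block of two instructions (300 mA for 1 cycle,
# 330 mA for 2 cycles) executed twice, with no extra costs.
block = block_base_cost([(300.0, 1), (330.0, 2)])    # 960 mA*cycles, 3 cycles
avg_ma = estimate_average_current([(block, 2)], extra_cost=0.0, total_cycles=6)
print(avg_ma)
```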
|
|
|
|
|
|
|
|
While the reordering of a given set of
instructions in a piece of code may have a limited impact on the energy
cost, the choice of which instructions are used in the generated code can
significantly affect the cost. |
|
|
|
|
|
This paper presents a methodology for analyzing
the energy consumption of embedded software. |
|
The motivation for the analysis is three-fold. |
|
It provides insight into the energy consumption
in processors. |
|
It can be used to help verify if an embedded
design meets its energy constraints and guide the development so that it
does meet the constraints. |
|
Attempts at code re-writing demonstrate
significant power reductions—justifying the motivation for such a power
analysis technique. |
|
|
|
|
|
|
Current microprocessor architectures have become
dominated by the data access bottleneck in the cache, system bus and main
memory subsystems. These subsystems also have a large influence on the system's
power consumption. |
|
|
|
|
In order to provide high data throughput at
reasonable power consumption for these demanding applications, novel
solutions for the memory access and data transfer will have to be
introduced. These will have to be both at the processor architecture and the
compiler level. |
|
|
|
|
The question this paper addresses
is what these solutions would look like. |
|
The paper shows that these solutions will be
based on processor architecture optimizations, sophisticated application of
compiler technology, and exploiting the interface between system
hardware/software. |
|
|
|
|
Due to the quadratic dependence of dynamic power on supply voltage,
voltage reduction is the most favored
method of reducing power. |
|
It has been shown that aggressive voltage
reductions are possible if architectural and algorithmic transformations
(pipelining and parallelism) are applied to the problem to regain the
performance lost to voltage reduction. |
|
This works well for throughput-oriented,
limited-function applications (e.g., digital filtering). |
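The trade described above can be illustrated with the standard first-order model of dynamic CMOS power, P = C * V^2 * f. This is only a sketch: the capacitance is in arbitrary units, the voltages are hypothetical, and the model ignores leakage as well as the extra routing and control overhead that parallel hardware adds.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """First-order dynamic power model: P = C * V^2 * f (arbitrary capacitance units)."""
    return c_eff * vdd ** 2 * freq_hz


# Baseline: one unit at 3.3 V and 40 MHz.
baseline = dynamic_power(1.0, 3.3, 40e6)

# ASSUMPTION: two parallel units (twice the capacitance), each clocked at half
# the frequency, which allows the supply to drop to a hypothetical 2.0 V.
parallel = dynamic_power(2.0, 2.0, 20e6)

ratio = parallel / baseline
print(f"parallel design uses {ratio:.2f}x the baseline power at equal throughput")
```

Throughput stays the same because two units at half speed complete as many operations per second as one unit at full speed, while the lower voltage cuts power quadratically.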
|
|
|
|
More recently, architectural optimizations aimed
primarily at power reduction have become an active area of research. |
|
Here are the main ideas classified by theme. |
|
|
|
|
This configures the caches, register-files, etc.
to the optimal size for the desired power/performance. |
|
|
|
|
Creating mini-caches and mini-TLBs to avoid the
cost of looking up the larger main cache. |
|
Value locality which saves the most recent
computations to avoid re-computation. |
|
|
|
|
Partitioning the cache so that only the bank
that is needed has to be powered up, and
partitioning data paths by word width. |
|
|
|
|
Dynamically reducing the speculation in the
machine to reduce power—e.g. limiting instruction issue if the number of
predicted branches exceeds a limit. |
|
|
|
|
A loop cache into which basic blocks are
statically allocated by the compiler. |
|
|
|
|
Optimizations such as those on the last slides are local
to the CPU. Power reduction techniques of a wider scope are possible if the
CPU is seen as a component of an overall system. |
|
This allows for each component to be powered-up
or down whenever appropriate. |
|
This motivates the application of dynamic power
management systems. |
|
|
|
|
For embedded applications, there are
opportunities for additional flexibility, and power management systems tuned
for specific applications can be an extremely efficient means of power reduction. |
|
Improved modeling of system behavior has gained
a lot of attention lately, along with improved power management policies. |
|
|
|
|
An additional source of power efficiency comes
from extending power management to include control on the CPU’s voltage and
performance. |
|
Dynamic voltage/frequency scaling has high potential, but
requires a unified hardware/software approach. |
|
Multi-media applications are ideally suited for
dynamic voltage/freq scaling since they often have regular activity
patterns that can be pre-characterized. |
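One way to picture why regular, pre-characterized workloads suit dynamic voltage/frequency scaling is a policy that picks the slowest operating point that still meets each frame deadline. The sketch below is hypothetical throughout: the operating points, the cycle count per frame, and the function name are invented for illustration and are not taken from the paper.

```python
def pick_operating_point(cycles_per_frame, deadline_s, points):
    """Return the lowest-frequency (freq_hz, vdd) pair that meets the deadline.
    Falls back to the fastest point if none is fast enough."""
    feasible = [p for p in points if cycles_per_frame / p[0] <= deadline_s]
    if not feasible:
        return max(points, key=lambda p: p[0])
    return min(feasible, key=lambda p: p[0])


# ASSUMPTION: three hypothetical operating points and a 30 frames/s workload
# that needs 1,000,000 cycles per frame.
POINTS = [(20e6, 1.8), (40e6, 2.5), (80e6, 3.3)]
chosen = pick_operating_point(1_000_000, 1.0 / 30.0, POINTS)
print(chosen)
```

Here the 20 MHz point would miss the deadline, so the 40 MHz point is selected and the processor avoids running at the highest voltage.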
|
|
|
|
In the domain of algorithm transformations and
compilation technology for embedded data-dominated applications, there has
been a lot of work for the traditional metrics of cost and performance. |
|
Decisions made at this stage heavily influence
the final outcome when the appropriate architectural issues of the embedded
memories are correctly incorporated. |
|
|
|
|
This has to happen at the instruction-level
parallelism compiler and in the preceding system compilation stages. |
|
|
|
|
|
|
Data Transfer and Storage Exploration (DTSE)
is an important pre-compilation step. |
|
Reducing the size and number of transfers
decreases the power consumption of the memory system while preserving
the behavior. |
|
|
|
|
|
The major principles of source-to-source
transformations of the DTSE methodology are: |
|
Global data-flow transformation to avoid
redundant transfers. |
|
Global loop and control flow transformations to
increase locality of reference. |
|
Data reuse exploration to exploit the available
memory hierarchy. |
|
SDRAM memory organization. |
|
Data layout decisions to reduce the memory size
and improve the cache hit rates. |
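As a small illustration of the first two principles, the sketch below fuses two loops so that an intermediate array is never written to or read back from memory. It is a toy example rather than part of the DTSE tooling: the functions and the transfer counter are hypothetical, and the counter tracks only accesses to the intermediate array.

```python
def unfused(data):
    """Two separate loops communicating through an intermediate array."""
    transfers = 0
    tmp = []
    for x in data:            # loop 1: produce the intermediate array
        tmp.append(x * 2)
        transfers += 1        # one write to tmp
    out = []
    for t in tmp:             # loop 2: consume the intermediate array
        transfers += 1        # one read from tmp
        out.append(t + 1)
    return out, transfers


def fused(data):
    """Same computation after loop fusion: the intermediate value stays local."""
    out = [x * 2 + 1 for x in data]
    return out, 0             # no intermediate-array transfers remain


print(unfused([1, 2, 3]))
print(fused([1, 2, 3]))
```

Both versions compute the same result, which is the sense in which such transformations preserve behavior while removing redundant transfers.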
|
|
|
|
Early experiments demonstrated reduced energy
consumption through improved register allocation, resulting in fewer spills
to memory. |
|
Compiler techniques that improve data locality
through coarse-grain transformations and data layout optimization result
in significantly fewer cache misses, leading to improved performance and
lower power dissipation. |
|
|
|
|
Similarly, instruction scheduling techniques to
reduce instruction cache misses have been developed, resulting in fewer
bus transitions per off-chip memory transfer. |
|
Recent work in memory-aware compilation aims to
better exploit memory access protocols of contemporary DRAMs for improving
the memory bandwidth of applications. |
|
|
|
|
The effects of such compiler optimizations on
power dissipation require a comprehensive measurement or simulation
environment, since the relationship between performance and power or energy
is not easily predictable. |
|
|
|
|
Finally, compiler-controlled power management
techniques that dynamically trade off power for
performance are beginning to appear. The compiler, through a combination of static analysis,
profile-driven data and feedback-driven optimization, can thus modify the
power/performance characteristics of the target architecture, in concert
with system-level power management schemes. |
|