Power Analysis of Embedded Software
by Vivek Tiwari, Sharad Malik, and Andrew Wolfe
Introduction…
The use of embedded systems is constantly increasing.
Part of this increase is due to the switch from application specific logic to application specific code running on existing processors.
Introduction…
This change is driven by two distinct forces:
The increasing cost of setting up and maintaining a fabrication line
Increased pressures to reduce the time to market and maintain a reliable schedule
This is why we are seeing the basic unit of computation change from the logic gate to an instruction running on an embedded processor.
Introduction…
Power constraints form an important part of the design specification for most embedded systems; such systems are termed power critical.
This has led many researchers to focus on power estimation and low-power designs.
Unfortunately, there is little available to help embedded system designers evaluate their designs in terms of the power metric.
Introduction…
The embedded processors currently used in designs usually take one of two routes:
“off the shelf” processors or DSPs
Embedded cores which can be incorporated into a larger chip with other logic and memory
Introduction…
For DSPs, the designer only has the processor information the manufacturer makes available through the data books.
For the embedded cores, the designer only has logic/timing simulation models to help verify the designs.
In both cases there is no lower level information available for power analysis.
The Purpose
Power consumption has been a subject of intense study, but the previous research has adopted a “bottom-up” approach, using detailed layouts and sophisticated power analysis tools, which are expensive and time consuming.
However, no attempts have been made to relate the power consumption of a processor to the software that executes on it.
Purpose
It is recognized that the power consumption of the processor varies from program to program, but there is a lack of tools to analyze this variation.
The purpose of the research presented in this paper is to overcome these deficiencies by developing a methodology to easily analyze the power consumption from the execution of a given program.
How the Purpose was met and the logic behind it
The purpose was met by measuring the current drawn by the processor as it repeatedly executes certain instructions or short instruction sequences.
This works because, although modern processors are extremely complex systems of several interacting blocks, their internal complexity is hidden behind a simple interface: the instruction set.
How the Purpose was met and the logic behind it
To model the energy consumption of this complex system, we calculate the power cost of each instruction. In any given program there are also inter-instruction effects, such as the effect of circuit state, pipeline stalls, and cache misses.
Thus the sum of the power costs of each instruction plus the power cost of the inter-instruction effects can be an accurate estimate for the power cost of a given program.
Quick Math Summary
The average power is equal to the product of the average current and the supply voltage.
    P = I * Vcc
Since power is the rate at which energy is consumed, the energy consumed is equal to the product of the power and the execution time.
    E = P * T
Finally, the execution time is equal to the product of the number of clock cycles and the clock period.
    T = N * t
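The three relations above can be chained into one small calculation. This is a minimal sketch, not code from the paper: the function name is hypothetical, the 3.3 V supply voltage is an assumed value (these notes do not give Vcc), and the single-cycle count for XOR is likewise assumed. The 40 MHz clock and the 319.2 mA XOR current are taken from elsewhere in these notes.

```python
def energy_joules(avg_current_a, vcc_v, cycles, clock_period_s):
    """Chain the three relations: P = I * Vcc, T = N * t, E = P * T."""
    power_w = avg_current_a * vcc_v          # P = I * Vcc
    exec_time_s = cycles * clock_period_s    # T = N * t
    return power_w * exec_time_s             # E = P * T


CLOCK_PERIOD_S = 1 / 40e6  # 25 ns, from the 40 MHz clock quoted in these notes
VCC_V = 3.3                # ASSUMED supply voltage; not stated in these notes

# XOR base current (319.2 mA) from the notes; one cycle is an assumption.
xor_energy = energy_joules(0.3192, VCC_V, cycles=1, clock_period_s=CLOCK_PERIOD_S)
print(f"{xor_energy * 1e9:.2f} nJ")
```

Because Vcc and t are constants for a given setup, comparing instructions or programs only requires the product of current and cycle count, which is why the later slides work in mA and cycles.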
How the current was measured
For this study a 40 MHz Intel 486DX2-S CPU was used with 4MB of DRAM.
Although the numbers in this report are specific to this processor, the methodology used in the model is widely applicable.
The current was measured using a standard, dual-slope integrating digital ammeter.
How the current was measured
The programs being considered were put in infinite loops and current readings were taken.
The main limitation of this approach is that it will not work for programs with longer execution times, since the ammeter may not show a stable reading. Since the main use of this approach was to determine the current drawn during a particular instruction, this isn't much of a problem.
Base Energy Cost
The base cost for an instruction is determined by constructing a loop with several instances of the same instruction. The average current being drawn is then measured. This current is then multiplied by the number of cycles taken by each instance of the instruction.
Base Cost Examples
Here is an example of CPU base costs for some of the instructions. The numbers in Column 3 are the observed average current values. The overall base energy cost of an instruction is the product of Columns 3 and 4 and the constants Vcc and t (the clock period).
Base Energy Cost
It is important not to oversize the loops used to determine base costs: a loop too large to fit in the cache incurs cache misses, which distort the measured current.
Inter-Instruction Effects
When sequences of instructions are considered, certain inter-instruction effects come into play, which are not reflected in the cost computed solely from base costs.
These effects arise in three areas: circuit state, resource constraints, and cache misses. The following slides give only an overview; the paper treats each in more detail.
Inter-Instruction Effects:     circuit state
The switching activity in a circuit is a function of the present inputs and the previous state of the circuit. Thus, it can be expected that the actual energy cost of executing an instruction in a program may be different from the instruction’s base cost. This is because the previous instruction in the given program and in the program used for base cost may be different.
Inter-Instruction Effects:     circuit state
For example, consider this loop:
    XOR    BX, 1
    ADD    AX, DX
The base costs of the XOR and ADD instructions are 319.2 mA and 313.6 mA. The expected base cost would be their average, 316.4 mA, but in actuality the cost is 323.3 mA. This is because the base costs are determined while executing the same instruction over and over again.
Inter-Instruction Effects:     circuit state
The cost of a pair of instructions is always greater than the base cost of the pair, and the difference is termed the circuit state overhead.
On a final note, after extensive study it was found that the circuit state overhead has a limited range, between 5.0 mA and 30.0 mA, and is most frequently around 15.0 mA.
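The XOR/ADD example can be checked with a few lines of arithmetic. This is a minimal sketch using the currents quoted above; the function name is hypothetical, and the plain average assumes both instructions take the same number of cycles.

```python
def circuit_state_overhead_ma(base_currents_ma, measured_ma):
    """Measured current of an instruction sequence minus the average of
    the base currents (assumes equal cycle counts per instruction)."""
    expected_ma = sum(base_currents_ma) / len(base_currents_ma)
    return measured_ma - expected_ma


# Values from the XOR/ADD loop in these notes.
overhead = circuit_state_overhead_ma([319.2, 313.6], measured_ma=323.3)
print(round(overhead, 1))
```

The result for this pair falls inside the 5.0 mA to 30.0 mA range reported above.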
Inter-Instruction Effects:     Resource Constraints
Resource constraints in the CPU can lead to stalls, e.g. pipeline stalls and write buffer stalls.
These can be considered as another kind of inter-instruction effect since they cause an increase in the number of cycles needed to execute a sequence of instructions.
Inter-Instruction Effects:     Resource Constraints
The energy cost of each kind of stall is determined through experiments that isolate the particular kind of stall.
For example, an average cost of 250 mA for stall cycles was determined for the prefetch buffer stall.
Inter-Instruction Effects:     Resource Constraints
It has been observed that the cost of stalls can show some variation depending upon the instructions involved in the stall.
However, in general the use of a single average cost value for each stall type is sufficient.
Inter-Instruction Effects:     Resource Constraints
To account for the energy cost of the stalls during program cost estimation, the number of stall cycles has to be multiplied by the experimentally determined stall energy cost. This product is then added to the base cost of the program. The number of stall cycles is estimated through a traversal of the program code.
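The stall adjustment described above can be sketched as follows. Costs are kept in mA x cycles, since Vcc and the clock period are constant. The function name and the sample base cost, cycle count, and stall count are hypothetical; only the 250 mA prefetch buffer stall figure comes from these notes.

```python
def add_stall_cost(base_cost_ma_cycles, base_cycles, stall_cycles,
                   stall_current_ma=250.0):
    """Add stall cycles to a program's base cost.

    Returns (total cost in mA*cycles, total cycles). The default stall
    current is the prefetch buffer stall figure quoted in the notes.
    """
    total_cost = base_cost_ma_cycles + stall_cycles * stall_current_ma
    total_cycles = base_cycles + stall_cycles
    return total_cost, total_cycles


# Illustrative inputs only: 3000 mA*cycles over 10 cycles, plus 2 stall cycles.
cost, cycles = add_stall_cost(3000.0, 10, stall_cycles=2)
print(cost, cycles, cost / cycles)
```

Note that stalls add to the cycle count as well as to the cost, so the average current can fall even though the total energy rises.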
Inter-Instruction Effects:     Cache Misses
The last effect studied was the effect of cache misses.
For a cache miss, a certain cycle penalty has to be added to the instruction execution time, which leads to extra cycles being consumed, which leads to an energy penalty.
Inter-Instruction Effects:     Cache Misses
An average penalty of 215 mA for cache miss cycles has been obtained. This has to be multiplied by the average number of miss penalty cycles to get the average energy penalty for one miss. Then multiply the average penalty by the cache miss rate and add it to the base cost estimate.
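One way to read the cache miss adjustment is sketched below. It is an interpretation under stated assumptions, not the paper's code: the function name is hypothetical, and turning a miss rate into an expected number of misses by multiplying by the number of memory accesses is an assumption. The sample miss rate, access count, and penalty cycles are made up for illustration; only the 215 mA figure comes from these notes.

```python
def cache_miss_penalty_ma_cycles(miss_rate, accesses, penalty_cycles_per_miss,
                                 miss_current_ma=215.0):
    """Expected cache miss penalty in mA*cycles.

    per-miss penalty = miss current * average penalty cycles per miss
    expected misses  = miss rate * number of accesses (ASSUMED reading)
    """
    per_miss = miss_current_ma * penalty_cycles_per_miss
    return per_miss * miss_rate * accesses


# Illustrative inputs only: 2% miss rate, 1000 accesses, 4 penalty cycles.
penalty = cache_miss_penalty_ma_cycles(0.02, 1000, 4)
print(penalty)  # added to the base cost estimate of the program
```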
Estimation Framework
Here is an illustration of the estimation process.
This program has three basic blocks, with the average current and number of cycles for each instruction.
Estimation Framework
For each block the two columns are multiplied and the products are summed to get the base energy cost of one instance of the basic block.
The values are 1713.4, 4709.8, and 2017.9, respectively.
Estimation Framework
Multiplying the base cost of each basic block by the number of times it is executed and adding the cost of the jump we get a number proportional to the total energy cost of the program.
Estimation Framework
Then we divide it by the estimated number of clock cycles, 72, and get an average current of 369.1 mA. Adding the circuit state overhead offset of 15.0 mA gives 384.0 mA. The actual measured current is 385.0 mA.
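The whole estimation procedure can be sketched in a few lines. This is a minimal illustration, not the authors' tool: the function name and data layout are hypothetical, and the sample block is made up, because these notes do not reproduce the per-instruction table, the execution counts, or the jump cost of the paper's example. Only the 15.0 mA offset is taken from the notes.

```python
def estimate_average_current_ma(blocks, extra_cost_ma_cycles=0.0,
                                extra_cycles=0, circuit_state_offset_ma=15.0):
    """Estimate a program's average current.

    blocks: list of (instructions, exec_count), where instructions is a
            list of (avg_current_ma, cycles) pairs for one basic block.
    extra_*: cost and cycles not inside any block (e.g. a jump).
    """
    total_cost = extra_cost_ma_cycles
    total_cycles = extra_cycles
    for instructions, exec_count in blocks:
        # Base cost of one instance of the block: sum of current * cycles.
        block_cost = sum(current * cycles for current, cycles in instructions)
        block_cycles = sum(cycles for _, cycles in instructions)
        total_cost += block_cost * exec_count
        total_cycles += block_cycles * exec_count
    # Average base current plus the constant circuit state overhead.
    return total_cost / total_cycles + circuit_state_offset_ma


# Illustrative block: two instructions (300 mA x 1, 320 mA x 2), run twice.
sample = [([(300.0, 1), (320.0, 2)], 2)]
print(round(estimate_average_current_ma(sample), 1))
```

Stall and cache miss penalties, where present, would be added to the total cost and cycle count before the division.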
Final Overview of Technique
Optimization Note
While the reordering of a given set of instructions in a piece of code may have a limited impact on the energy cost, the choice of which instructions are used in the generated code can significantly affect the cost.
Summary
This paper presents a methodology for analyzing the energy consumption of embedded software.
The motivation for the analysis is three-fold.
It provides insight into the energy consumption in processors.
It can be used to help verify if an embedded design meets its energy constraints and guide the development so that it does meet the constraints.
Attempts at code re-writing demonstrate significant power reductions—justifying the motivation for such a power analysis technique.
System and architecture-level power reduction of  microprocessor-based communication and multi-media applications
By Lode Nachtergaele, Vivek Tiwari, Nikil Dutt
Introduction…
Current microprocessor architectures have become dominated by the data access bottleneck in the cache, system bus and main memory subsystems. These subsystems also have a large influence on the system's power consumption.
Introduction…
In order to provide high data throughput at reasonable power consumption for these demanding applications, novel solutions for the memory access and data transfer will have to be introduced. These will have to be both at the processor architecture and the compiler level.
Introduction…
The question this paper addresses is what these solutions would look like.
The paper shows that these solutions will be based on processor architecture optimizations, sophisticated application of compiler technology, and exploiting the interface between system hardware/software.
Architecture Optimizations
Due to the quadratic dependence of dynamic power on supply voltage, voltage reduction is the most favored method of reducing power.
It has been shown that aggressive voltage reductions are possible if architectural and algorithmic transformations are applied to the problem (pipelining and parallelism) to regain the lost performance of voltage reduction.
This works well for throughput-oriented, limited-function applications (e.g. digital filtering).
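The trade described above can be illustrated with the standard CMOS dynamic power model, P = C * V^2 * f. That model is an assumption brought in here, since the slides do not state it, and the function name and scaling factors below are hypothetical. The sketch duplicates a datapath (twice the switched capacitance), runs each copy at half the clock rate so throughput is unchanged, and uses the slack to lower the voltage.

```python
def relative_dynamic_power(cap_scale, voltage_scale, freq_scale):
    """Dynamic power relative to a baseline, using P = C * V^2 * f."""
    return cap_scale * voltage_scale ** 2 * freq_scale


# Illustrative scaling factors only: 2x capacitance, 0.6x voltage, 0.5x clock.
ratio = relative_dynamic_power(cap_scale=2.0, voltage_scale=0.6, freq_scale=0.5)
print(round(ratio, 2))
```

With these made-up factors the parallel design does the same work at roughly a third of the baseline dynamic power; real designs also pay for extra routing and control, which this sketch ignores.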
Architecture Optimizations
More recently, architectural optimizations aimed primarily at power reduction have become an active area of research.
Here are the main ideas classified by theme.
Module Parameter Tradeoffs
This configures the caches, register-files, etc. to the optimal size for the desired power/performance.
Exploiting locality both for instructions and data
Creating mini-caches and mini-TLBs to avoid the cost of looking up the larger main cache.
Value locality which saves the most recent computations to avoid re-computation.
Enabling more power down
Partitioning the cache so that only the one necessary bank is powered up, and partitioning data paths by word width.
Speculation Reduction
Dynamically reducing the speculation in the machine to reduce power—e.g. limiting instruction issue if the number of predicted branches exceeds a limit.
Hardware hooks to allow for more software control on power
A loop cache into which basic blocks are statically allocated by the compiler.
Architecture Optimizations
Optimizations such as those on the previous slides are local to the CPU. Power reduction techniques of a wider scope are possible if the CPU is seen as a component of an overall system.
This allows for each component to be powered-up or down whenever appropriate.
This motivates the application of dynamic power management systems.
Architecture Optimizations
For embedded applications, there are opportunities for additional flexibility, and power management systems tuned for specific applications can be an extremely efficient means of power reduction.
Improved modeling of system behavior has gained a lot of attention lately, along with improved power management policies.
Architecture Optimizations
An additional source of power efficiency comes from extending power management to include control on the CPU’s voltage and performance.
Dynamic voltage/frequency scaling has high potential, but requires a unified hardware/software approach.
Multi-media applications are ideally suited for dynamic voltage/freq scaling since they often have regular activity patterns that can be pre-characterized.
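A pre-characterized workload makes the scaling decision simple, as the sketch below suggests. It is an illustration under assumptions, not a method from the paper: the function name and the operating points are hypothetical, and energy per task is taken to scale with V^2 for a fixed cycle count (the standard dynamic power model, which these slides do not state).

```python
def pick_operating_point(cycles_needed, deadline_s, operating_points):
    """Choose the (freq_hz, vdd_v) pair that meets the deadline with the
    lowest voltage, since energy for a fixed cycle count scales with V^2.

    Returns None if no operating point is fast enough.
    """
    feasible = [(f, v) for f, v in operating_points
                if cycles_needed / f <= deadline_s]
    if not feasible:
        return None
    return min(feasible, key=lambda point: point[1] ** 2)


# Hypothetical operating points: (clock in Hz, supply voltage in V).
points = [(100e6, 1.0), (200e6, 1.3), (400e6, 1.8)]

# A frame needing 3 million cycles with a 20 ms deadline.
choice = pick_operating_point(3e6, 0.020, points)
print(choice)
```

The regular per-frame activity of multimedia code is what lets `cycles_needed` be known ahead of time; for irregular workloads the estimate would have to come from prediction or feedback.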
Optimized Platform Mappings
In the domain of algorithm transformations and compilation technology for embedded data-dominated applications, there has been a lot of work for the traditional metrics of cost and performance.
Decisions made at this stage heavily influence the final outcome when the appropriate architectural issues of the embedded memories are correctly incorporated.
Optimized Platform Mappings
This has to happen both in the instruction-level parallelism (ILP) compiler and in the preceding system-level compilation stages.
System-level Code Transformations
Data Transfer and Storage Exploration (DTSE) is an important pre-compilation step.
Reducing the size and number of transfers decreases the power consumption of the memory system while preserving the behavior.
System-level Code Transformations
The major principles of source-to-source transformations of the DTSE methodology are:
Global data-flow transformation to avoid redundant transfers.
Global loop and control flow transformations to increase locality of reference.
Data reuse exploration to exploit the available memory hierarchy.
SDRAM memory organization.
Data layout decisions to reduce the memory size and improve the cache hit rates.
Platform Compiler Technology
Early experiments demonstrated reduced energy consumption through improved register allocation, resulting in fewer spills to memory.
Compiler techniques that improve data locality through coarse-grain transformations and data layout optimization result in significantly fewer cache misses, leading to improved performance and lower power dissipation.
Platform Compiler Technology
Similarly, instruction scheduling techniques to reduce instruction cache misses have been developed, resulting in fewer bus transitions per off-chip memory transfer.
Recent work in memory-aware compilation aims to better exploit memory access protocols of contemporary DRAMs for improving the memory bandwidth of applications.
Platform Compiler Technology
The effects of such compiler optimizations on power dissipation require a comprehensive measurement or simulation environment, since the relationship between performance and power or energy is not easily predictable.
Platform Compiler Technology
Finally, compiler-controlled power management techniques that dynamically trade off power for performance are beginning to appear. The compiler, through a combination of static analysis, profile-driven data and feedback-driven optimization, can thus modify the power/performance characteristics of the target architecture, in concert with system-level power management schemes.