| Instructor: | Frank Vahid (vahid@cs.ucr.edu). Office hours TR 2-3, Bourns A207 |
| Class meeting time | TuTh 9:40am-11am, HMNSS 1404 |
| Prerequisites | Background in programming, digital design, and architecture |
| Textbooks | None -- all readings will be online papers. |
| Grade | Based on presentations, participation and a few possible homeworks. |
Title: ITRS roadmap, SOC and design
Summary: The International Technology Roadmap for Semiconductors
is a widely-read document put together by representatives from leading
chip companies. It charts the future of chips and points out
key challenges to be overcome. A general understanding of the trends
in this field is important to understanding the future of embedded
systems -- namely, that chip capacities are going ballistic, and
that designing computing systems that take advantage of such capacity
is very hard. Such capacity may enable totally new approaches to
chip design. Of particular interest to our class are programmable
platforms: pre-designed, over-designed chips for particular domains,
like networking, digital cameras, cell-phones, set-top boxes, or
video games.
Back to online readings table
Title: Platform Tuning for Embedded System Design
Summary: Huge chip capacities are outpacing designer productivity,
meaning such chips are underutilized. Companies with chip-design and
domain expertise are thus creating programmable platforms: pre-designed,
over-designed systems for particular domains, like digital cameras, thus
easing the end-product designer's burden. Such platforms typically
have parameterized architectures. We want to tune those parameters
to give the best performance and power. This paper introduces this
new field of platform tuning, and discusses research being done at
UCR in this field.
Title: Power Analysis of Embedded Software : A first step
toward software power minimization
Summary:
1. The goal of this research is to present a
methodology for developing and validating an
instruction-level power model for any given processor.
The model is used for:
1.1 power estimation
1.2 power optimization
1.3 verifying that a design meets its specification
2. Experimental methods:
2.1 The aim is to provide a method that makes it
possible to talk about the power/energy cost of a
given program on a given processor.
2.2 Hypothesis: By measuring the current drawn by the
processor as it repeatedly executes certain
instructions or certain short instruction sequences,
it is possible to obtain most of the information that
is needed to evaluate the power cost of a program for
that processor.
2.3 Base energy cost: measured by constructing a loop of
identical instructions.
Several factors affect the base energy:
the effect of the loop instruction, the
operands (memory, register) of the instruction,
cache misses, etc.
2.4 Inter-instruction effects
2.4.1 Circuit state
2.4.2 Effect of resource constraints
2.4.3 Cache Misses
3. Generation of Energy Efficient Code
3.1 Power consumed by instructions with memory
operands is much higher than that consumed by
instructions with register operands.
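The model outlined above can be read as a simple sum: each instruction contributes its base cost, each change from one instruction to a different one adds an inter-instruction (circuit-state) overhead, and stalls or cache misses add further penalties. A minimal Python sketch follows, assuming that reading of the summary. The function name, the instruction names and every energy value are hypothetical placeholders, not measurements from the paper.

```python
from typing import Dict, List, Tuple


def program_energy(
    trace: List[str],
    base_cost: Dict[str, float],
    pair_overhead: Dict[Tuple[str, str], float],
    other_effects: float = 0.0,
) -> float:
    """Estimate the energy of an instruction trace (arbitrary energy units)."""
    # Base cost: what each instruction costs when run repeatedly in a loop.
    total = sum(base_cost[op] for op in trace)
    # Circuit-state overhead: extra cost when consecutive instructions differ.
    for prev, cur in zip(trace, trace[1:]):
        if prev != cur:
            total += pair_overhead.get((prev, cur), 0.0)
    # Stalls, cache misses and similar effects are lumped into one term.
    return total + other_effects


# Placeholder numbers only: the memory-operand form is made more expensive
# than the register form, in line with point 3.1 above.
BASE = {"load_mem": 4.0, "add_reg": 1.0}
OVERHEAD = {("load_mem", "add_reg"): 0.5}

if __name__ == "__main__":
    print(program_energy(["load_mem", "add_reg", "add_reg"], BASE, OVERHEAD))
```

With such a table in hand, two code sequences that compute the same result can be compared by their estimated energy before either is run on hardware.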
Title: High-level power modeling, Estimation, and Optimization
Summary:
1. This paper is a survey of the most successful and
innovative ideas in power modeling, estimation, and
optimization.
2. Power Modeling and Estimation
2.1 Statistical sampling
2.1.1 Static: relies on probabilistic information
about the input signals and their correlations to
estimate the internal switching activity of the
circuit.
Limitation: the parameters used in the model cannot
be obtained accurately.
2.1.2 Dynamic: simulates the circuit on a typical input
stream.
Problem is that with simulations
2.2 Probabilistic Compaction
2.3 RT-level power estimation
Uses capacitance models for circuit modules and
activity profiles for data or control signals.
2.4 Behavioral-level power estimation
3. Synthesis
3.1 Operation scheduling: shut down resources that are
not performing useful computations.
3.2 Resource Allocation
3.2.1 three classes of resources: Registers,
functional units and interconnections.
3.3 Multiple supply voltage scheduling
3.4 Control synthesis: controlling the process or
pattern by which the circuit is synthesized.
4. Optimization:
4.1 Bus encoding
4.2 Control logic: appears similar to control
synthesis.
4.3 Retiming
4.4 Shutdown techniques
Title: Energy Dissipation in General Purpose Processors
Summary:
This paper investigates the energy savings that result from
pipelining and superscalar issue. The authors analyze
energy dissipation for three ideal machines where the only energy included
is from reading and writing memories and clocking storage elements. The
results show that both pipelined and superscalar machines achieve a much
better energy-delay product. Real machines were roughly half as
efficient as the ideal machines.
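The energy-delay product used as the metric here multiplies the energy spent per instruction by the time per instruction, so a design only scores well if it is both frugal and fast. A small Python sketch, with hypothetical numbers chosen purely for illustration (they are not results from the paper):

```python
def energy_delay_product(energy_per_instr: float, delay_per_instr: float) -> float:
    """Energy-delay product; lower is better (arbitrary units)."""
    return energy_per_instr * delay_per_instr


# Hypothetical machines: the pipelined one spends a little more energy per
# instruction (extra latches are clocked) but finishes instructions sooner.
simple_edp = energy_delay_product(energy_per_instr=1.0, delay_per_instr=4.0)
pipelined_edp = energy_delay_product(energy_per_instr=1.25, delay_per_instr=1.0)

if __name__ == "__main__":
    print(simple_edp, pipelined_edp)
```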
Title: The Design of a High Performance Low Power Microprocessor
Summary:
This paper discusses various power optimizations applied to the
StrongARM 110 processor in order to achieve low power consumption
without sacrificing high performance.
Title: The Technology Behind Crusoe Processors
Summary:
This paper discusses the new Crusoe processor from Transmeta.
This processor is an x86 compatible processor that achieves tremendous
power savings while having good performance. The Crusoe uses a highly
power efficient VLIW processor which is surrounded by CodeMorphing
Software. This software handles the dynamic translation of x86
instructions into the instruction set of the VLIW processor. This
software reduces the size of the Crusoe to 1/4 of the size of the Pentium
III and allows for many performance optimizations, such as a translation
cache.
Title: The SimpleScalar Tool Set, Version 2.0
Summary:
This paper presents the SimpleScalar Tool Set, which is a set of
architecture simulation tools. SimpleScalar simulates a MIPS-like
(actually, a superset of the MIPS-IV instruction set architecture)
architecture at the software level. It provides five different
simulators that focus on different aspects of the architecture. While
sim-fast is a functional simulator providing quick results with few
statistics (and without timing information), sim-outorder is a
detailed, low-level simulator that simulates the microarchitecture
cycle-by-cycle. The paper gives an overview of the entire tool set,
how it is structured, and how it can be used.
Title: Wattch: A Framework for Architectural-Level Power Analysis and Optimizations
Summary:
This paper presents a system (Wattch) that adapts the
SimpleScalar simulator for power analysis. Power analysis is done at the
architectural level, and the simulation is built on top of the
sim-outorder simulator (of the SimpleScalar tool set). The paper
describes the various power optimizations that can be done at the
architectural level, and accounts for these optimizations in the
simulation. The simulation itself provides a platform to analyze
different configurations, optimizations and strategies to save power. It
describes how these simulation results can be used by computer
architects and compiler writers to deliver architectures that are
power-efficient as well as performance-driven.
Title: The Filter Cache: An Energy Efficient Memory Structure
Summary:
The motivation for this paper is based on the high power consumption of
microprocessor caches which often occupy a significant area of the chip.
They propose to trade performance for power consumption with the addition
of a very small "filter" cache located between the CPU and the L1 cache.
This cache exploits the high locality of reference of embedded software
that tends to execute small blocks of code very frequently eliminating the
need for a large cache that will only partially be used at a given time.
This filter cache will consume less power and access time will be faster
than that of an L1 cache because of the smaller size. However, insertion
of the filter cache will increase the wait time on a cache miss and will
degrade performance. They show that the decrease in energy consumption
outweighs the decrease in performance and yields good savings.
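The trade-off described above can be put into a back-of-the-envelope model: every fetch probes the small filter cache, and only the misses go on to the L1 cache and pay an extra delay. The Python sketch below assumes that simple model. The function name, parameters and numbers are hypothetical and are not taken from the paper.

```python
from typing import Tuple


def filter_cache_tradeoff(
    miss_rate: float,
    e_filter: float,
    e_l1: float,
    miss_penalty_cycles: float = 1.0,
) -> Tuple[float, float]:
    """Return (energy ratio, cycle ratio) relative to a plain L1 cache."""
    # Every fetch probes the filter cache; only misses also access the L1.
    energy_with_filter = e_filter + miss_rate * e_l1
    # A hit takes one cycle; a miss adds the L1 access on top of the probe.
    cycles_with_filter = 1.0 + miss_rate * miss_penalty_cycles
    # Baseline: every fetch costs one L1 access and one cycle.
    return energy_with_filter / e_l1, cycles_with_filter / 1.0


if __name__ == "__main__":
    # Hypothetical values: a filter-cache access costs a quarter of an L1
    # access, and one fetch in four misses the filter cache.
    print(filter_cache_tradeoff(miss_rate=0.25, e_filter=1.0, e_l1=4.0))
```

With these placeholder values the fetch energy is halved while the fetch time grows by a quarter, which is the kind of exchange the summary describes.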
Title: Energy and Performance Improvements in Microprocessor
Design using a Loop Cache
Summary:
This paper takes the premise of a filter cache and improves upon it with
the help of a specially designed compiler that tries to keep the most
frequently executed instructions in the small cache. The goal of this
design is to reduce power consumption while not decreasing the
performance. They use a small L-cache between the L1 cache and the CPU
to hold blocks of instructions. A trace run is done on the code to
determine instruction frequencies. Then a compiler is used to analyze the
code and determine what blocks should be placed into the L-cache. These
blocks are marked and are the only ones that will be brought into the
L-cache. This reduces cache misses because infrequent instructions are
not using up space in the L-cache and the most frequently accessed
instructions are stored. They show that in some cases, performance can
be improved.
Title: Instruction Fetch Energy Reduction Using Loop Caches
For Embedded Applications with Small Tight Loops
Summary:
The motivation for this paper is to exploit small loops in program
execution. A very small cache is placed on chip close to the CPU to store
currently executing instructions. A controller is used to know where the
next instruction will be serviced from, eliminating any performance
degradation. Because of the small size of the loop cache and its close
proximity to the CPU, power consumption for accesses to it is much less
than for accesses to the L1 cache. They show that this loop cache can
significantly reduce accesses to the main cache, which would in turn reduce
power consumption.
4/29/01 Summaries
An Energy Conscious Methodology for Early Design Exploration of Heterogeneous
DSPs. M. Wan, Y. Ichikawa, D. Lidsky and J. Rabaey.
Summary: The writers of this paper present a methodology for hardware and
software partitioning that is geared towards conserving energy and emphasizes
bottleneck detection at a high level of abstraction. This methodology also
tries to give an early prediction of the effects of different architecture
choices. They start with a basic architecture which was proposed in the
Pleiades project and the DSP algorithm (preferably in a subset of C/C++). They
then determine the computational kernels within the given algorithm. With
these kernels they determine the cost (power, delay, area) of that function
and use what they call a "macromodel." Given these macromodels the designer
can make algorithm changes and architecture decisions. The paper presents an
example of a voice processor to which they applied this methodology, and the
results from it.
A Low Power Hardware/Software Partitioning Approach for Core-based Embedded
Systems. J. Henkel, DAC99.
Summary: This paper concentrates on partitioning software and hardware in a
manner that maximizes "utilization." The term "utilization" is used to
describe the percentage of gate transitions that are necessary for a given
calculation. Also, Dr. Henkel points out that the methodology introduced in
this paper focuses on reducing power of the whole system and does not focus
primarily on any particular sub-system while ignoring its effects on other
sub-systems. The paper contains several rather complex algorithms and ends
with some results obtained from experimentation.
Energy-Conscious HW/SW Partitioning of Embedded Systems: A Case Study on an
MPEG-2 Encoder. J. Henkel and Yanbin Li, CODES98.
Summary: This paper (which pre-dates Dr. Henkel's paper "A Low Power
Hardware/Software Partitioning Approach for Core-based Embedded Systems") is a
case study of the hardware/software partitioning for an MPEG-2 Encoder. They
introduce a way to estimate and optimize power dissipation. The result is a
substantial energy savings with little or no performance degradation.
5/1/01 Summaries
Title: Adaptive Address Bus Coding for Low Power Deep Sub-Micron Designs
Summary: This paper presents new address bus coding methods that take
coupling capacitance into consideration as well as base capacitance,
according to a physical bus model for power consumption. The authors assign
the most active bit lines to those bit lines expected to possess the smallest
capacitance. The value of a new measure (ETAM) derived from the model then
decides whether to invert the bus or not. Results show power/energy savings
of up to 56% compared with Gray code encoding.
Title: Synthesis of Low-Overhead Interfaces for Power-Efficient Communication
over Wide Buses.
Summary: This paper presents an encoding algorithm that reduces the
transition activity on system address buses when statistical information
about the data is known. The authors give two approximations of the exact
algorithm as well as an adaptive architecture that needs no a-priori
knowledge. Results demonstrate that this approach performs better than
earlier low-power encoding schemes.
Title: Address Bus Encoding Techniques for System-Level Power Optimization.
Summary: Based on the advantages of several previous bus encoding schemes
(Bus Invert, T0, etc.), the authors combine these ideas in order to optimize
the system's power budget. For architectures with different kinds of address
buses, various combinations of schemes are developed. Experiments on a real
system show the effectiveness of such encoding.
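To make one of the named schemes concrete, the sketch below illustrates T0 coding as it is usually described: an extra INC line is raised whenever the new address equals the previous address plus a fixed stride, and in that case the address lines are left frozen so they do not switch. This is a simplified Python illustration under that assumption, with hypothetical function names, and not the combined schemes proposed in the paper.

```python
from typing import List, Optional, Tuple


def t0_encode(addresses: List[int], stride: int = 1) -> List[Tuple[int, int]]:
    """Encode an address stream as (bus value, INC line) pairs."""
    encoded = []
    prev_addr: Optional[int] = None
    bus = 0
    for addr in addresses:
        if prev_addr is not None and addr == prev_addr + stride:
            inc = 1          # sequential access: keep the bus frozen
        else:
            inc = 0          # non-sequential access: drive the real address
            bus = addr
        encoded.append((bus, inc))
        prev_addr = addr
    return encoded


def t0_decode(encoded: List[Tuple[int, int]], stride: int = 1) -> List[int]:
    """Recover the address stream at the receiving end."""
    addresses: List[int] = []
    for bus, inc in encoded:
        # The encoder never raises INC on the first transfer, so a
        # previous address always exists when inc is 1.
        addr = addresses[-1] + stride if inc else bus
        addresses.append(addr)
    return addresses


if __name__ == "__main__":
    stream = [100, 101, 102, 200, 201]
    print(t0_encode(stream))
```

For a long run of sequential addresses the address lines do not toggle at all, which is why the scheme suits instruction address buses.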
5/3/01 Summaries
Title: Architectural Power Optimization by Bus Splitting
Summary: By placing tri-state buffers strategically along the internal
communication bus in an SOC with multiple modules, power can be saved
by decreasing the effective load of the bus. As there were no existing
benchmarks for testing SOCs, a numerical result was given. The problem was
shown to be NP-hard, so partitioning the problem into clusters
is necessary for larger problem sets. Overall, bus splitting has the
potential to decrease power and to increase performance by allowing multiple
processes to occur concurrently, although the latter was not discussed in
the paper.
Title: Communication Architecture Tuners: A Methodology for the Design of
High Performance Communication Architectures for System-on-Chips
Summary: An award-winning paper that introduces the new idea of varying the
communication protocol parameters to meet the changing demands of real-time
SOCs through additional control circuitry called a CAT (Communication
Architecture Tuner). Several real-world problems of this type were shown
to have fewer missed deadlines, by varying factors, and the number of
missed deadlines even dropped to zero in the TCP/IP example.
Title: Bus-Invert Coding for Low-Power I/O
Summary: A widely cited paper that covers the problem of
minimizing the I/O bus switching activity for random data accesses. An
extra control line is proposed that signals when the bus value is sent
inverted, which is done whenever the Hamming distance between the current
bus value and the next value exceeds half the bus width. This is shown to
be the best possible solution for random data.
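Bus-invert coding is short enough to sketch directly. The Python version below is a simplified illustration: it compares only the data lines when deciding whether to invert, assumes the bus starts at all zeros, and uses hypothetical function names. If more than half of the data lines would toggle, the inverted word is sent and the extra invert line is raised.

```python
from typing import List, Tuple


def popcount(value: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def bus_invert_encode(words: List[int], width: int = 8) -> List[Tuple[int, int]]:
    """Encode words as (bus value, invert line) pairs."""
    mask = (1 << width) - 1
    prev_bus = 0  # assumption: all bus lines start at 0
    encoded = []
    for word in words:
        word &= mask
        if popcount(prev_bus ^ word) > width // 2:
            bus, inv = ~word & mask, 1   # fewer toggles if sent inverted
        else:
            bus, inv = word, 0
        encoded.append((bus, inv))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded: List[Tuple[int, int]], width: int = 8) -> List[int]:
    """Undo the inversion at the receiver."""
    mask = (1 << width) - 1
    return [bus ^ mask if inv else bus for bus, inv in encoded]


def raw_transitions(words: List[int], width: int = 8) -> int:
    """Bit toggles on an uncoded bus that starts at all zeros."""
    mask = (1 << width) - 1
    total, prev = 0, 0
    for word in words:
        total += popcount((prev ^ word) & mask)
        prev = word & mask
    return total


def encoded_transitions(encoded: List[Tuple[int, int]]) -> int:
    """Bit toggles on the coded bus, counting the invert line as well."""
    total, prev_bus, prev_inv = 0, 0, 0
    for bus, inv in encoded:
        total += popcount(prev_bus ^ bus) + (prev_inv ^ inv)
        prev_bus, prev_inv = bus, inv
    return total


if __name__ == "__main__":
    data = [0x00, 0xFF, 0x00]
    coded = bus_invert_encode(data)
    print(raw_transitions(data), encoded_transitions(coded))
```

The worst case for an uncoded bus, alternating all-zeros and all-ones words, becomes nearly free once coded, because only the invert line toggles.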
5/8/01 Summaries
Title: Selective Instruction Compression for Memory Energy Reduction in Embedded
Systems.
Summary: This paper proposes a Selective Instruction Compression method for
memory energy reduction. The main idea is based on the fact that a given
embedded program normally uses only a small subset of the instructions. Those
instructions are picked out to fit in an IDT table close to the processor, and
only the compressed codes are saved in the memory. The decompression is
performed on the fly between processor and memory. As a result, no changes to
the processor architecture are required. They provide four possible
architectures for the decompression design. Simulation results show that all
of them achieve significant power reduction, and some even improve the
execution performance. This is new, since previous work always assumed a
performance degradation.
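The core idea can be illustrated with a plain dictionary scheme: the most frequently used instruction words go into a small table, and memory holds a short table index wherever the original word is in the table. The Python sketch below is a simplified illustration under that assumption. The function names and the tagged-tuple encoding are hypothetical, not the paper's actual formats or decompression architectures.

```python
from collections import Counter
from typing import List, Tuple


def build_table(program: List[int], table_size: int) -> List[int]:
    """Pick the most frequently occurring instruction words."""
    return [word for word, _ in Counter(program).most_common(table_size)]


def compress(program: List[int], table: List[int]) -> List[Tuple[str, int]]:
    """Replace table entries by their index; leave other words uncompressed."""
    index = {word: i for i, word in enumerate(table)}
    return [("idx", index[w]) if w in index else ("raw", w) for w in program]


def decompress(stream: List[Tuple[str, int]], table: List[int]) -> List[int]:
    """Expand on the fly, as a unit between memory and processor would."""
    return [table[value] if kind == "idx" else value for kind, value in stream]


if __name__ == "__main__":
    program = [0xA, 0xB, 0xA, 0xC, 0xA, 0xB]   # stand-ins for instruction words
    table = build_table(program, table_size=2)
    packed = compress(program, table)
    print(table, packed)
```

Because the processor only ever sees the expanded words, it needs no modification, which matches the claim in the summary.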
Title: A Power Reduction Technique with Object Code Merging for
Application Specific Embedded Processors.
Summary: The motivation for this paper is to exploit the frequently executed
blocks in programs. By merging those sequences of object code into a set of
single instructions, a significant energy reduction can be achieved. The
merged object code is stored in a decompressor ROM and decoded before
running. The authors give an algorithm to identify the basic blocks and
optimize the energy cost. They also show that further power optimization can
be achieved by compressing sequential blocks into one single instruction.
Title: Code Compression for Low Power Embedded System Design
Summary: In this paper the authors propose instruction code compression as a
system-level power optimization method. Unlike previous work, they optimize
the power consumption of a complete SOC. Sample instructions are divided into
four groups, each with its own compression method. They also discuss the
bus compaction strategy and decompressor architecture. The comprehensive
experiments suggest that energy/power savings can be achieved at the same or
even higher performance.
5/10/01 Summaries
Title: Low Power Unified Cache Architecture Providing Power and Performance
Flexibility
Summary: The focus of this paper is varying the cache subsystem to maximize
the performance and power savings for a given application. The cache
subsystem parameters explored were write policy, way size, store buffers,
and push buffers. The paper shows that the optimal parameters differ
depending on the program to be executed. To test the effects of
varying the cache subsystem, the Powerstone benchmark suite was used. This
suite consists of embedded and portable applications. Simulations were
run at the gate level as well as on silicon.
Title: A Portable fault-tolerant microprocessor based on the SPARC V8
architecture
Summary: This paper is basically an introduction to LEON. It explains the
reasons why they developed LEON, the goals for LEON, and the design
decisions made. The European Space Agency needed a processor with better
performance and lower cost for future space applications. The processors
on the market which met some of their metrics posed problems. These
processors did not provide the desired flexibility or availability as a
component, they had restrictions on usage, etc. For each of the design
decisions made, portability and flexibility were the main focus. The
paper also discusses the error detection and fault tolerance methods which
could be employed.
5/15/01 Summaries
Custom-Fit Processors: Letting Applications Define Architectures
This paper presents a system which automatically designs realistic
VLIW architectures highly optimized for one given application. The
author defines a cost function with six parameters: the total number
of ALUs, multipliers, and registers, the number of parallel accesses to
L2 memory, the latency, and the number of clusters. Speedups for different
combinations of these parameters are shown for different benchmarks. Large
speedups can be achieved on color and image processing codes. Speedups
for jammed benchmarks are also studied. It is shown that the average speedup
is greater for a large partition of one algorithm willing to back off.
Customized Instruction-Sets for Embedded Processors.
This paper discusses five major barriers that could hinder customization:
existing binaries, toolchain development and maintenance costs, lost
savings/higher chip cost due to the lower volumes of customized processors,
added hardware development costs, and some factors related to the product
development cycle for embedded products. Potential solutions
to each barrier are also presented.
An ASIP Design Methodology for Embedded Systems
This paper presents a unique architecture and methodology to design
Application-Specific Instruction-Set Processors (ASIPs) in the embedded
controller domain by customizing an existing processor instruction set
and architecture. The authors show the design flow, identification of
new instructions, processor architecture, ASIP implementation and
firmware modification. Two examples are used. With customized coding,
the reduction in cycle counts can be up to 75% and 71%, respectively.
5/17/01 Summaries
Title: Media Architecture: General Purpose vs. Multiple Application-Specific
Programmable Processor
Summary: This paper presents a framework that uses
production-quality ILP compilers and simulation tools to synthesize a
high performance machine for an application. This makes it possible for
a designer to explore the application-specific programmable processor design
space under area constraints. The author briefly introduces the machine
model, the benchmarks, the tools and an example set of results, the approach
taken in this work, the search problem, and the search strategy and
algorithm. Experimental results show that, for a given compiler technology
and set of benchmarks, machines with greater area do not always achieve
greater speedup.
Title: Automatic architectural synthesis of VLIW and EPIC processors
Summary: This paper presents a synthesis system, PICO_VLIW,
which automatically designs the architecture and micro-architecture of VLIW
processors and their generalization, EPIC. The author decomposes the process
of designing an application-specific VLIW into 3 inter-related subsystems:
Spacewalker, VLIW architecture synthesis, and Elcor. The author introduces the
PICO_VLIW design flow in detail, showing the various design steps involved
in the design flow sequence as well as the dependence relationships among
these steps. Experimental results show that machines with
application-specific function units have a better cost/performance
ratio than general-purpose machines.
5/22/01 Summaries
Title: System and architecture-level power reduction of microprocessor-based
communication and multi-media applications.
Summary: The authors of this paper identify and describe some problems
related to memory access, which they call data access bottlenecks. They
identify the need for novel solutions to deal with memory access and data
transfer problems. The novel solutions described in this paper include:
* Processor architecture optimizations
* Improved compiler technology
* Exploring the interface between the system hardware and software.
They describe each of these points in detail.
Title: Xtensa: A configurable and extensible processor.
Summary: Xtensa is a fully customizable processor core. Xtensa lets the system
designer select and size only the features required for a given
application. Customers use Xtensa's interface to describe a design and
they will get a processor core and the tools to go with it.
5/29/01 Summaries
Title: Memory aware compilation through accurate timing extraction.
Summary: Memory delay is a major bottleneck in embedded systems.
Newer memory modules offer efficient access modes. This paper
suggests a memory-aware compiler approach that exploits
such efficient access modes by extracting accurate timing information.
This allows the compiler to perform more global optimizations
on the input. Their test cases show a 24% improvement
(on average) over conventional methods.
Title: Global Multimedia System Design Exploration using Accurate
Memory Organization Feedback
Summary: This paper outlines an approach with which different memory
organization design alternatives can be tested using the
feedback given by the tools that they had developed earlier.
The effectiveness of their approach is tested using an
industrial application. Using this approach, they could explore a
substantial part of the design search space in a short design time,
resulting in a very cost-efficient solution that conforms to the design
constraints.
5/31/01 Summaries
Title: Power: A First-Class Architectural Design Constraint.
Summary:
Power is a design constraint not only for portable computers and mobile communication devices but also for high-end systems. In this paper, based on the power model for CMOS logic, techniques and ideas to reduce power consumption at logic, architecture and operating systems levels are discussed in detail. To continue reducing power consumption, future challenges facing logic designers, architects and systems builders are discussed.
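The CMOS power model referred to above is commonly written, to first order, as P = a * C * V^2 * f: switching activity times switched capacitance times supply voltage squared times clock frequency. That form is assumed here rather than quoted from the paper, and leakage is ignored. A small Python sketch with hypothetical values shows why lowering the supply voltage is so effective:

```python
def dynamic_power(activity: float, capacitance_f: float,
                  vdd: float, freq_hz: float) -> float:
    """First-order CMOS switching power in watts: a * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd ** 2 * freq_hz


if __name__ == "__main__":
    # Hypothetical operating points; only the supply voltage differs.
    high = dynamic_power(activity=0.5, capacitance_f=1e-9, vdd=2.0, freq_hz=1e8)
    low = dynamic_power(activity=0.5, capacitance_f=1e-9, vdd=1.0, freq_hz=1e8)
    print(high, low, low / high)
```

Halving the voltage at a fixed frequency cuts switching power to a quarter. In practice a lower voltage also limits the achievable frequency, which is part of why the paper treats power as a design constraint across several levels.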
6/5/01 Summaries
Title: AMULET3: a 100 MIPS Asynchronous Embedded Processor.
Summary: AMULET3 is an asynchronous (clockless) implementation of the
32-bit ARM processor core. AMULET3 shows that asynchronous
technology is commercially viable, and it is competitive in
terms of performance, area and power-efficiency, compared
with clocked designs. In this paper, they discuss how AMULET3
is implemented commercially in the DRACO chip. The paper
discusses asynchronous pipelining techniques, power management,
performance, design flow, controller synthesis, high-level
synthesis, timing verification, and a production test strategy.
Title: Power Management in the AMULET Microprocessors.
Summary: Power management techniques for the Amulet microprocessor
are discussed in this paper. Some conventional methods are used
as well as methods that exploit the asynchronous design. Much
of the discussion focuses on the cache and branch predictor,
but the most interesting aspect of this processor is
that there is no activity unless useful work is carried out.
Power is only consumed when needed. Although Amulet
processors are not the fastest, they are very good at doing
nothing.