UCR CS269: Hardware/Software Engineering of Embedded Systems

Announcement: Group meetings are held only on Tuesdays in Room 315; individual meetings are held on Thursdays in the office.  Email me to schedule.

Course Information   Class Readings and Presentations   Individual Projects  

CS269 deals with the exciting and rapidly-growing field of embedded computing systems.  The course will present state-of-the-art software and hardware design techniques for embedded computing systems.  Topics include specification models, languages, simulation, partitioning algorithms, estimation methods, model refinement, and design methodology.  

Course information

Instructor Harry Hsieh, (harry@cs.ucr.edu), EBU2 Room 339

Office hours: Tue/Thu 11AM-noon, or by appointment

Class meeting TR 12:30PM-2PM;  EBU2 Room 315
Textbooks None -- all readings will be available online
Prerequisite CS/EE120A(Digital systems) and consent of instructor
Call # and units 18530, 4 units.
Grade Individual Project 45%, Class Presentation & Participation  45%, Attendance 10%

Projects should be done individually.  You are encouraged to propose your own project, possibly related to your Ph.D. or M.S. thesis work or even B.S. senior project work.  The idea is that it may become a submission to a workshop and/or a chapter in your thesis.  The content should fall within the broad confines of the course and have depth comparable to the other projects.  A list of potential projects will also be available at the beginning of the class.

You are expected to read assigned readings before the date they are presented, to attend classes, and to actively participate in discussions.  You are also expected to present several papers during the quarter, which you should study thoroughly, doing outside research as necessary.  The presentation slides are provided for you, so your main concern is to understand the material and to be able to talk about it convincingly.  You are of course welcome to make additional slides if you find the authors' slides insufficient.

Attendance will be taken throughout.  Unexcused absences will be penalized.

 

Class Readings and Presentation

You are expected to read assigned readings before the date they are presented, to attend classes, and to actively participate in discussions.  You are also expected to present several papers during the quarter, which you should study thoroughly, doing outside research as necessary.  Questions will be posed in class about the material, and you are expected to be able to answer them.  Presentation slides are provided, and you need to understand them thoroughly.  You are free to contact the authors with any questions you might have.  You must properly acknowledge your sources so that there is no question of plagiarism.

Date Presenter Topic and assigned reading Presentation
Tu 4/4

Course Introduction and Logistics (EBU2 #315)
pdf
Th 4/6
Individual project meeting (EBU2 #339)


Tu 4/11
Scott Sirowy




Eric Cheung


Locality-Conscious Workload Assignment for Array-Based Computations in MPSOC Architectures
Feihui Li, Mahmut Kandemir - Pennsylvania State Univ.

An Integrated Hardware Software Approach for Run-Time Scratchpad-Management
Francesco Poletti - Univ. di Bologna; Paul Marchal - IMEC; David Atienza - DACYA/UCM; Luca Benini - Univ. di Bologna; Francky Catthoor - IMEC; Jose Mendias - DACYA/UCM

finalized project assignment

Th 4/13

Individual project meeting (EBU2 #339)



Tu 4/18
Vi Pham

Malcolm Mumme
paper

paper


Th 4/20

Individual project meeting (EBU2 #339)



Tu 4/25
Scott Sirowy

Eric Cheung
Project introduction presentation (25 minutes per person/project.  Presentation should include background, plan of attack, expected result, and progress so far.)

Project introduction presentation (25 minutes per person/project.  Presentation should include background, plan of attack, expected result, and progress so far.)

Th 4/27

Individual project meeting (EBU2 #339)

 


Tu 5/2
Vi Pham

Malcolm Mumme



Project introduction presentation (25 minutes per person/project.  Presentation should include background, plan of attack, expected result, and progress so far.)

Project introduction presentation (25 minutes per person/project.  Presentation should include background, plan of attack, expected result, and progress so far.)

1st draft of project report (~2 pages), background, expected result, plan, progress

Th 5/4

Individual project meeting (EBU2 #339)

Tu 5/9
Scott Sirowy

Eric Cheung
paper

paper
Th 5/11

Individual project meeting (EBU2 #339)

Tu 5/16
Vi Pham

Malcolm Mumme
paper

paper


2nd draft of project report (~4 pages), background, expected result, plan, results so far

Th 5/18
 
Individual project meeting (EBU2 #339)

Tu 5/23
Scott Sirowy

Eric Cheung
paper

paper
Th 5/25

Individual project meeting (EBU2 #339)

Tu 5/30
Vi Pham

Malcolm Mumme
paper

paper
Th 6/1

Individual project meeting (EBU2 #339)

Tu 6/6
Scott Sirowy

Eric Cheung
Final Project Presentation 25 minutes each person.  Presentation should include background, motivation, procedures, results, and future work.

Final Project Presentation 25 minutes each person.  Presentation should include background, motivation, procedures, results, and future work.

Th 6/8
Vi Pham

Malcolm Mumme

Final Project Presentation 25 minutes each person.  Presentation should include background, motivation, procedures, results, and future work.

Final Project Presentation 25 minutes each person.  Presentation should include background, motivation, procedures, results, and future work.

 

List of papers:

You may choose any other papers from DATE 2006, 2005, 2004, DAC 2005, 2004, and ICCAD 2005, 2004, but the selection must be pre-approved and you are responsible for finding the slides.


From Design Automation Conference 2005:

Locality-Conscious Workload Assignment for Array-Based Computations in MPSOC Architectures (best paper candidate)
Feihui Li, Mahmut Kandemir - Pennsylvania State Univ., University Park, PA
presenter: Scott Sirowy
pdf ppt

Memory Access Optimization Through Combined Code Scheduling, Memory Allocation, and Array Binding in Embedded System Design
Jungeun Kim - KAIST, Daejeon, South Korea; Taewhan Kim - Seoul National Univ., Seoul, South Korea
pdf ppt

Dynamic Slack Reclamation with Procrastination Scheduling in Real-Time Embedded Systems
Ravindra R. Jejurikar - Univ. of California, Irvine, CA; Rajesh Gupta - Univ. of California at San Diego, La Jolla, CA
pdf ppt

Approximate VCCs: A New Characterization of Multimedia Workloads for System-level MpSoC Design (best paper candidate)
Yanhong Liu, Samarjit Chakraborty, Wei Tsang Ooi - National Univ. of Singapore, Singapore
pdf ppt
 
Simulation Based Deadlock Analysis for System Level Designs
Xi Chen, Harry Hsieh - Univ. of California, Riverside, CA; Abhijit Davare, Alberto Sangiovanni-Vincentelli - Univ. of California, Berkeley, CA; Yosinori Watanabe - Cadence Berkeley Labs, Berkeley, CA
pdf ppt

Fine-grained Application Source Code Profiling for ASIP Design
Kingshuk Karuri, Mohammad Al Faruque, Stefan Kraemer, Rainer Leupers, Gerd Ascheid, Heinrich Meyr - RWTH Aachen Univ., Aachen, Germany
pdf ppt
       
Physically-aware HW-SW Partitioning for reconfigurable architectures with partial dynamic reconfiguration
Sudarshan Banerjee, Elaheh Bozorgzadeh, Nikil Dutt - Univ. of California, Irvine, CA
pdf ppt
       
Cache Coherence Support for Non-shared Bus Architecture on Heterogeneous MPSoCs
Taeweon Suh, Hsien-Hsin S. Lee - Georgia Institute of Tech., Atlanta, GA; Daehyun Kim - Intel Corp., Santa Clara, CA
pdf ppt
       
A Low-Latency Router Supporting Adaptivity for On-Chip Interconnects
Jongman Kim, Dongkook Park, Theocharis G. Theocharides, N. Vijaykrishnan, Chita R. Das - Pennsylvania State Univ., University Park, PA
pdf ppt
       
Floorplan-aware Automated Synthesis of Bus-based Communication Architectures (best paper candidate)
Sudeep Pasricha, Nikil Dutt, Elaheh Bozorgzadeh - Univ. of California, Irvine, CA; Mohamed Ben-Romdhane - Conexant Systems Inc., Newport Beach, CA
pdf ppt
       
From Design Automation Conference 2004

Memory Access Scheduling and Binding Considering Energy Minimization in Multi-Bank Memory Systems
Chun-Gi Lyuh - ETRI, Daejeon, Republic of Korea; Taewhan Kim - Seoul National Univ., Seoul, Republic of Korea
pdf ppt

Profile-Based Optimal Intra-Task Voltage Scheduling for Hard Real-Time Applications
Jaewon Seo - KAIST, Daejeon, Republic of Korea; Taewhan Kim - Seoul National Univ., Seoul, Republic of Korea; Ki-Seok Chung - Hanyang Univ., Seoul, Republic of Korea
pdf ppt
       
Extending the Transaction Level Modeling Approach for Fast Communication Architecture Exploration
Sudeep Pasricha, Nikil Dutt - Univ. of California, Irvine, CA; Mohamed Ben-Romdhane - Conexant Systems, Inc., Newport Beach, CA
pdf ppt
 
Specific Scheduling Support to Minimize the Reconfiguration Overhead of Dynamically Reconfigurable Hardware
Javier Resano - Univ. Complutense de Madrid, Madrid, Spain; Diederik Verkest - IMEC, Leuven, Belgium; Daniel Mozos - Univ. Complutense de Madrid, Madrid, Spain; Francky Catthoor, Serge Vernalde - IMEC, Leuven, Belgium
pdf ppt
       
An Integrated Hardware Software Approach for Run-Time Scratchpad-Management
Francesco Poletti - Univ. di Bologna, Bologna, Italy; Paul Marchal - IMEC, Leuven, Belgium; David Atienza - DACYA/UCM, Madrid, Spain; Luca Benini - Univ. di Bologna, Bologna, Italy; Francky Catthoor - IMEC, Leuven, Belgium; Jose Mendias - DACYA/UCM, Madrid, Spain
presenter: Eric Cheung
pdf ppt
      
An Efficient Scalable and Flexible Data Transfer Architecture for Multiprocessor SoC with Massive Distributed Memory
Sang Il Han - Seoul National Univ., Seoul, Republic of Korea; Amer Baghdadi - ENST Bretagne, Brest, France; Marius Petru Bonaciu - TIMA Lab., Grenoble, France; Soo Ik Chae - Seoul National Univ., Seoul, Republic of Korea; Ahmed Amine Jerraya - TIMA Lab., Grenoble, France
pdf ppt

Retargetable Profiling for Rapid, Early System Level Design Space Exploration
Lukai Cai, Andreas Gerstlauer, Daniel Gajski - Univ. of California, Irvine, CA
pdf ppt
       
High Level Cache Simulation for Heterogeneous Multiprocessors
Joshua J. Pieper - Carnegie Mellon Univ., Pittsburgh, PA; Alain Mellan - STMicroelectronics, San Diego, CA; Joann M. Paul, Donald E. Thomas - Carnegie Mellon Univ., Pittsburgh, PA; Faraydon Karim - STMicroelectronics, San Diego, CA
pdf slide
       
Automatic Translation of Software Binaries onto FPGAs
Gaurav Mittal, David C. Zaretsky, Xiaoyong Tang, Prithviraj Banerjee - Northwestern Univ., Evanston, IL
pdf ppt

Data Compression for Improving SPM Behavior
Ozcan Ozturk, Mahmut Kandemir - Penn State Univ., University Park, PA; Ilteris Demirkiran - Syracuse Univ., Syracuse, NY; Guangyu Chen, Mary Jane Irwin - Penn State Univ., University Park, PA
pdf ppt

A Novel Approach for Flexible and Consistent ADL-Driven ASIP Design
Gunnar Braun - CoWare, Inc., Aachen, Germany; Weihua Sheng - Institute for Integrated Systems, Aachen, Germany; Achim Nohl - CoWare, Inc., Aachen, Germany; Jianjiang Ceng, Manuel Hohenauer, Hanno Scharwaechter, Rainer Leupers, Heinrich Meyr - Institute for Integrated Systems, Aachen, Germany
pdf ppt

Dynamic FPGA Routing for Just-in-Time FPGA Compilation
Roman Lysecky, Frank Vahid, Sheldon X.-D. Tan - Univ. of California, Riverside, CA
pdf ppt

Individual Projects

Projects should be done individually.  You are encouraged to propose your own project, possibly related to your Ph.D. or M.S. thesis work or even B.S. senior project work.  The idea is that it may, possibly with one more quarter of independent study work, become a submission to a workshop and/or a chapter in your thesis.  The content should fall within the broad confines of the course and have depth comparable to the other projects.  A list of potential projects will also be available at the beginning of the class.  Project requests are handled on a first-come, first-served basis.  You are encouraged to send me e-mail frequently about your progress or any questions you may have throughout the quarter.

Project Deadlines are: 

4/11 Tuesday
Deadline for project request.  Finalized project assignment.
4/18,20,25 Tue, Thu, Tue
Project Introduction Presentation (25 minutes each person)
5/2 Tuesday
1st draft of project report (~2 pages), background, expected result, plan, progress
5/16 Tuesday
2nd draft of project report (~4 pages), background, expected result, plan, progress
6/6,8 Tue, Thu
Final Project Presentation (25 minutes each person)
6/16 Friday
Final Project Report (~6 pages "conference format": 2 column, 11pt, single space)

 

1) Interactive RTL-Gate Mapping Rule Wizard

2) Effective and Efficient RTL-Gate Mapping Engine

3) Combinational RTL-Gate Correlation Engine

4) Automatic RTL-Gate Incremental Synthesis for Engineering Change Order

5) Formal and Semi-Formal Technique for RTL-Gate Correlation

6) A practical C++-to-VHDL Translator

7) JPEG decoder energy analysis

8) MPSoC Energy Analysis

9) Simulation Deadlock Analysis for SystemC Designs

10) TBA

Project Descriptions

1) Interactive RTL-gate mapping rule wizard

Goal:

Correlating objects from different levels of abstraction of a design has become critical in design verification and debugging. Logic synthesis tools take a design description at the Register Transfer Level (RTL) and synthesize it down to a gate-level description through structural transformations, while preserving some name patterns, especially for sequential objects.  Since not all names are preserved, one technique for correlating an RTL description to a gate-level description uses heuristic renaming rules based on prefix and suffix string matching.  This mapping is far from perfect, and user-specified renaming rules can be used to improve its accuracy.  This project aims to design a wizard tool that assists the user in analyzing the mapping and designing a set of specific renaming rules to improve the mapping when correlating large designs.

 

Procedure:

1. Study and understand the logic synthesis process and its common transformations (e.g. Synopsys Design Compiler).

2. Study and understand current gate-to-RTL correlation techniques, including heuristic prefix/suffix string matching, longest-common-subsequence (LCS) algorithms, Levenshtein distance, and other “spell checking” algorithms such as Wagner-Fischer.

3. Study and understand user-specified renaming rules for gate-to-RTL correlation, focusing on regular expressions, hierarchy changes, and their combinations.  Study their effects on the overall correlation results.

4. Compare (possibly with diff or xdiff) the objects from both levels of abstraction that are not yet mapped, starting from a possibly non-empty list of objects that have already been mapped. For large designs, it is only possible to analyze and display a subset of objects at a time (possibly a different subset each time); investigate how such subsets should be chosen.

5. In an interactive setting, the user adds or changes renaming rules incrementally.  Analyze and present the effect of each change.  Specifying one rule at a time can be very tedious, so design an engine in which multiple rules can be specified and represented effectively.

6. (Bonus) Automatically suggest additional rules to improve the results by analyzing the objects and current rules as they are specified (possibly using machine learning techniques).

Expected Product:

Propose a practical usage flow for the interactive rule mapping wizard, including metrics for global mapping and local mapping, display of mapped and unmapped objects, and “macros” for ease of rule specification.
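
Step 2 of the procedure names Levenshtein distance and the Wagner-Fischer algorithm. As a minimal sketch (not the correlation engine used in the course), the dynamic-programming recurrence can be written as follows. The helper `closest` and the signal names in the usage note are hypothetical and only illustrate how an RTL name might be matched against renamed gate-level names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via Wagner-Fischer dynamic programming (two rows kept)."""
    prev = list(range(len(b) + 1))          # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest(rtl_name, gate_names):
    """Hypothetical helper: gate-level name with the smallest edit distance."""
    return min(gate_names, key=lambda g: levenshtein(rtl_name, g))
```

For example, `closest("count", ["count_reg", "state_reg"])` picks `"count_reg"`. The cost is proportional to the product of the two name lengths per pair, so on large designs it would presumably be applied only to the objects that cheaper prefix/suffix rules leave unmapped.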


2) Effective and efficient RTL-gate mapping engine

Goal:

Correlating objects from different levels of abstraction of a design has become critical in design verification and debugging. Logic synthesis tools take a design description at the Register Transfer Level (RTL) and synthesize it down to a gate-level description through structural transformations, while preserving some name patterns, especially for sequential objects.  Since not all names are preserved, one technique for correlating an RTL description to a gate-level description uses heuristic renaming rules based on prefix and suffix string matching.  This project aims to implement an effective mapping engine to perform such a mapping for large designs.  It has to be efficient, since realistic designs contain an astronomical number of design objects. Heuristics of very low computational complexity are crucial.

 

Procedure:

1. Study and understand the logic synthesis process and its common transformations (e.g. Synopsys Design Compiler).

2. Study and understand current gate-to-RTL correlation techniques, including heuristic prefix/suffix string matching, longest-common-subsequence (LCS) algorithms, Levenshtein distance, and other “spell checking” algorithms such as Wagner-Fischer.  Understand regular expressions and how hierarchy naturally arises in a design.  Understand the correlation metrics.

3. Starting with a given correlation engine (will be provided), analyze the “goodness” of the engine quantitatively.  Propose additional mapping algorithms to improve the mapping.  Implement the algorithms.

4. Show that your tool indeed improves the mapping on realistic, industrial-size designs.

5. (Bonus) Is 100% mapping possible for all designs? What is a reasonable metric for 100% mapping?  Can 100% mapping be achieved in reasonably low computation time?

 

Expected Product:

Propose an improvement to the existing automatic mapping engine.  Demonstrate the improvement on an industrial-size example.
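
To illustrate what a low-complexity, rule-based mapping might look like, here is a minimal sketch that normalizes gate-level names with regular-expression renaming rules and then matches them against RTL names with a hash lookup, so each gate name is handled in roughly constant time. The two rules (bus-bit flattening such as `data_reg_3_` and a `_reg` suffix) are assumptions chosen for illustration; the actual naming conventions depend on the synthesis tool and its settings, and this is not the engine that will be provided.

```python
import re

# Hypothetical renaming rules, applied in order: (pattern, replacement).
RULES = [
    (re.compile(r"_(\d+)_$"), r"[\1]"),           # data_reg_3_ -> data_reg[3]
    (re.compile(r"_reg(?=\[\d+\]$|$)"), ""),      # data_reg[3] -> data[3]
]


def normalize(gate_name: str) -> str:
    """Rewrite a gate-level name toward its presumed RTL name."""
    for pattern, repl in RULES:
        gate_name = pattern.sub(repl, gate_name)
    return gate_name


def correlate(rtl_names, gate_names):
    """Return (mapped gate->RTL dict, list of unmapped gate names)."""
    rtl = set(rtl_names)                          # hash set: O(1) lookups
    mapped, unmapped = {}, []
    for g in gate_names:
        key = normalize(g)
        if key in rtl:
            mapped[g] = key
        else:
            unmapped.append(g)
    return mapped, unmapped
```

A simple coverage figure such as `len(mapped) / len(gate_names)` could serve as one candidate for the quantitative “goodness” metric mentioned in step 3, and the `unmapped` list is what a more expensive fallback (for example, edit distance) would work on.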

 

3) Combinational RTL-gate correlation engine

Goal:

At the RTL level, combinational circuits are usually modeled as programming statements such as arithmetic operations and conditional statements. After synthesis to the gate level, these objects are modeled as combinational gates and nets. Due to the optimizations present in state-of-the-art logic synthesis tools, a combinational object at the gate level may not have an exact correspondence at the RTL level. In this project, we assume sequential objects such as registers and latches are correlated exactly. Therefore, the remaining work is to correlate combinational blocks between sequential boundaries. The goal is to correlate an RTL statement to a minimum set of gates, and a gate to a minimum set of RTL statements, given a typical industrial design. Different RTL language constructs are handled differently in synthesis, so they should be studied case by case in correlation.

 

Procedure:

1. Study and understand the logic synthesis process and its common transformations (e.g. Synopsys Design Compiler).

2. Study and understand the current gate-to-RTL correlation technique and its limitations.

3. Study RTL-level HDL language constructs and their synthesis algorithms for combinational objects, such as arithmetic operations, if-else statements, case statements, and assignments. For each category of objects, propose a correlation mechanism to the gate-level objects. Devise a metric for correlation, and keep in mind that the mapping will not be exclusive (multiple RTL statements may correlate to the same gate).  How do sharing and common sub-expression extraction affect the correlation?

4. Correlate individual gates to RTL statements that may “have something to do with” the gate. Devise a metric for correlation.

5. (Bonus) Can we sacrifice optimality in design for 100% RTL-to-gate correlation?  What’s the cost?  Justify your answer.

 

Expected Product:

A correlation engine for RTL->gate on a statement-by-statement basis, and a correlation engine for gate->RTL on a gate-by-gate basis.
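
One possible starting point for the gate->RTL direction, given that sequential objects are assumed to be correlated exactly, is to trace each gate's fan-in cone back to the sequential boundary and shortlist the RTL statements that read any of the same registers. The sketch below assumes a toy netlist representation (a dictionary from gate name to its input names) and a hypothetical table `rtl_reads` of the registers each RTL statement reads; the overlap test is an illustrative heuristic, not a method prescribed by this project.

```python
def fanin_support(netlist, node, boundary):
    """Boundary objects (registers, primary inputs) reachable backwards from node."""
    support, stack, seen = set(), [node], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        for src in netlist.get(n, []):
            if src in boundary:
                support.add(src)      # stop at the sequential boundary
            else:
                stack.append(src)     # keep walking through combinational gates
    return support


def candidate_statements(netlist, gate, boundary, rtl_reads):
    """RTL statements whose read set overlaps the gate's fan-in support."""
    support = fanin_support(netlist, gate, boundary)
    return sorted(s for s, reads in rtl_reads.items() if reads & support)


# Toy example: g2 = f(g1, rc), g1 = f(ra, rb), g3 = f(rd)
netlist = {"g1": ["ra", "rb"], "g2": ["g1", "rc"], "g3": ["rd"]}
boundary = {"ra", "rb", "rc", "rd"}
rtl_reads = {"y = (a & b) | c": {"ra", "rb", "rc"}, "z = ~d": {"rd"}}
```

Here `candidate_statements(netlist, "g2", boundary, rtl_reads)` shortlists only the statement for `y`. Because sharing and common sub-expression extraction can make one gate serve several statements, the result is deliberately a list rather than a single match.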

 


4) Automatic RTL-gate incremental synthesis for Engineering Change Order (ECO)

Goal:

This is a further research project based on gate-to-RTL correlation, including both sequential objects and combinational objects. Sometimes, at a late design stage, the user would like to change a small part of the design without going through the entire tool flow again. This process is called an engineering change order (ECO). This project focuses on automatic gate-level netlist modification driven by changes in the RTL design.

 

Procedure:

1. Study and understand the logic synthesis process and its common transformations (e.g. Synopsys Design Compiler).

2. Study and understand the current gate-to-RTL correlation technique and its limitations.

3. Identify and study possible ECO changes at the RTL level.

4. Propose automatic mechanisms and algorithms for netlist modification from RTL ECO changes, based on the current RTL-to-gate correlation technique.  Implement the incremental synthesis algorithm. You may consider a simple mapping from RTL to gates.  Demonstrate it on a realistic industrial example.

5. (Bonus) When should incremental synthesis take place, and when is re-synthesis of the entire design preferable?  Is a simple RTL-to-gate transformation (i.e. no fancy logic synthesis technique) always possible, and what is its cost in terms of suboptimality?

 

Expected Product:

An effective automatic mechanism for gate-level netlist modification from elemental changes in the RTL design.
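
As a rough sketch of the first half of such a mechanism, the code below finds which RTL lines changed between two versions and looks up the gates correlated with them. It assumes a line-level correlation table (`correlation`, mapping an old RTL line index to a set of gate instance names) already exists; that table, the gate names, and the helper names are hypothetical. Newly inserted lines have no old index, so they would need fresh synthesis and are not covered here.

```python
import difflib


def changed_lines(old, new):
    """Indices of lines in `old` that are replaced or deleted in `new`."""
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = set()
    for tag, i1, i2, _j1, _j2 in sm.get_opcodes():
        if tag in ("replace", "delete"):
            changed.update(range(i1, i2))
    return changed


def affected_gates(old, new, correlation):
    """Gates correlated with any changed RTL line (candidates for re-synthesis)."""
    gates = set()
    for idx in changed_lines(old, new):
        gates |= correlation.get(idx, set())
    return gates


old_rtl = ["assign y = a & b;", "assign z = c | d;", "assign w = ~e;"]
new_rtl = ["assign y = a & b;", "assign z = c ^ d;", "assign w = ~e;"]
correlation = {0: {"U1"}, 1: {"U2", "U3"}, 2: {"U4"}}   # hypothetical table
```

With these inputs, `affected_gates(old_rtl, new_rtl, correlation)` returns `{"U2", "U3"}`. The ratio of affected gates to total gates is one quantity that could inform the bonus question about when a full re-synthesis is preferable.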

 


5) Formal and semi-formal technique for RTL-gate correlation

Goal:

The current design correlation technique relies on user-specified renaming rules and heuristic name changes rather than formal methods, due to performance issues. But this also introduces possible inaccuracy by correlating two objects that do not correspond exactly, which makes later verification or debugging error-prone. This project is to propose an effective but still efficient technique to verify the correctness of the design correlation with formal or semi-formal methods.  The idea is to start with correlation, and to apply formal and semi-formal methods locally to keep the complexity in check.

 

Procedures:

1. Study and understand the logic synthesis process and its common transformations (e.g. Synopsys Design Compiler).

2. Study the current gate-to-RTL correlation technique. Understand the limitations that may introduce inaccuracy in the correlation results.

3. Study the efficiency and effectiveness of current equivalence checking engines (e.g. Synopsys Formality).

4. Study possible techniques, or combinations of techniques, that can be used to verify the correctness of the correlation, including both formal and semi-formal techniques; these include but are not limited to local equivalence checking and random simulation.  Implement a simple tool/script and test it out on a large industrial design.

5. (Bonus) Analyze the theoretical value of correlation, and of correlation followed by equivalence checking.  How much of the correlation do we have to declare “correct”, and how much more equivalence checking will we need, in order to declare equivalence in its entirety?

 

Expected Product:

Propose a practical solution for the verification of design correlation between different levels of abstraction. Implement a simple tool/script that can “call” the correlation and equivalence engines appropriately.
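
Random simulation, one of the semi-formal techniques mentioned in step 4, can be sketched as follows: drive both sides of a claimed correlation with the same random input vectors and look for a mismatch. The two cones are modeled here as plain Python functions over 0/1 values, which is an assumption made for illustration; in practice they would come from simulating the RTL and the netlist.

```python
import random


def random_sim_check(f_rtl, f_gate, n_inputs, trials=1000, seed=0):
    """Return (True, None) if no mismatch was found, else (False, counterexample)."""
    rng = random.Random(seed)
    for _ in range(trials):
        vec = [rng.getrandbits(1) for _ in range(n_inputs)]
        if f_rtl(*vec) != f_gate(*vec):
            return False, vec       # a single mismatch refutes the correlation
    return True, None               # no mismatch seen: evidence, not proof


def rtl_cone(a, b, c):
    return (a & b) | c              # RTL view: y = (a & b) | c


def gate_cone(a, b, c):
    nand_ab = 1 - (a & b)           # NAND gate
    not_c = 1 - c                   # inverter
    return 1 - (nand_ab & not_c)    # NAND of the two: equals (a & b) | c


def wrong_cone(a, b, c):
    return (a | b) | c              # a plausible but incorrect correlation
```

`random_sim_check(rtl_cone, gate_cone, 3)` finds no mismatch, while `random_sim_check(rtl_cone, wrong_cone, 3)` returns a counterexample. A pass only raises confidence, which is why the project pairs this kind of check with local equivalence checking.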

6) A practical C++-to-VHDL translator

Goal:

DSP algorithms are usually developed and written in C/C++. When such a DSP algorithm is implemented in hardware, whether as a pure hardware implementation, as part of a HW/SW solution, or as an FPGA implementation, a manual translation into VHDL/Verilog is usually required.  Such a manual translation is tedious and error-prone.  Translation of some constructs is straightforward (e.g. addition, subtraction, variable declaration, loop, branch), but for others it can be entirely non-trivial (e.g. pointer reference, classes, libraries, dynamic memory allocation). While a couple of recently developed commercial tools may be available for such a task (e.g. Forte, Celoxica), we want to know exactly what it takes to do such a translation automatically and efficiently.
In this project, you will write a translator that works well at least for the specific example C++ code.
There are several commercial and academic tools available, such as Impulse CoDeveloper and Spark, which can be used as references. It is recommended to use an existing front end, such as gcc, to save some work and leave more time for the VHDL back-end generator.

Procedure:

1. Convert the given C++ program to VHDL by hand to understand what has to be done.
2. Research existing C-to-VHDL generators and C front ends.
3. Write a C-to-VHDL generator for the specific examples.

Expected Product:

A working translator for the given C++ code segment.
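
To give a feel for the "straightforward" end of the translation, here is a toy sketch that turns a single arithmetic assignment into a VHDL concurrent signal assignment. It uses Python's `ast` module as a stand-in front end, which works only because simple arithmetic expressions happen to share syntax between C++ and Python; a real translator would use a C/C++ front end as recommended above. Types, bit widths, declarations, and timing are ignored entirely, and the function names are hypothetical.

```python
import ast

# Operators handled by this sketch; anything else is rejected.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def to_vhdl(node):
    """Recursively emit a fully parenthesized VHDL expression."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return str(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return f"({to_vhdl(node.left)} {OPS[type(node.op)]} {to_vhdl(node.right)})"
    raise NotImplementedError(f"unsupported construct: {ast.dump(node)}")


def translate_assignment(stmt):
    """Translate e.g. 'y = a + b * c;' into 'y <= (a + (b * c));'."""
    tree = ast.parse(stmt.strip().rstrip(";"))
    assign = tree.body[0]
    if (not isinstance(assign, ast.Assign) or len(assign.targets) != 1
            or not isinstance(assign.targets[0], ast.Name)):
        raise ValueError("expected a single 'name = expression' statement")
    return f"{assign.targets[0].id} <= {to_vhdl(assign.value)};"
```

`translate_assignment("y = a + b * c;")` yields `y <= (a + (b * c));`, while division is rejected with `NotImplementedError`. The constructs named above as non-trivial (pointers, classes, dynamic allocation) have no such one-to-one mapping, and handling them is where the real project work lies.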

7) JPEG decoder energy analysis (or any other data-centric application)

Goal:

As video and audio playback and recording become a requirement in many portable devices, such as cell phones, portable media players, and personal digital assistants, energy and power become important issues when designing multimedia processing devices, in order to extend battery life or reduce the battery requirement. These kinds of video and audio processing are usually data-centric: streams of data pass through the same lines of code. There is a lot of research interest in using profile-driven analysis to reduce energy and power consumption. Some studies have shown the possibility of reducing switching activity in logic gates by minimizing switching activity at the behavioral level for certain input data. However, when the behavioral-level description is passed to logic synthesis tools, the switching activity characteristics at the gate level are non-deterministic. In this study we would like to show that for a data-centric application, such as a JPEG decoder, the energy consumption of the application does not depend on the input data. As a result, energy consumption can be annotated to blocks of data-centric hardware description independently of the inputs.  The study involves obtaining HDL code for the JPEG decoder algorithm, synthesizing the HDL code to gate level, running gate-level simulations with different images, and calculating energy consumption.

Procedures:

1. Obtain a JPEG decoder C/C++ description.
2. Synthesize the HDL to gate level and verify correctness at the gate level.
3. Run various images through gate-level simulation and calculate energy consumption.
4. Analyze the energy consumption across different images.

Expected Product:

Show, with real low-level simulation numbers, how power/performance can vary for different "typical" inputs.
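
The analysis in steps 3 and 4 rests on switching activity. As a minimal sketch, the code below counts bit toggles in a trace of values on a bus and converts the count to dynamic energy using the standard estimate of one half C times Vdd squared per transition. The traces, the uniform per-bit capacitance, and the supply voltage in any call are assumptions for illustration; real numbers would come from gate-level simulation and the technology library.

```python
def toggle_count(trace):
    """Total number of bit flips between consecutive values in a signal trace."""
    return sum(bin(a ^ b).count("1") for a, b in zip(trace, trace[1:]))


def dynamic_energy(trace, c_farads, vdd):
    """Estimate dynamic energy, assuming every bit has the same capacitance."""
    # Each transition costs roughly 1/2 * C * Vdd^2.
    return 0.5 * c_farads * vdd ** 2 * toggle_count(trace)


# Hypothetical 4-bit bus traces recorded while decoding two different images.
trace_image_a = [0b0000, 0b1111, 0b1110, 0b0001]
trace_image_b = [0b0101, 0b1010, 0b0101, 0b1010]
```

Comparing `dynamic_energy(trace_image_a, 1e-15, 1.2)` with the same call on `trace_image_b` gives a per-image figure; repeating this over many images and many nets is one way to examine how strongly energy depends on the input data.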

8) MPSoC Energy Analysis

Student: Eric Cheung

Goal:

Multimedia processing has emerged as a must in most portable devices. As the desired time-to-market decreases and the complexity of applications increases, there is an increasing trend toward building a single chip that integrates multiple processors (MPSoC), enabling designers to map the embedded software to different processors and explore different architectural alternatives to meet performance and energy constraints. Multimedia processing applications are classified as data-centric applications. Studies have found that for data-centric applications, most of the energy is consumed in memory units, i.e. main memory, scratch pad memory, FIFOs, and registers.

Energy consumption in a single processor has been well studied over the past decades. Research done by IMEC has concentrated on memory address generation, loop manipulation, and source transformation to avoid unnecessary copying of data, maximize the use of cache to reduce storage requirements, and minimize data transfer between cache and main memory, thereby reducing the energy spent transferring data between cache and main memory.

However, in an MPSoC environment, there are multiple processors, and each processor has its own cache. Two processes are not able to share the same cache line if they reside on different processors. Many of the energy optimization techniques for single-processor SoCs are no longer applicable in an MPSoC environment, and other energy optimization opportunities emerge.

A very powerful programming model for MPSoC is YAPI, which separates computation and communication in the application. Computation processes are mapped onto multiple processors, and communication channels are implemented in memory units. Hence the architectural decisions on the implementation of the communication channels will have a huge impact on the energy consumption of the MPSoC device. MPSoC enables a broader range of architectural implementations of communication channels than a traditional SoC device, and the energy consumption of different communication channel implementations has not been well studied.

In this study, we first construct an MPSoC environment and analyze the energy consumed in communication under different communication channel implementations, focusing on different sizes and implementations of FIFOs and scratch pad memory. Then we explore the opportunity to reduce energy consumption by choosing appropriate implementations for different communication channels.

To make the methodology realistic, in the final stage we will use the Picture-In-Picture (PIP) design from Philips to demonstrate the results.

Procedures:

1. First, build an MPSoC environment that allows us to simulate an application written in YAPI and obtain the energy consumption in the memory units. A simple producer-consumer example is used to explore the tools. Currently, SystemC-HDL co-simulation is used to simulate the MPSoC design. Computations are modeled as SystemC modules, and communications are modeled as hardware. A FIFO-like interface is written to connect the SystemC modules and HDL modules. In this stage, we assume SystemC modules can only communicate using hardware communication implementations.

2. Implement communication channels using mainly two hardware constructs: dedicated FIFOs and scratch pad memory. Different arbitration and memory management schemes will be investigated and implemented for sharing the scratch pad memory among multiple YAPI communication channels.

3. Use a more complicated design, such as a multiple-producer, multiple-consumer example or an MPEG decoder, to investigate the tradeoff between energy consumption, clock cycles, and area for memory units when choosing different implementations for communication channels.

4. Use the PIP design from Philips to demonstrate the methodology.

5. Use a cycle-accurate processor model to simulate the computations in the YAPI software design. YAPI computations can be implemented differently: a YAPI computation can run on a single processor; multiple YAPI computations can share one processor; or a YAPI computation can be implemented as hardware. Investigate the energy consumption, clock cycles, and area under different hardware-software partitionings and different YAPI process mappings.

Expected product:

A methodology and associated tool for exploring different HW/SW partitionings and different communication implementations for energy/performance trade-off and analysis.
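
The producer-consumer exploration in step 1 can be mimicked at a very abstract level. The sketch below models a bounded FIFO channel that counts its read and write accesses, so that a per-access energy cost can be applied afterwards and different FIFO depths can be compared by stall count. It is a plain Python stand-in, not YAPI or SystemC; the class, the fixed producer/consumer rates, and the per-access energy values are all assumptions for illustration.

```python
from collections import deque


class CountingFifo:
    """Bounded FIFO channel that records how many accesses it serves."""

    def __init__(self, depth):
        self.depth = depth
        self.buf = deque()
        self.reads = 0
        self.writes = 0

    def write(self, token):
        if len(self.buf) >= self.depth:
            return False                # full: the producer must stall
        self.buf.append(token)
        self.writes += 1
        return True

    def read(self):
        if not self.buf:
            return None                 # empty: the consumer must stall
        self.reads += 1
        return self.buf.popleft()

    def energy(self, e_read, e_write):
        """Access energy under hypothetical per-read and per-write costs."""
        return self.reads * e_read + self.writes * e_write


def simulate(depth, n_tokens):
    """Producer tries to write every cycle; consumer reads every other cycle."""
    fifo = CountingFifo(depth)
    sent, received, cycle, stalls = 0, [], 0, 0
    while len(received) < n_tokens:
        if sent < n_tokens:
            if fifo.write(sent):
                sent += 1
            else:
                stalls += 1
        if cycle % 2 == 1:
            token = fifo.read()
            if token is not None:
                received.append(token)
        cycle += 1
    return fifo, received, cycle, stalls
```

Running `simulate(2, 8)` and `simulate(8, 8)` transfers the same tokens with the same number of accesses, but the shallow FIFO stalls the producer while the deep one does not. Weighing that difference against the larger memory's area and per-access energy is the kind of trade-off the methodology is meant to expose.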