

CS161 – Design and Architecture of Computer Systems

**Performance Evaluation** 

UNIVERSITY OF CALIFORNIA, RIVERSIDE



## WHAT IS PERFORMANCE?

## **Understanding Performance**



- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed

## Response Time and Throughput

- Response time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on response time for now...

## **Relative Performance**



- Define Performance = 1/Execution Time
- "X is n times faster than Y"

Performance<sub>x</sub>/Performance<sub>y</sub> = Execution time<sub>y</sub>/Execution time<sub>x</sub> = n

- Example: time taken to run a program
  - 10s on A, 15s on B
  - Execution Time<sub>B</sub> / Execution Time<sub>A</sub> = 15s / 10s = 1.5
  - So A is 1.5 times faster than B

## **Relative Performance**



- Example: time taken to run a program
  - 60s on A, 30s on B
  - Execution Time<sub>B</sub> / Execution Time<sub>A</sub> = 30s / 60s = 0.5
  - So A is only 0.5 times as fast as B
  - Equivalently, B is 2 times faster than A
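The ratio in both examples can be checked with a few lines of Python. This is a minimal sketch; the function name `speedup` is an illustrative choice, not something defined in the slides.

```python
def speedup(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times as fast as Y.

    Performance = 1 / execution time, so
    Performance_X / Performance_Y = exec_time_y / exec_time_x.
    """
    return exec_time_y / exec_time_x


# First example: 10s on A, 15s on B
print(speedup(10, 15))  # 1.5 -> A is 1.5 times faster than B

# Second example: 60s on A, 30s on B
print(speedup(60, 30))  # 0.5 -> A is only half as fast as B
print(speedup(30, 60))  # 2.0 -> B is 2 times faster than A
```

A result below 1 means the first machine is the slower one, which is why the second example is better stated from B's point of view.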

## **Measuring Execution Time**



- Elapsed time
  - Total response time, including all aspects
    - Processing, I/O, OS overhead, idle time
  - Determines system performance
- CPU time
  - Time spent processing a given job
    - Discounts I/O time, other jobs' shares
  - Comprises user CPU time and system CPU time
  - Different programs are affected differently by CPU and system performance

## **CPU Clocking**



Operation of digital hardware is governed by a constant-rate clock



- Clock period: duration of a clock cycle
  - e.g., 250ps = 0.25ns = 250×10<sup>-12</sup>s
- Clock frequency (rate): cycles per second
  - e.g., 4.0GHz = 4000MHz = 4.0×10<sup>9</sup>Hz
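Clock period and clock rate are reciprocals, so the two examples above describe the same clock. A small sketch of the conversion (the helper names are hypothetical):

```python
def period_to_rate_hz(period_s: float) -> float:
    """Clock rate in Hz for a given clock period in seconds."""
    return 1.0 / period_s


def rate_to_period_s(rate_hz: float) -> float:
    """Clock period in seconds for a given clock rate in Hz."""
    return 1.0 / rate_hz


# 250 ps period <-> 4.0 GHz clock
print(f"{period_to_rate_hz(250e-12) / 1e9:.1f} GHz")  # 4.0 GHz
print(f"{rate_to_period_s(4.0e9) * 1e12:.0f} ps")     # 250 ps
```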

## **CPU Time**



$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count

## **CPU Time Example**



- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B
  - Aim for 6s CPU time
  - Can use a faster clock, but this causes 1.2 × as many clock cycles
- How fast must Computer B clock be?

$$\text{Clock Rate}_{B} = \frac{\text{Clock Cycles}_{B}}{\text{CPU Time}_{B}} = \frac{1.2 \times \text{Clock Cycles}_{A}}{6\,\text{s}}$$

$$\text{Clock Cycles}_{A} = \text{CPU Time}_{A} \times \text{Clock Rate}_{A} = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9}$$

$$\text{Clock Rate}_{B} = \frac{1.2 \times 20 \times 10^{9}}{6\,\text{s}} = \frac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$$
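The same calculation can be reproduced step by step in Python. The sketch below only restates the numbers from the example; the variable and function names are illustrative assumptions.

```python
def clock_cycles(cpu_time_s: float, clock_rate_hz: float) -> float:
    """CPU clock cycles = CPU time x clock rate."""
    return cpu_time_s * clock_rate_hz


# Computer A: 2 GHz clock, 10 s CPU time
cycles_a = clock_cycles(cpu_time_s=10.0, clock_rate_hz=2.0e9)

# Computer B needs 1.2x as many cycles and must finish in 6 s
cycles_b = 1.2 * cycles_a
clock_rate_b_hz = cycles_b / 6.0

print(f"Clock cycles on A: {cycles_a:.1e}")                  # 2.0e+10
print(f"Required clock rate for B: {clock_rate_b_hz / 1e9:.2f} GHz")  # 4.00 GHz
```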

## Instruction Count and CPI



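$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$

$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$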

- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

## **CPI Example**



- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?

$$\begin{aligned}
\text{CPU Time}_{A} &= \text{Instruction Count} \times \text{CPI}_{A} \times \text{Cycle Time}_{A} \\
&= I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps} && \text{A is faster...} \\
\text{CPU Time}_{B} &= \text{Instruction Count} \times \text{CPI}_{B} \times \text{Cycle Time}_{B} \\
&= I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps} \\
\frac{\text{CPU Time}_{B}}{\text{CPU Time}_{A}} &= \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2 && \text{...by this much}
\end{aligned}$$
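Because both machines run the same ISA, the instruction count cancels in the ratio, so any value of `I` gives the same answer. A minimal sketch, assuming an arbitrary instruction count of one billion purely for illustration:

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


I = 1e9  # hypothetical instruction count; it cancels in the ratio

time_a = cpu_time(I, cpi=2.0, cycle_time_s=250e-12)
time_b = cpu_time(I, cpi=1.2, cycle_time_s=500e-12)

print(f"CPU time A: {time_a:.2f} s")               # 0.50 s
print(f"CPU time B: {time_b:.2f} s")               # 0.60 s
print(f"A is faster by {time_b / time_a:.1f}x")    # 1.2x
```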

## **CPI in More Detail**



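If different instruction types take different numbers of cycles:

$$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$

Weighted average CPI:

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$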



## **CPI Example**



Alternative compiled code sequences using instructions of types INT, FP, and MEM

| Type            | INT | FP | MEM |
|-----------------|-----|----|-----|
| CPI for type    | 1   | 2  | 3   |
| IC in Program 1 | 2   | 1  | 2   |
| IC in Program 2 | 4   | 1  | 1   |

- Program 1: IC = 5
  - Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  - Avg. CPI = 10/5 = 2.0

- Program 2: IC = 6
  - Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  - Avg. CPI = 9/6 = 1.5
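The weighted-average calculation for both programs can be sketched as follows. The dictionary layout and function names are illustrative choices; the numbers come straight from the table.

```python
# Cycles per instruction for each instruction type (from the table)
CPI_BY_TYPE = {"INT": 1, "FP": 2, "MEM": 3}


def total_cycles(counts: dict) -> int:
    """Sum of CPI_i x instruction count_i over all instruction types."""
    return sum(CPI_BY_TYPE[kind] * n for kind, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Weighted average CPI = total clock cycles / total instruction count."""
    return total_cycles(counts) / sum(counts.values())


program_1 = {"INT": 2, "FP": 1, "MEM": 2}
program_2 = {"INT": 4, "FP": 1, "MEM": 1}

print(total_cycles(program_1), average_cpi(program_1))  # 10 2.0
print(total_cycles(program_2), average_cpi(program_2))  # 9 1.5
```

Program 2 executes more instructions but needs fewer cycles, because its mix favors the cheap INT instructions.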

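## **Performance Summary**

#### **The BIG Picture**

$$\text{CPU Time} = \text{IC} \times \text{CPI} \times T_c$$

where IC is the instruction count, CPI the average cycles per instruction, and $T_c$ the clock cycle time.
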
- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, T<sub>c</sub>

## **Power Trends**





In CMOS IC technology:

$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$



## **Reducing Power**



- Suppose a new CPU has
  - 85% of capacitive load of old CPU
  - 15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$
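Since the old values cancel, only the three scaling factors matter. A minimal sketch of the ratio, assuming the power model capacitive load × voltage² × frequency used in the equation above (the function name is hypothetical):

```python
def power_ratio(c_scale: float, v_scale: float, f_scale: float) -> float:
    """P_new / P_old when load, voltage and frequency are each scaled.

    Power is proportional to C x V^2 x F, so voltage counts twice.
    """
    return c_scale * v_scale**2 * f_scale


ratio = power_ratio(c_scale=0.85, v_scale=0.85, f_scale=0.85)
print(f"P_new / P_old = {ratio:.2f}")  # 0.52
```

The voltage term is squared, which is why lowering the supply voltage has been the most effective lever, and why hitting the voltage floor matters so much.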

- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
- How else can we improve performance?

## **Multiprocessors**



- Multicore microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization

## **AMD Opteron X2 Wafer**





X2: 300mm wafer, 117 chips, 90nm technology
X4: 45nm technology

## **Manufacturing ICs**





Yield: proportion of working dies per wafer

## **Integrated Circuit Cost**



$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{(1 + (\text{Defects per area} \times \text{Die area}/2))^2}$$

- Nonlinear relation to area and defect rate
  - Wafer cost and area are fixed
  - Defect rate determined by manufacturing process
  - Die area determined by architecture and circuit design
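The nonlinear effect of die area is easy to see numerically. The sketch below implements the three formulas above; the wafer cost, wafer area, and defect density are hypothetical values chosen only for illustration and do not come from the slides.

```python
def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    """Approximate dies per wafer (ignores edge losses)."""
    return wafer_area / die_area


def die_yield(defects_per_area: float, die_area: float) -> float:
    """Yield = 1 / (1 + defects_per_area * die_area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def cost_per_die(wafer_cost: float, wafer_area: float,
                 die_area: float, defects_per_area: float) -> float:
    """Cost per die = wafer cost / (dies per wafer x yield)."""
    dies = dies_per_wafer(wafer_area, die_area)
    return wafer_cost / (dies * die_yield(defects_per_area, die_area))


# Hypothetical inputs (assumptions, not data from the slides)
WAFER_COST = 5000.0   # cost units per wafer
WAFER_AREA = 70000.0  # mm^2
DEFECTS = 0.002       # defects per mm^2

for area in (50.0, 100.0, 200.0):
    cost = cost_per_die(WAFER_COST, WAFER_AREA, area, DEFECTS)
    print(f"die area {area:5.0f} mm^2 -> cost per die {cost:.2f}")
```

Doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller fraction of them work.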

## **SPEC CPU Benchmark**



- Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, ...
- SPEC CPU2006
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to reference machine
  - Summarize as geometric mean of performance ratios
    - CINT2006 (integer) and CFP2006 (floating-point)



## CINT2006 for Opteron X4 2356

| Name           | Description                   | IC×10<sup>9</sup> | CPI   | Tc (ns) | Exec time (s) | Ref time (s) | SPECratio |
|----------------|-------------------------------|-------------------|-------|---------|---------------|--------------|-----------|
| perl           | Interpreted string processing | 2,118             | 0.75  | 0.40    | 637           | 9,777        | 15.3      |
| bzip2          | Block-sorting compression     | 2,389             | 0.85  | 0.40    | 817           | 9,650        | 11.8      |
| gcc            | GNU C Compiler                | 1,050             | 1.72  | 0.40    | 724           | 8,050        | 11.1      |
| mcf            | Combinatorial optimization    | 336               | 10.00 | 0.40    | 1,345         | 9,120        | 6.8       |
| go             | Go game (AI)                  | 1,658             | 1.09  | 0.40    | 721           | 10,490       | 14.6      |
| hmmer          | Search gene sequence          | 2,783             | 0.80  | 0.40    | 890           | 9,330        | 10.5      |
| sjeng          | Chess game (AI)               | 2,176             | 0.96  | 0.40    | 837           | 12,100       | 14.5      |
| libquantum     | Quantum computer simulation   | 1,623             | 1.61  | 0.40    | 1,047         | 20,720       | 19.8      |
| h264avc        | Video compression             | 3,102             | 0.80  | 0.40    | 993           | 22,130       | 22.3      |
| omnetpp        | Discrete event simulation     | 587               | 2.94  | 0.40    | 690           | 6,250        | 9.1       |
| astar          | Games/path finding            | 1,082             | 1.79  | 0.40    | 773           | 7,020        | 9.1       |
| xalancbmk      | XML parsing                   | 1,058             | 2.70  | 0.40    | 1,143         | 6,900        | 6.0       |
| Geometric mean |                               |                   |       |         |               |              | 11.7      |

The high CPI values (e.g., mcf at 10.00) are due to high cache miss rates.
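The summary figure can be recomputed from the SPECratio column. This sketch takes the twelve ratios from the table and uses logarithms to form the geometric mean; the function name is an illustrative choice.

```python
import math

# SPECratio column of the CINT2006 table, in row order
SPEC_RATIOS = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
               14.5, 19.8, 22.3, 9.1, 9.1, 6.0]


def geometric_mean(values: list) -> float:
    """n-th root of the product, computed via the mean of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


print(f"Geometric mean of SPECratios: {geometric_mean(SPEC_RATIOS):.1f}")  # 11.7
```

Each ratio is reference time divided by measured execution time, so the geometric mean does not depend on which machine is chosen as the reference.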

## **SPEC Power Benchmark**



- Power consumption of server at different workload levels
  - Performance: ssj\_ops/sec
  - Power: Watts (Joules/sec)

$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Big/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$

## SPECpower\_ssj2008 for X4



| Target Load %    | Performance (ssj_ops/sec) | Average Power (Watts) |
|------------------|---------------------------|-----------------------|
| 100%             | 231,867                   | 295                   |
| 90%              | 211,282                   | 286                   |
| 80%              | 185,803                   | 275                   |
| 70%              | 163,427                   | 265                   |
| 60%              | 140,160                   | 256                   |
| 50%              | 118,324                   | 246                   |
| 40%              | 92,035                    | 233                   |
| 30%              | 70,500                    | 222                   |
| 20%              | 47,126                    | 206                   |
| 10%              | 23,066                    | 180                   |
| 0%               | 0                         | 141                   |
| Overall sum      | 1,283,590                 | 2,605                 |
| ∑ssj_ops/ ∑power |                           | 493                   |
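The overall figure in the last row can be recomputed from the per-load rows. In this sketch the data layout and names are illustrative; the 40% entry is taken as 92,035, the value consistent with the table's overall sum of 1,283,590.

```python
# (target load %, ssj_ops/sec, average power in watts)
SSJ_RESULTS = [
    (100, 231_867, 295),
    (90, 211_282, 286),
    (80, 185_803, 275),
    (70, 163_427, 265),
    (60, 140_160, 256),
    (50, 118_324, 246),
    (40, 92_035, 233),   # value consistent with the overall sum
    (30, 70_500, 222),
    (20, 47_126, 206),
    (10, 23_066, 180),
    (0, 0, 141),
]

total_ops = sum(ops for _, ops, _ in SSJ_RESULTS)
total_power = sum(watts for _, _, watts in SSJ_RESULTS)

print(total_ops)                       # 1283590
print(total_power)                     # 2605
print(round(total_ops / total_power))  # 493
```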

## Fallacy: Low Power at Idle

- Look back at the X4 power benchmark
  - At 100% load: 295W
  - At 50% load: 246W (83% of peak)
  - At 10% load: 180W (61% of peak)
- Google data center
  - Mostly operates at 10% to 50% load
  - At 100% load less than 1% of the time
- Consider designing processors to make power proportional to load
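The percentages above, and the idle case from the SPECpower table, can be checked with a short sketch. The dictionary holds only the wattages quoted in the benchmark table; the names are illustrative.

```python
# Average power in watts at selected target loads (from the X4 table)
POWER_WATTS = {100: 295, 50: 246, 10: 180, 0: 141}


def percent_of_peak(load: int) -> float:
    """Power at the given load as a percentage of power at 100% load."""
    return 100.0 * POWER_WATTS[load] / POWER_WATTS[100]


for load in (50, 10, 0):
    print(f"{load:3d}% load -> {percent_of_peak(load):.0f}% of peak power")
# 50% load -> 83% of peak power
# 10% load -> 61% of peak power
#  0% load -> 48% of peak power
```

Even with no work to do, the server still draws almost half of its peak power, which is the gap that energy-proportional design aims to close.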



## Pitfall: Amdahl's Law



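Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

- Example: multiply accounts for 80s of a 100s execution time
  - How much improvement in multiply performance is needed to get 5× overall?

$$20 = \frac{80}{n} + 20$$

- Can't be done! No finite $n$ makes $80/n = 0$.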

Corollary: make the common case fast
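A quick numerical sketch shows why the 5× target is out of reach: however large the multiply speedup, the 20s that is unaffected remains. The function name and the chosen improvement factors are illustrative.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Execution time after speeding up only the affected part."""
    return t_affected / factor + t_unaffected


T_MULTIPLY = 80.0  # seconds spent in multiply
T_OTHER = 20.0     # seconds spent elsewhere
T_TOTAL = T_MULTIPLY + T_OTHER

for n in (2, 10, 100, 1_000_000):
    t_new = improved_time(T_MULTIPLY, T_OTHER, n)
    print(f"multiply {n:>9,}x faster -> {t_new:8.4f} s, "
          f"overall speedup {T_TOTAL / t_new:.3f}x")
```

The overall speedup approaches 5× as `n` grows but never reaches it, since the new time is always strictly greater than 20s.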

## Pitfall: MIPS as a Performance Metric

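- MIPS: Millions of Instructions Per Second
  - Doesn't account for
    - Differences in ISAs between computers
    - Differences in complexity between instructions

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6}$$

- CPI varies between programs on a given CPU, so MIPS does too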
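To see how much MIPS can swing on a single CPU, the sketch below applies the standard definition MIPS = clock rate / (CPI × 10⁶) to two rows of the CINT2006 table (perl and mcf, both with a 0.40 ns cycle time). The function name is a hypothetical helper.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """Millions of instructions per second = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


CLOCK_RATE_HZ = 1.0 / 0.40e-9  # 0.40 ns cycle time -> 2.5 GHz

print(f"perl (CPI 0.75):  {mips(CLOCK_RATE_HZ, 0.75):.0f} MIPS")   # 3333 MIPS
print(f"mcf  (CPI 10.00): {mips(CLOCK_RATE_HZ, 10.0):.0f} MIPS")   # 250 MIPS
```

The same processor reports a MIPS figure more than ten times higher on one program than on another, so a single MIPS number says little about the machine.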

## **Concluding Remarks**



- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance