The Mac Set


AltiVec Performance Comparison Table

Columns: Media Function | Data Set | AltiVec G4 Cycles, SIM_G4 Verified (1) | Speedup (Optimized C PowerPC) | Intel-MMX Pentium
Video : H.263 Functions
8x8 Forward DCT (Scaled Chen Algorithm)
Input:
8x8 image (diff) pixel block
Data Size: 16-bits
100 11.4 ?
(Lee Huang Algorithm)
Input: SP Float
Output: 32b integer
252 (2) 3.6 (2) ?
8x8 Inverse DCT
(Scaled Chen Algorithm)
Input/Output:
8x8 image block
Data Size: 16-bits
101.7 per block 12.3 240 Cycles (AAN Algo.)
Motion Estimation 176x144 pixels image block
Data Size: Bytes
90.7 per 16x16 macroblock 16 2x over Scalar
Quantization Input:
8x8 DCT output block
INTER macroblock only
Input Data Size: 16-bits
Output Data Size: bytes
96.8 12.5 ?
Dequantization Input:
8x8 block from VLC decode
INTER macroblock only
Input Data Size: bytes
Output Data Size: 16-bits
44 11 ?
Color Space Conversion
(RGB <-> YCbCr)
(CCIR601 standard)
RGB -> YCbCr

Input/Output Data Size: bytes
2.3 / pixel 9.6 ?
YCbCr -> RGB

Input/Output Data Size: bytes
2.24 / pixel 7 ?
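For reference, the two color space conversion rows above can be sketched in scalar form as follows. This is a minimal illustration, not the benchmarked AltiVec kernel. It assumes the common 8-bit "studio range" form of CCIR 601 (Y in 16..235, Cb/Cr in 16..240); the table does not say which variant was measured, and the function names are chosen here for illustration.

```python
def _clamp8(value):
    """Round to the nearest integer and clamp to the byte range 0..255."""
    return max(0, min(255, int(round(value))))


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr (assumed CCIR 601 studio range)."""
    y = 16 + (65.481 * r + 128.553 * g + 24.966 * b) / 255
    cb = 128 + (-37.797 * r - 74.203 * g + 112.0 * b) / 255
    cr = 128 + (112.0 * r - 93.786 * g - 18.214 * b) / 255
    return _clamp8(y), _clamp8(cb), _clamp8(cr)


def ycbcr_to_rgb(y, cb, cr):
    """Convert one 8-bit YCbCr pixel back to 8-bit RGB (inverse of the above)."""
    luma = 1.164383 * (y - 16)
    r = luma + 1.596027 * (cr - 128)
    g = luma - 0.391762 * (cb - 128) - 0.812968 * (cr - 128)
    b = luma + 2.017232 * (cb - 128)
    return _clamp8(r), _clamp8(g), _clamp8(b)
```

Because both directions round to bytes, a round trip can differ from the original pixel by a small amount. A vector implementation would typically replace the floating-point constants with fixed-point multiplies.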
Audio : Dolby AC3 Functions
Inverse FFT
Bailey's Algorithm
64 complex taps
128 SP Floats
603 3.6 ?
128 complex taps
256 SP Floats
1700 3.5 ?
IMDCT Function
Includes: i) IFFT
ii) pre- and iii) post-
processing functions
Short blocks
256 SP Floats
2008 4 ?
Normal blocks
256 SP Floats
2526 3.8 ?
Windowing Input: 256 SP Floats (IFFT Stage Output)
Output: 256 halfwords:
PCM Output
Delay Buffer
834 / kernel 4.9 ?
3D Graphics
Note: Scalar C code obtained from the Mesa library
Matrix-Vector Multiplication
Datatype: SP Float
Input: 4x4 matrix and one 4-element vector 17 3.7 ?
Input: 4x4 matrix and multiple 4-element vectors 7.5 per vector 8.0 ?
Matrix-Matrix Multiplication Input: 2 4x4 matrices of SP Floats
Output: 4x4 matrix of SP Floats
36.5 6.2 ?
Bresenham Line Drawing
Strictly serial algorithm due to OpenGL interface
Would parallelize much better along lines
Input: x,y co-ordinates of 2 points : 16 bits
Output: 8 consecutive points on same line
3.23 / pixel 1.5-2.1 (depending on slope) ?
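The Bresenham row above takes two endpoints and produces consecutive points on the line. A minimal scalar sketch of the integer error-accumulation algorithm is shown below for reference. It is not the Mesa or AltiVec code measured in the table, and the function name `bresenham` is chosen here for illustration.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer points on the line from (x0, y0) to (x1, y1), inclusive."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    err = dx + dy  # running error term; its sign decides which axis steps next
    points = []
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        doubled = 2 * err
        if doubled >= dy:  # step along x
            err += dy
            x0 += step_x
        if doubled <= dx:  # step along y
            err += dx
            y0 += step_y
    return points
```

Each point depends on the error term left by the previous one, which is the serial dependency the row notes refer to. For example, `bresenham(0, 0, 7, 3)` yields 8 points, matching the 8-point output size listed in the row.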
Line Color Interpolation
Most time consuming OpenGL kernel in Wireframe Animation
24-bit precision (OpenGL: 19-bits)
Input: Initial and Delta Color Vectors [R G B A]
Output: Interpolated Color Vectors
2.71 / pixel 2.9 ?
Buffer Accumulation
Used in Anti-Aliasing
Input: Pixel vectors [R G B A]: bytes
Four scale factors: SP Floats
Input-Output: Per Pixel Accumulate vectors
[R' G' B' A']: halfwords
5.3 / pixel 17.5 ?
Line Clipping (2D)
(Liang-Barsky Algorithm)
Input:
clip region coordinates: SP Floats
set of line vectors (x0,y0,x1,y1): SP Floats
Output:
set of clipped line vectors (x0,y0,x1,y1): SP Floats
28.5 / line 6.6 ?
Bezier Curve Drawing
(de Casteljau's Algorithm)
Input:
4 points (x,y) of curve control: halfwords
Output:
64 points of the same curve: halfwords
2.48 / output_point 6.3 ?
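The Bezier row above maps 4 control points to 64 curve points. As a rough scalar illustration, de Casteljau's algorithm evaluates the curve by repeated linear interpolation between neighboring control points. The sketch below assumes the 64 points are sampled at evenly spaced parameter values, which the table does not state. It works in floating point, whereas the benchmark uses halfwords, and the function names are chosen here for illustration.

```python
def de_casteljau(ctrl, t):
    """Evaluate a Bezier curve at parameter t (0..1) by repeated interpolation."""
    pts = [(float(x), float(y)) for x, y in ctrl]
    while len(pts) > 1:
        # Replace each neighboring pair by the point a fraction t along it.
        pts = [
            ((1 - t) * xa + t * xb, (1 - t) * ya + t * yb)
            for (xa, ya), (xb, yb) in zip(pts, pts[1:])
        ]
    return pts[0]


def bezier_points(ctrl, count=64):
    """Sample `count` points along the curve (even spacing in t is an assumption)."""
    return [de_casteljau(ctrl, i / (count - 1)) for i in range(count)]
```

Each output point is independent of the others, so the 64 evaluations are a natural fit for data-parallel execution.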
Image Effects
Separable Convolution (3x3)
16-bit kernel coefficients
128 x 128 pixels 1.09 / pixel ? Video Loop Filter: 5.5
256 x 256 pixels 1.93 / pixel ? ?
512 x 512 pixels 1.94 / pixel ? ?
1024 x 1024 pixels 2.25 / pixel ? ?
Color Space Conversion
RGB to YUV
4800 pixels 2.25 / pixel ? 8 cycles/pixel
Bilinear Interpolation
Part of Texture Mapping:
Pixel color determination
128 x 128 pixels 26.7 / pixel 6.4 66 cycles/pixel
Median Filter (3x3)
Replaces center pixel in a 3x3 window by median of sorted pixels
128 x 128 pixels 1.23 / pixel ? 415 (!) cycles/pixel
L-Filter (3x3): Order Statistic Filter
16-bit kernel coefficients
Replaces center pixel in a 3x3 window by weighted sum of sorted pixels
128 x 128 pixels 5.3 / pixel ? ?
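The median filter and L-filter rows above share one structure: sort the 9 pixels of each 3x3 window, then take a weighted sum of the sorted values. The median is the special case where only the middle weight is non-zero. The scalar sketch below illustrates this under some assumptions not given in the table: border pixels are left unchanged, weights are integers with an integer divisor `scale`, and all names are chosen here for illustration.

```python
def l_filter_3x3(image, weights, scale):
    """Order-statistic filter: weighted sum of the sorted 3x3 window values.

    image   -- list of rows of integer pixel values
    weights -- 9 integer coefficients, applied to the sorted window
    scale   -- integer divisor applied to the weighted sum
    Border pixels are copied unchanged (an assumption, not from the table).
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = sorted(
                image[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            )
            out[y][x] = sum(w * v for w, v in zip(weights, window)) // scale
    return out


def median_3x3(image):
    """Median filter as an L-filter that selects the 5th of the 9 sorted values."""
    return l_filter_3x3(image, [0, 0, 0, 0, 1, 0, 0, 0, 0], 1)
```

With all nine weights set to 1 and `scale` set to 9, the same routine reduces to a 3x3 mean filter.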
Communications (Modems and Telephony)
Galois Field Multiplication Input: Multiplicands in add form
Output: Result in add form
Cycles per multiply: GF(16): 0.625, GF(256): 2.125
Speedup: GF(16): 16, GF(256): 5
Intel-MMX: ?
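As a scalar reference for the Galois field row above, the sketch below multiplies two field elements with the shift-and-XOR method. It assumes that "add form" means the usual polynomial (bit-vector) representation, which the table does not define. The reduction polynomials are also assumptions, supplied by the caller: for example 0x13 (x^4 + x + 1) for GF(16) and 0x11B (the AES polynomial) for GF(256).

```python
def gf_mul(a, b, bits, poly):
    """Multiply a and b in GF(2**bits), reducing by `poly` (top bit included)."""
    result = 0
    for _ in range(bits):
        if b & 1:
            result ^= a  # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << bits):  # degree overflow: subtract (XOR) the polynomial
            a ^= poly
    return result
```

For example, `gf_mul(0x57, 0x83, 8, 0x11B)` reproduces the worked multiplication from the AES specification.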
64-QAM Demodulator : Bit Packing
Transform vector of complex symbols (mapped to final form) to contiguous bitstream
Input: 32 halfwords
each: 4-bits of I and Q comp for 2 symbols
Output: 12 words = 384 bits
60 7.5 ?
CRC-32:
Standard Algorithm
Input: 128b data
Output: 32b CRC
96 1
CRC-32:
Kaplan's Algorithm
Input: 128b data
Output: 32b CRC
21 2.5?
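For orientation, a bit-serial CRC-32 can be sketched as below. It assumes the IEEE 802.3 polynomial in reflected form (0xEDB88320) with the usual initial and final inversion, since the table does not name the polynomial. It makes no attempt to reproduce Kaplan's algorithm from the second row, and the function name is chosen here for illustration.

```python
import zlib


def crc32_bitwise(data):
    """Bit-at-a-time CRC-32 over a bytes object (assumed IEEE 802.3 parameters)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


# A 128-bit (16-byte) input, matching the data size listed in the table.
block = bytes(range(16))
matches_zlib = crc32_bitwise(block) == zlib.crc32(block)
```

Under these assumed parameters the result agrees with `zlib.crc32`, so `matches_zlib` is `True`.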
Linear Prediction (LP)
using Levinson-Durbin (LD)

Datatype: SP Float

MMX Notes: 16b fixed, a different flavor of LD
4 LP Coeffs 102 2.48 390?
8 LP Coeffs 234 2.73 944?
12 LP Coeffs 388 3.16 1666?
16 LP Coeffs 569 3.42 2552?
Linear Prediction (LP)
using Schur Recursion

Datatype: SP Float
MMX: 16b fixed
4 LP Coeffs 64 4.78 299
8 LP Coeffs 142 6.06 746
12 LP Coeffs 238 7.08 1334
16 LP Coeffs 366 7.61 2061
Autocorrelation:

Input:
256 unsigned byte signal samples
Output:
32b coeffs
4 Coeffs 276 18.1
8 Coeffs 407 22.8
12 Coeffs 543 25.6
16 Coeffs 676 30.7
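In the autocorrelation method of linear prediction, autocorrelation values like those in the row above feed the Levinson-Durbin recursion listed earlier in this section. The scalar sketch below shows both steps. It uses Python floats rather than the byte and 32-bit types of the benchmark. It also assumes the convention that the predictor is x[n] ~ sum of a[k] * x[n-k], and the function names are chosen here for illustration.

```python
def autocorrelation(samples, count):
    """Return r[0..count-1], where r[k] = sum over n of samples[n] * samples[n-k]."""
    n = len(samples)
    return [
        sum(samples[i] * samples[i - lag] for i in range(lag, n))
        for lag in range(count)
    ]


def levinson_durbin(r, order):
    """Solve the LP normal equations from autocorrelation values r[0..order].

    Returns (coeffs, error): coeffs[k-1] is the weight of x[n-k] in the
    predictor, and error is the final prediction error energy.
    """
    a = [0.0] * (order + 1)
    error = float(r[0])
    for i in range(1, order + 1):
        if error == 0.0:
            break  # signal is perfectly predictable; stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error  # reflection coefficient for stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error
```

Each stage needs the coefficients of the previous stage, so the recursion is serial across stages. The sums inside a stage are where data parallelism is available.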
Long-Term Parameter
Computation

(GSM Module Number: 4.2.11)
Input Datatype:
signed 16b

Input Data:
60600 samples
1034 12.5
Miscellaneous Small Kernels
Small Table Lookups, e.g.,
16-way parallel lookup
Table-size: 16-256 elements
Element Size: byte
2-19 20-2.5 ?
Sorting, e.g.,
Batcher Sort
Element Size: byte
Input: Unsorted array
- 16 elements
Output: Sorted array
76 / array 10 ?
Input: 2 unsorted arrays
- 16 elements each
Output: 2 sorted arrays
- 16 elements each
45.5 / array 14 ?
Input: 2 sorted arrays
- 16 elements each
Output: 1 sorted array
- 32 elements
31 2-4 ?
Gamma Correction
(ITU-R Recommendation 709)
16 pixel values (0-255)
32-piecewise linear interpolation
10 4 ?
Arbitrary 128-bit Permutation 128-bit value 20 ? 4 cycles per bit!!
Associative Search Input:
Two 32 entry tables,
16b keys and 16b tags,
16b key to be looked up
Output:
16b tag
13 5.8
Gauss Elimination
for linear system
Datatype: SP Float
4 Variables 478 1.19
8 Variables 2170 1.15
12 Variables 5104 1.31
16 Variables 9824 1.42
Haar Transform (forward) Input: 8 2x2 byte pixel blocks
Output: 8 sets of 4 frequency bands
Band Elements: 16-bits
12 ? 48 cycles
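The forward Haar transform row above turns each 2x2 byte block into four frequency bands with 16-bit elements. A scalar sketch for one block is given below. The band ordering and the unnormalized sums and differences are assumptions, as the table does not specify them, and the function names are chosen here for illustration.

```python
def haar_2x2(block):
    """Forward 2x2 Haar transform of one block given as (a, b, c, d), row-major.

    Returns (average, horizontal, vertical, diagonal) as unnormalized sums and
    differences. For byte inputs every result fits in a signed 16-bit value.
    """
    a, b, c, d = block
    return (
        a + b + c + d,  # low-pass band
        a - b + c - d,  # horizontal detail
        a + b - c - d,  # vertical detail
        a - b - c + d,  # diagonal detail
    )


def haar_2x2_batch(blocks):
    """Transform a sequence of blocks, e.g. the 8 blocks listed in the table."""
    return [haar_2x2(block) for block in blocks]
```

Under this convention the block can be recovered exactly by applying the same sums and differences to the four bands and dividing by 4.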

(1) Instruction class latencies: permute: 2, simple-fixed: 1, complex-fixed: 2, float: 3. L1 size: 32KB each for D-cache and I-cache. L1-L2 interface: 1/2 processor clock, 128b data transfer. L2-Mem interface: 1/4 processor clock, 128b data transfer. L2 size: 2MB. L2 latency: 6-2 processor clocks. Memory type: SDRAM. Memory latency: 20-4 (44-4) processor clocks for page hit (miss).

(2) Compiler-claimed cycle count, not yet SIM_G4 verified.

