Media Function |
Data Set |
AltiVec G4 Cycles, SIM_G4 Verified (1) |
Speedup (over Optimized C PowerPC) |
Intel-MMX Pentium |
Video : H.263 Functions |
8x8 Forward DCT |
(Scaled Chen Algorithm)
Input:
8x8 image (diff) pixel block
Data Size: 16-bits |
100 |
11.4 |
? |
(Lee Huang Algorithm)
Input: SP Float
Output: 32b integer |
252 (2) |
3.6 (2) |
? |
8x8 Inverse DCT
(Scaled Chen Algorithm) |
Input/Output:
8x8 image block
Data Size: 16-bits |
101.7 per block |
12.3 |
240 cycles (AAN algorithm) |
Motion Estimation |
176x144 pixels image block
Data Size: Bytes |
90.7 per 16x16 macroblock |
16 |
2x over Scalar |
Quantization |
Input:
8x8 DCT output block
INTER macroblock only
Input Data Size: 16-bits
Output Data Size: bytes |
96.8 |
12.5 |
? |
Dequantization |
Input:
8x8 block from VLC decode
INTER macroblock only
Input Data Size: bytes
Output Data Size: 16-bits |
44 |
11 |
? |
Color Space Conversion
(RGB <-> YCbCr)
(CCIR601 standard) |
RGB -> YCbCr
Input/Output Data Size: bytes |
2.3 / pixel |
9.6 |
? |
YCbCr -> RGB
Input/Output Data Size: bytes |
2.24 / pixel |
7 |
? |
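For reference, the scalar work behind the RGB -> YCbCr row can be sketched as below. This is only an illustration, not the benchmarked code: the coefficients are the common 8-bit fixed-point approximation of the CCIR 601 (ITU-R BT.601) studio-range equations, and the names `ycbcr_t` and `rgb_to_ycbcr` are hypothetical.

```c
#include <stdint.h>

/* One converted pixel. Type and function names are illustrative only. */
typedef struct { uint8_t y, cb, cr; } ycbcr_t;

/* Scalar RGB -> YCbCr for a single pixel, studio range
   (Y in 16..235, Cb/Cr in 16..240).
   Assumption: integer coefficients approximating BT.601, scaled by 256. */
static ycbcr_t rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b)
{
    ycbcr_t out;

    /* +128 rounds to nearest before the divide-by-256 shift. */
    out.y = (uint8_t)(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);

    /* The chroma offset of 128 is folded in as (128 << 8) before shifting,
       so the shifted value is never negative. */
    out.cb = (uint8_t)((-38 * r - 74 * g + 112 * b + 128 + (128 << 8)) >> 8);
    out.cr = (uint8_t)((112 * r - 94 * g - 18 * b + 128 + (128 << 8)) >> 8);

    return out;
}
```

A vector version would apply the same multiply-accumulate pattern to many pixels per instruction, which is where a speedup of the kind listed in the table would come from.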
Audio : Dolby AC3 Functions |
Inverse FFT
Bailey's Algorithm |
64 complex taps
128 SP Floats |
603 |
3.6 |
? |
128 complex taps
256 SP Floats |
1700 |
3.5 |
? |
IMDCT Function
Includes: i) IFFT, ii) pre-processing, and
iii) post-processing functions |
Short blocks
256 SP Floats |
2008 |
4 |
? |
Normal blocks
256 SP Floats |
2526 |
3.8 |
? |
Windowing |
Input: 256 SP Floats (IFFT Stage Output)
Output: 256 halfwords:
PCM Output
Delay Buffer |
834 / kernel |
4.9 |
? |
3D Graphics |
Note: Scalar C code obtained from the GNU
Mesa library |
Matrix-Vector Multiplication
Datatype: SP Float |
Input: 4x4 matrix and one 4-element vector |
17 |
3.7 |
? |
Input: 4x4 matrix and multiple 4-element vectors |
7.5 per vector |
8.0 |
? |
Matrix-Matrix Multiplication |
Input: 2 4x4 matrices of SP Floats
Output: 4x4 matrix of SP Floats |
36.5 |
6.2 |
? |
Bresenham Line Drawing
Strictly serial algorithm because of the OpenGL interface
Parallelizes much better along lines |
Input: x,y coordinates of 2 points: 16 bits
Output: 8 consecutive points on same line |
3.23 / pixel |
1.5-2.1
Depending on slope |
? |
Line Color Interpolation
Most time-consuming OpenGL kernel in wireframe animation
24-bit precision (OpenGL: 19-bits) |
Input: Initial and Delta Color Vectors [R G B A]
Output: Interpolated Color Vectors |
2.71 / pixel |
2.9 |
? |
Buffer Accumulation
Used in Anti-Aliasing |
Input: Pixel vectors [R G B A]: bytes
Four scale factors: SP Floats
Input-Output: Per Pixel Accumulate vectors
[R' G' B' A']: halfwords |
5.3 / pixel |
17.5 |
? |
Line Clipping (2D)
(Liang-Barsky Algorithm) |
Input:
clip region coordinates: SP Floats
set of line vectors (x0,y0,x1,y1): SP Floats
Output:
set of clipped line vectors (x0,y0,x1,y1): SP Floats |
28.5 / line |
6.6 |
? |
Bezier Curve Drawing
(de Casteljau's Algorithm) |
Input:
4 points (x,y) of curve control: halfwords
Output:
64 points of the same curve: halfwords |
2.48 / output_point |
6.3 |
? |
Image Effects |
Separable Convolution (3x3)
16-bit kernel coefficients |
128 x 128 pixels |
1.09 / pixel |
? |
Video Loop Filter: 5.5 |
256 x 256 pixels |
1.93 / pixel |
? |
? |
512 x 512 pixels |
1.94 / pixel |
? |
? |
1024 x 1024 pixels |
2.25 / pixel |
? |
? |
Color Space Conversion
RGB to YUV |
4800 pixels |
2.25 / pixel |
? |
8 cycles/pixel |
Bilinear Interpolation
Part of Texture Mapping:
Pixel color determination |
128 x 128 pixels |
26.7 / pixel |
6.4 |
66 cycles/pixel |
Median Filter (3x3)
Replaces center pixel in a 3x3 window by median of sorted pixels |
128 x 128 pixels |
1.23 / pixel |
? |
415 (!) cycles/pixel |
L-Filter (3x3): Order Statistic Filter
16-bit kernel coefficients
Replaces center pixel in a 3x3 window by weighted sum of sorted pixels |
128 x 128 pixels |
5.3 / pixel |
? |
? |
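The 3x3 median filter row above describes replacing the center pixel of each window with the median of its nine sorted pixels. A minimal scalar sketch of that description follows. It is not the benchmarked implementation: the function names `median9` and `median3x3` are hypothetical, and copying border pixels unchanged is an assumption, since the table does not say how image edges are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Median of nine bytes: insertion-sort a copy, then take the middle element. */
static uint8_t median9(const uint8_t w[9])
{
    uint8_t s[9];
    for (int i = 0; i < 9; i++) {
        uint8_t v = w[i];
        int j = i;
        while (j > 0 && s[j - 1] > v) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = v;
    }
    return s[4];
}

/* 3x3 median filter over a width x height byte image.
   Assumption: border pixels are copied through unfiltered. */
static void median3x3(const uint8_t *src, uint8_t *dst,
                      size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            if (y == 0 || x == 0 || y + 1 == height || x + 1 == width) {
                dst[y * width + x] = src[y * width + x];
                continue;
            }
            /* Gather the 3x3 neighborhood centered on (x, y). */
            uint8_t w[9];
            size_t k = 0;
            for (size_t dy = 0; dy < 3; dy++)
                for (size_t dx = 0; dx < 3; dx++)
                    w[k++] = src[(y + dy - 1) * width + (x + dx - 1)];
            dst[y * width + x] = median9(w);
        }
    }
}
```

The L-filter row differs only in the last step: instead of picking the middle element of the sorted window, it forms a weighted sum of all nine sorted values.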
Communications (Modems and Telephony) |
Galois Field Multiplication |
Input: Multiplicands in add form
Output: Result in add form |
(per multiply)
GF(16): 0.625
GF(256): 2.125 |
GF(16): 16
GF(256): 5 |
? |
64-QAM Demodulator : Bit Packing
Transform vector of complex symbols (mapped to final form) to contiguous
bitstream |
Input: 32 halfwords
each: 4 bits of I and Q components for 2 symbols
Output: 12 words = 384 bits |
60 |
7.5 |
? |
CRC-32:
Standard Algorithm |
Input: 128b data
Output: 32b CRC |
96 |
1 |
|
CRC-32:
Kaplan's Algorithm |
Input: 128b data
Output: 32b CRC |
21 |
2.5? |
|
Linear Prediction (LP)
using Levinson-Durbin (LD)
Datatype: SP Float
MMX notes:
16b fixed-point, a different flavor of LD |
4 LP Coeffs |
102 |
2.48 |
390? |
8 LP Coeffs |
234 |
2.73 |
944? |
12 LP Coeffs |
388 |
3.16 |
1666? |
16 LP Coeffs |
569 |
3.42 |
2552? |
Linear Prediction (LP)
using Schur Recursion
Datatype: SP Float
MMX: 16b fixed-point |
4 LP Coeffs |
64 |
4.78 |
299 |
8 LP Coeffs |
142 |
6.06 |
746 |
12 LP Coeffs |
238 |
7.08 |
1334 |
16 LP Coeffs |
366 |
7.61 |
2061 |
Autocorrelation:
Input:
256 unsigned byte signal samples
Output:
32b coeffs |
4 Coeffs |
276 |
18.1 |
|
8 Coeffs |
407 |
22.8 |
|
12 Coeffs |
543 |
25.6 |
|
16 Coeffs |
676 |
30.7 |
|
Long-Term Parameter
Computation
(GSM Module Number: 4.2.11) |
Input Datatype:
signed 16b
Input Data:
60600 samples |
1034 |
12.5 |
|
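The table does not spell out what the "standard algorithm" CRC-32 row computes, so the sketch below assumes the usual bit-at-a-time formulation with the reflected IEEE 802.3 polynomial (0xEDB88320), an all-ones initial value, and a final inversion. The function name `crc32_bitwise` is hypothetical. For the 128-bit input in the table, `len` would be 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-serial CRC-32.
   Assumptions: reflected polynomial 0xEDB88320, initial value 0xFFFFFFFF,
   final XOR with 0xFFFFFFFF (the common IEEE 802.3 / zlib convention). */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        /* Process the byte one bit at a time, least significant bit first. */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;
}
```

Each bit depends on the result of the previous one, which is consistent with the speedup of 1 reported for this formulation.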
Miscellaneous Small Kernels |
Small Table Lookups, e.g.,
16-way parallel lookup |
Table-size: 16-256 elements
Element Size: byte |
2-19 |
20-2.5 |
? |
Sorting, e.g.,
Batcher Sort
Element Size: byte |
Input: Unsorted array
- 16 elements
Output: Sorted array |
76 / array |
10 |
? |
Input: 2 unsorted arrays
- 16 elements each
Output: 2 sorted arrays
- 16 elements each |
45.5 / array |
14 |
? |
Input: 2 sorted arrays
- 16 elements each
Output: 1 sorted array
- 32 elements |
31 |
2-4 |
? |
Gamma Correction
(ITU-R Recommendation 709) |
16 pixel values (0-255)
32-piecewise linear interpolation |
10 |
4 |
? |
Arbitrary 128-bit Permutation |
128-bit value |
20 |
? |
4 cycles per bit!! |
Associative Search |
Input:
Two 32 entry tables,
16b keys and 16b tags,
16b key to be looked up
Output:
16b tag |
13 |
5.8 |
|
Gauss Elimination
for linear system
Datatype: SP Float |
4 Variables |
478 |
1.19 |
|
8 Variables |
2170 |
1.15 |
|
12 Variables |
5104 |
1.31 |
|
16 Variables |
9824 |
1.42 |
|
Haar Transform (forward) |
Input: 8 2x2 byte pixel blocks
Output: 8 sets of 4 frequency bands
Band Elements: 16-bits |
12 |
? |
48 cycles |
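As an illustration of the Haar transform row, the sketch below maps one 2x2 byte pixel block to four 16-bit band values. The unnormalized sum/difference form and the band ordering are assumptions, since the table gives only the input and output sizes. The name `haar2x2` is hypothetical.

```c
#include <stdint.h>

/* Forward 2x2 Haar transform of one block laid out as
       px[0] px[1]
       px[2] px[3]
   Assumption: unnormalized sums and differences, so every result fits in
   16 bits (range -510..1020 for byte inputs). */
static void haar2x2(const uint8_t px[4], int16_t bands[4])
{
    int16_t a = px[0], b = px[1], c = px[2], d = px[3];

    bands[0] = (int16_t)(a + b + c + d); /* low-low: block sum            */
    bands[1] = (int16_t)(a - b + c - d); /* difference between columns    */
    bands[2] = (int16_t)(a + b - c - d); /* difference between rows       */
    bands[3] = (int16_t)(a - b - c + d); /* diagonal difference           */
}
```

Calling this once per block for the 8 input blocks gives the 8 sets of 4 bands listed as the output. A vector implementation would process all 8 blocks together.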