The Mac Set

AltiVec Performance Comparison Table

Media Function	Data Set	AltiVec G4 Cycles (SIM_G4 Verified ⁽¹⁾)	Speedup (vs. Optimized C PowerPC)	Intel MMX (Pentium)
Video : H.263 Functions
8x8 Forward DCT	(Scaled Chen Algorithm) Input: 8x8 image (diff) pixel block Data Size: 16-bits	100	11.4	?
8x8 Forward DCT	(Lee Huang Algorithm) Input: SP Float Output: 32b integer	252 ⁽²⁾	3.6 ⁽²⁾	?
8x8 Inverse DCT (Scaled Chen Algorithm)	Input/Output: 8x8 image block Data Size: 16-bits	101.7 per block	12.3	240 Cycles AAN Algo.
Motion Estimation	176x144 pixels image block Data Size: Bytes	90.7 per 16x16 macroblock	16	2x over Scalar
Quantization	Input: 8x8 DCT output block INTER macroblock only Input Data Size: 16-bits Output Data Size: bytes	96.8	12.5	?
Dequantization	Input: 8x8 block from VLC decode INTER macroblock only Input Data Size: bytes Output Data Size: 16-bits	44	11	?
Color Space Conversion (RGB <-> YCbCr) (CCIR601 standard)	RGB -> YCbCr Input/Output Data Size: bytes	2.3 / pixel	9.6	?
Color Space Conversion (RGB <-> YCbCr) (CCIR601 standard)	YCbCr -> RGB Input/Output Data Size: bytes	2.24 / pixel	7	?
Audio : Dolby AC3 Functions
Inverse FFT Bailey's Algorithm	64 complex taps 128 SP Floats	603	3.6	?
Inverse FFT Bailey's Algorithm	128 complex taps 256 SP Floats	1700	3.5	?
IMDCT Function (includes: i) IFFT, ii) pre-processing and iii) post-processing functions)	Short blocks 256 SP Floats	2008	4	?
	Normal blocks 256 SP Floats	2526	3.8	?
Windowing	Input: 256 SP Floats (IFFT Stage Output) Output: 256 halfwords: PCM Output Delay Buffer	834 / kernel	4.9	?
3D Graphics
Note: Scalar C code obtained from the Mesa library
Matrix-Vector Multiplication Datatype: SP Float	Input: 4x4 matrix and one 4-element vector	17	3.7	?
Matrix-Vector Multiplication Datatype: SP Float	Input: 4x4 matrix and multiple 4-element vectors	7.5 per vector	8.0	?
Matrix-Matrix Multiplication	Input: 2 4x4 matrices of SP Floats Output: 4x4 matrix of SP Floats	36.5	6.2	?
Bresenham Line Drawing (strictly serial algorithm due to OpenGL interface; much better parallelizable along lines)	Input: x,y coordinates of 2 points: 16 bits Output: 8 consecutive points on same line	3.23 / pixel	1.5-2.1 (depending on slope)	?
Line Color Interpolation (most time-consuming OpenGL kernel in wireframe animation; 24-bit precision, OpenGL: 19 bits)	Input: Initial and Delta Color Vectors [R G B A] Output: Interpolated Color Vectors	2.71 / pixel	2.9	?
Buffer Accumulation (used in anti-aliasing)	Input: Pixel vectors [R G B A]: bytes Four scale factors: SP Floats Input-Output: Per Pixel Accumulate vectors [R' G' B' A']: halfwords	5.3 / pixel	17.5	?
Line Clipping (2D) (Liang-Barsky Algorithm)	Input: clip region coordinates: SP Floats set of line vectors (x0,y0,x1,y1): SP Floats Output: set of clipped line vectors (x0,y0,x1,y1): SP Floats	28.5 / line	6.6	?
Bezier Curve Drawing (de Casteljau's Algorithm)	Input: 4 curve control points (x,y): halfwords Output: 64 points of the same curve: halfwords	2.48 / output point	6.3	?
Image Effects
Separable Convolution (3x3) 16-bit kernel coefficients	128 x 128 pixels	1.09 / pixel	?	Video Loop Filter: 5.5
	256 x 256 pixels	1.93 / pixel	?	?
	512 x 512 pixels	1.94 / pixel	?	?
	1024 x 1024 pixels	2.25 / pixel	?	?
Color Space Conversion RGB to YUV	4800 pixels	2.25 / pixel	?	8 cycles/pixel
Bilinear Interpolation (part of texture mapping: pixel color determination)	128 x 128 pixels	26.7 / pixel	6.4	66 cycles/pixel
Median Filter (3x3) (replaces center pixel in a 3x3 window by the median of the sorted pixels)	128 x 128 pixels	1.23 / pixel	?	415 (!) cycles/pixel
L-Filter (3x3): Order Statistic Filter, 16-bit kernel coefficients (replaces center pixel in a 3x3 window by a weighted sum of the sorted pixels)	128 x 128 pixels	5.3 / pixel	?	?
Communications (Modems and Telephony)
Galois Field Multiplication	Input: Multiplicands in add form: Output: Result in add form	(per multiply) GF(16): 0.625 GF(256): 2.125	GF(16): 16 GF(256): 5	?
64-QAM Demodulator : Bit Packing Transform vector of complex symbols (mapped to final form) to contiguous bitstream	Input: 32 halfwords each: 4-bits of I and Q comp for 2 symbols Output: 12 words = 384 bits	60	7.5	?
CRC-32: Standard Algorithm	Input: 128b data Output: 32b CRC	96	1
CRC-32: Kaplan's Algorithm	Input: 128b data Output: 32b CRC	21	2.5?
Linear Prediction (LP) using Levinson-Durbin (LD) Datatype: SP Float MMX Notes: 16b fixed, A different flavor of LD	4 LP Coeffs	102	2.48	390?
	8 LP Coeffs	234	2.73	944?
	12 LP Coeffs	388	3.16	1666?
	16 LP Coeffs	569	3.42	2552?
Linear Prediction (LP) using Schur Recursion Datatype: SP Float MMX: 16b fixed	4 LP Coeffs	64	4.78	299
	8 LP Coeffs	142	6.06	746
	12 LP Coeffs	238	7.08	1334
	16 LP Coeffs	366	7.61	2061
Autocorrelation: Input: 256 unsigned byte signal samples Output: 32b coeffs	4 Coeffs	276	18.1
	8 Coeffs	407	22.8
	12 Coeffs	543	25.6
	16 Coeffs	676	30.7
Long-Term Parameter Computation (GSM Module Number: 4.2.11)	Input Datatype: signed 16b Input Data: 60600 samples	1034	12.5
Miscellaneous Small Kernels
Small Table Lookups, e.g., 16-way parallel lookup	Table-size: 16-256 elements Element Size: byte	2-19	20-2.5	?
Sorting, e.g., Batcher Sort Element Size: byte	Input: Unsorted array - 16 elements Output: Sorted array	76 / array	10	?
	Input: 2 unsorted arrays - 16 elements each Output: 2 sorted arrays - 16 elements each	45.5 / array	14	?
	Input: 2 sorted arrays - 16 elements each Output: 1 sorted array - 32 elements	31	2-4	?
Gamma Correction (ITU-R Recommendation 709)	16 pixel values (0-255) 32-piecewise linear interpolation	10	4	?
Arbitrary 128-bit Permutation	128-bit value	20	?	4 cycles per bit!!
Associative Search	Input: Two 32-entry tables, 16b keys and 16b tags, 16b key to be looked up Output: 16b tag	13	5.8
Gauss Elimination for linear system Datatype: SP Float	4 Variables	478	1.19
	8 Variables	2170	1.15
	12 Variables	5104	1.31
	16 Variables	9824	1.42
Haar Transform (forward)	Input: 8 2x2 byte pixel blocks Output: 8 sets of 4 frequency bands Band Elements: 16-bits	12	?	48 cycles

¹ Instruction class latencies: permute: 2, simple-fixed: 1, complex-fixed: 2, float: 3; L1 Size: 32KB each for D-cache and I-cache; L1-L2 interface: 1/2 processor clock, 128b data transfer; L2-Mem interface: 1/4 processor clock, 128b data transfer; L2 Size: 2MB; L2 Latency: 6-2 processor clocks; Memory Type: SDRAM; Mem Latency: 20-4 (44-4) processor clocks for page hit (miss).

² Compiler-claimed cycle count; not yet SIM_G4 verified
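
The table lists AltiVec cycle counts and speedups but not the scalar cycle counts they were compared against. As a rough sketch, assuming the speedup column is the ratio of optimized scalar C cycles to AltiVec cycles (the table does not state this explicitly), the scalar baseline can be estimated by multiplying the two columns. The names perf_row, implied_scalar_cycles, sample_rows and print_implied_baselines below are hypothetical and used only for illustration; the numbers are copied from rows of the table.

```c
#include <assert.h>
#include <stdio.h>

/* One table row: kernel name, AltiVec cycle count, and listed speedup. */
struct perf_row {
    const char *kernel;
    double altivec_cycles;
    double speedup;
};

/* ASSUMPTION (not stated in the table):
 * speedup = scalar cycles / AltiVec cycles,
 * so the scalar baseline is the product of the two columns. */
double implied_scalar_cycles(double altivec_cycles, double speedup)
{
    assert(altivec_cycles > 0.0 && speedup > 0.0);
    return altivec_cycles * speedup;
}

/* Rows copied from the table (single-valued entries only). */
const struct perf_row sample_rows[] = {
    { "8x8 Forward DCT (Scaled Chen)",   100.0, 11.4 },
    { "Dequantization",                   44.0, 11.0 },
    { "Matrix-Matrix Multiplication",     36.5,  6.2 },
    { "64-QAM Demodulator: Bit Packing",  60.0,  7.5 },
};

/* Print the estimated scalar baseline for each sample row. */
void print_implied_baselines(void)
{
    size_t n = sizeof sample_rows / sizeof sample_rows[0];
    for (size_t i = 0; i < n; i++) {
        const struct perf_row *r = &sample_rows[i];
        printf("%-34s %7.1f x %4.1f = about %7.1f scalar cycles\n",
               r->kernel, r->altivec_cycles, r->speedup,
               implied_scalar_cycles(r->altivec_cycles, r->speedup));
    }
}
```

Calling print_implied_baselines() from a main function prints one line per row. For the Dequantization row, for instance, 44 cycles at a speedup of 11 implies roughly 484 scalar cycles. Rows whose speedup is marked "?" or given as a range cannot be estimated this way.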
