Question:

I am currently getting started with SSE.

The answer to my previous question regarding SSE (Multiplying vector by constant using SSE) brought me to the idea of testing the difference between using intrinsics like _mm_mul_ps() and just using 'normal operators' (not sure what the best term is) like *.

So I wrote two test cases which only differ in the way the result is calculated:

Method 1:

int main(void){
    float4 a, b, c;

    /* note: _mm_set_ps() takes its arguments highest lane first,
       so in memory order a is (4, 3, 2, 1) and b is (-4, -3, -2, -1) */
    a.v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    b.v = _mm_set_ps(-1.0f, -2.0f, -3.0f, -4.0f);

    printf("method 1\n");
    c.v = a.v + b.v; // <---

    print_vector(a);
    print_vector(b);
    printf("1.a) Computed output 1: ");
    print_vector(c);

    exit(EXIT_SUCCESS);
}

Method 2:

int main(void){
    float4 a, b, c;

    a.v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    b.v = _mm_set_ps(-1.0f, -2.0f, -3.0f, -4.0f);

    printf("\nmethod 2\n");
    c.v = _mm_add_ps(a.v, b.v); // <---

    print_vector(a);
    print_vector(b);
    printf("1.b) Computed output 2: ");
    print_vector(c);

    exit(EXIT_SUCCESS);
}

Both test cases share the following:

#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>

typedef union float4{
    __m128 v;
    struct { float x, y, z, w; }; /* anonymous struct: a bare "float x,y,z,w;"
                                     would make all four alias the first lane */
} float4;

void print_vector (float4 v){
    printf("%f,%f,%f,%f\n", v.x, v.y, v.z, v.w);
}
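(As an aside, the union is only one way to get at the individual lanes. A minimal alternative sketch, reusing the includes above and the standard SSE intrinsic _mm_storeu_ps to copy the register into a plain array; the helper name print_m128 is my own:)

void print_m128 (__m128 v){
    float f[4];          /* scratch array on the stack */
    _mm_storeu_ps(f, v); /* unaligned store of the four lanes */
    printf("%f,%f,%f,%f\n", f[0], f[1], f[2], f[3]);
}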

So, to compare the code generated for both cases, I compiled using:

gcc -ggdb -msse -c t_vectorExtensions_method1.c

Which resulted in the following (showing only the part where the two vectors are added, which is where the two differ):

Method 1:

 c.v = a.v + b.v;
  a1: 0f 57 c9       xorps  %xmm1,%xmm1
  a4: 0f 12 4d d0    movlps -0x30(%rbp),%xmm1
  a8: 0f 16 4d d8    movhps -0x28(%rbp),%xmm1
  ac: 0f 57 c0       xorps  %xmm0,%xmm0
  af: 0f 12 45 c0    movlps -0x40(%rbp),%xmm0
  b3: 0f 16 45 c8    movhps -0x38(%rbp),%xmm0
  b7: 0f 58 c1       addps  %xmm1,%xmm0
  ba: 0f 13 45 b0    movlps %xmm0,-0x50(%rbp)
  be: 0f 17 45 b8    movhps %xmm0,-0x48(%rbp)

Method 2:

 c.v = _mm_add_ps(a.v, b.v);
  a1: 0f 57 c0       xorps  %xmm0,%xmm0
  a4: 0f 12 45 a0    movlps -0x60(%rbp),%xmm0
  a8: 0f 16 45 a8    movhps -0x58(%rbp),%xmm0
  ac: 0f 57 c9       xorps  %xmm1,%xmm1
  af: 0f 12 4d b0    movlps -0x50(%rbp),%xmm1
  b3: 0f 16 4d b8    movhps -0x48(%rbp),%xmm1
  b7: 0f 13 4d f0    movlps %xmm1,-0x10(%rbp)
  bb: 0f 17 4d f8    movhps %xmm1,-0x8(%rbp)
  bf: 0f 13 45 e0    movlps %xmm0,-0x20(%rbp)
  c3: 0f 17 45 e8    movhps %xmm0,-0x18(%rbp)

/* Perform the respective operation on the four SPFP values in A and B. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
  c7: 0f 57 c0       xorps  %xmm0,%xmm0
  ca: 0f 12 45 e0    movlps -0x20(%rbp),%xmm0
  ce: 0f 16 45 e8    movhps -0x18(%rbp),%xmm0
  d2: 0f 57 c9       xorps  %xmm1,%xmm1
  d5: 0f 12 4d f0    movlps -0x10(%rbp),%xmm1
  d9: 0f 16 4d f8    movhps -0x8(%rbp),%xmm1
  dd: 0f 58 c1       addps  %xmm1,%xmm0
  e0: 0f 13 45 90    movlps %xmm0,-0x70(%rbp)
  e4: 0f 17 45 98    movhps %xmm0,-0x68(%rbp)

Obviously the code generated when using the intrinsic _mm_add_ps() is much larger. Why is this? Shouldn't it result in better code?

Answer:

All that really matters is the addps. In a more realistic use case, where you might be, say, adding two large vectors of floats in a loop, the body of the loop will just contain addps, two loads and a store, and some scalar integer instructions for address arithmetic. On a modern superscalar CPU many of these instructions will execute in parallel.
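For illustration, here is a minimal sketch of such a loop; the function name add_arrays and the assumption that n is a multiple of 4 are mine, not from the answer:

#include <stddef.h>
#include <xmmintrin.h>

/* c[i] = a[i] + b[i] for i in [0, n); assumes n is a multiple of 4 */
void add_arrays (float *c, const float *a, const float *b, size_t n){
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);          /* load four floats from a */
        __m128 vb = _mm_loadu_ps(b + i);          /* load four floats from b */
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb)); /* one addps, one store */
    }
}

Compiled with optimisation enabled, the loop body should boil down to the two loads, the addps, the store, and the address/counter arithmetic described above.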

Note also that you're compiling with optimisation disabled, so you won't get particularly efficient code: at -O0, GCC keeps every variable on the stack, so each statement is bracketed by loads and stores (the movlps/movhps pairs above). Try gcc -O3 -msse3 ....
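For example, to inspect the optimised assembly for the loop sketch above (the file name is made up):

gcc -O3 -msse3 -S add_arrays.c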
