Problem description:

I am working on a language that is compiled with LLVM. Just for fun, I wanted to do some microbenchmarks. In one of them, I run a hundred million sin/cos computations in a loop. In pseudocode, it looks like this:

```
var x: Double = 0.0
for (i <- 0 to 100 000 000)
    x = sin(x)^2 + cos(x)^2
return x.toInteger
```

If I'm computing sin/cos using LLVM IR inline assembly of the form:

```
%sc = call { double, double } asm "fsincos", "={st(1)},={st},1,~{dirflag},~{fpsr},~{flags}" (double %"res") nounwind
```

this is faster than using `fsin` and `fcos` as two separate instructions. However, it is slower than calling the `llvm.sin.f64` and `llvm.cos.f64` intrinsics separately, which compile to calls to the C math library functions, at least with the target settings I'm using (x86_64 with SSE enabled).
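For reference, I believe the same fsincos idiom in C with GCC extended inline assembly would look roughly like this (my sketch; `=t` and `=u` name the top two x87 stack slots, matching the `={st}` and `={st(1)}` constraints above):

```c
/* Sketch: fsincos leaves cos(x) in st(0) and sin(x) in st(1). */
static inline void fsincos_wrap(double x, double *s, double *c)
{
    __asm__ ("fsincos" : "=t"(*c), "=u"(*s) : "0"(x));
}
```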

It seems LLVM inserts some conversions between single- and double-precision floating point, and that might be the culprit. Why does it do that? Sorry, I'm a relative newbie at assembly:

```
        .globl  main
        .align  16, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:                                 # %loopEntry1
        xorps   %xmm0, %xmm0
        movl    $-1, %eax
        jmp     .LBB44_1
        .align  16, 0x90
.LBB44_2:                               # %then4
                                        # in Loop: Header=BB44_1 Depth=1
        movss   %xmm0, -4(%rsp)
        flds    -4(%rsp)
        #APP
        fsincos
        #NO_APP
        fstpl   -16(%rsp)
        fstpl   -24(%rsp)
        movsd   -16(%rsp), %xmm0
        mulsd   %xmm0, %xmm0
        cvtsd2ss %xmm0, %xmm1
        movsd   -24(%rsp), %xmm0
        mulsd   %xmm0, %xmm0
        cvtsd2ss %xmm0, %xmm0
        addss   %xmm1, %xmm0
.LBB44_1:                               # %loop2
                                        # =>This Inner Loop Header: Depth=1
        incl    %eax
        cmpl    $99999999, %eax         # imm = 0x5F5E0FF
        jle     .LBB44_2
# BB#3:                                 # %break3
        cvttss2si %xmm0, %eax
        ret
.Ltmp160:
        .size   main, .Ltmp160-main
        .cfi_endproc
```

The same test with calls to the LLVM sin/cos intrinsics:

```
        .globl  main
        .align  16, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:                                 # %loopEntry1
        pushq   %rbx
.Ltmp162:
        .cfi_def_cfa_offset 16
        subq    $16, %rsp
.Ltmp163:
        .cfi_def_cfa_offset 32
.Ltmp164:
        .cfi_offset %rbx, -16
        xorps   %xmm0, %xmm0
        movl    $-1, %ebx
        jmp     .LBB44_1
        .align  16, 0x90
.LBB44_2:                               # %then4
                                        # in Loop: Header=BB44_1 Depth=1
        movsd   %xmm0, (%rsp)           # 8-byte Spill
        callq   cos
        mulsd   %xmm0, %xmm0
        movsd   %xmm0, 8(%rsp)          # 8-byte Spill
        movsd   (%rsp), %xmm0           # 8-byte Reload
        callq   sin
        mulsd   %xmm0, %xmm0
        addsd   8(%rsp), %xmm0          # 8-byte Folded Reload
.LBB44_1:                               # %loop2
                                        # =>This Inner Loop Header: Depth=1
        incl    %ebx
        cmpl    $99999999, %ebx         # imm = 0x5F5E0FF
        jle     .LBB44_2
# BB#3:                                 # %break3
        cvttsd2si %xmm0, %eax
        addq    $16, %rsp
        popq    %rbx
        ret
.Ltmp165:
        .size   main, .Ltmp165-main
        .cfi_endproc
```

Can you suggest what the ideal assembly would look like with fsincos? PS: adding `-enable-unsafe-fp-math` to llc makes the conversions disappear and switches everything to doubles (fldl etc.), but the speed remains the same.

```
        .globl  main
        .align  16, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:                                 # %loopEntry1
        xorps   %xmm0, %xmm0
        movl    $-1, %eax
        jmp     .LBB44_1
        .align  16, 0x90
.LBB44_2:                               # %then4
                                        # in Loop: Header=BB44_1 Depth=1
        movsd   %xmm0, -8(%rsp)
        fldl    -8(%rsp)
        #APP
        fsincos
        #NO_APP
        fstpl   -24(%rsp)
        fstpl   -16(%rsp)
        movsd   -24(%rsp), %xmm1
        mulsd   %xmm1, %xmm1
        movsd   -16(%rsp), %xmm0
        mulsd   %xmm0, %xmm0
        addsd   %xmm1, %xmm0
.LBB44_1:                               # %loop2
                                        # =>This Inner Loop Header: Depth=1
        incl    %eax
        cmpl    $99999999, %eax         # imm = 0x5F5E0FF
        jle     .LBB44_2
# BB#3:                                 # %break3
        cvttsd2si %xmm0, %eax
        ret
.Ltmp160:
        .size   main, .Ltmp160-main
        .cfi_endproc
```

Answer:

Too many documents claim that x87 instructions like `fsin` or `fsincos` are the fastest way to compute trigonometric functions. Those claims are often wrong.

The fastest way depends on your CPU. As CPUs become faster, old hardware trig instructions like `fsin` have not kept pace. On some CPUs, a software function using a polynomial approximation for sine or another trig function is now faster than the hardware instruction.
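To make "a polynomial approximation" concrete, here is a toy Horner-form sine kernel (my illustrative sketch, not libm's actual code; real kernels use minimax coefficients and careful range reduction):

```c
/* Toy kernel: sin(x) ~= x - x^3/6 + x^5/120 - x^7/5040,
   reasonable only near [-pi/4, pi/4]. */
static double poly_sin(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
                + x2 * (1.0 / 120.0
                + x2 * (-1.0 / 5040.0))));
}
```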

In short, `fsincos` is too slow.

There is enough evidence that the x86-64 platform has moved away from hardware trig.

- amd64 prefers SSE over x87 for floats, yet SSE has no equivalents for x87 instructions like `fsin`.
- For amd64, libm in both FreeBSD and glibc implements sin() and similar functions in software, not with x87 trig. glibc even has optimized x86-64 assembly for sinf() (the single-precision sine) using a polynomial approximation, not x87 `fsin`. NetBSD and OpenBSD made the opposite choice: their libm for amd64 does use the x87 instructions. (See also the sincos() sketch after this list.)
- Steel Bank Common Lisp uses `fsin` in its x86 backend but not in its x86-64 backend; for x86-64, SBCL compiles code that calls sin() in libm.
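As an aside: when you need both sine and cosine of the same argument, as in the question's benchmark, glibc offers sincos() as a GNU extension that returns both from one call. A minimal usage sketch:

```c
#define _GNU_SOURCE   /* sincos() is a GNU extension */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double s, c;
    sincos(0.5, &s, &c);              /* one call, both results */
    printf("%f\n", s * s + c * c);    /* should print 1.000000 */
    return 0;
}
```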

I timed hardware and software sine on an AMD Phenom II X2 560 (3.3 GHz) from 2010. I wrote a C program with this loop:

```
volatile double a, s;
/* ... */
for (i = 0; i < 100000000; i++)
s = sin(a);
```

I compiled this program twice, with two different implementations of sin(). The hard sin() uses x87 `fsin`; the soft sin() uses a polynomial approximation. My C compiler, `gcc -O2`, did not replace my sin() call with an inline `fsin`.
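Such a hard sin() can be a one-instruction wrapper around `fsin`; here is a sketch of the shape it can take, using GCC extended asm (my assumption, not necessarily the exact code used for the timings):

```c
/* Sketch: "hard" sine via the x87 fsin instruction.
   "=t" binds the result to st(0); "0"(x) loads x there first. */
static double hard_sin(double x)
{
    double r;
    __asm__ ("fsin" : "=t"(r) : "0"(x));
    return r;
}
```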

Here are results for sin(0.5):

```
$ time race-hard 0.5
0m3.40s real 0m3.40s user 0m0.00s system
$ time race-soft 0.5
0m1.13s real 0m1.15s user 0m0.00s system
```

Here, soft sin(0.5) is so fast that this CPU would do a soft sin(0.5) and a soft cos(0.5) together faster than one x87 `fsin`.

And for sin(123):

```
$ time race-hard 123
0m3.61s real 0m3.62s user 0m0.00s system
$ time race-soft 123
0m3.08s real 0m3.07s user 0m0.01s system
```

Soft sin(123) is slower than soft sin(0.5) because 123 is too large for the polynomial, so the function must first subtract some multiple of 2π. If I also want cos(123), there is a chance that x87 `fsincos` would be faster than soft sin(123) plus soft cos(123), for this CPU from 2010.
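To illustrate that reduction step, here is a naive sketch (real libm code uses an extended-precision representation of 2π, e.g. Payne-Hanek reduction, because a plain remainder() loses accuracy for very large arguments):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Naive reduction of 123 into [-pi, pi] before a polynomial kernel. */
    double x = 123.0;
    double r = remainder(x, 2.0 * M_PI);
    printf("reduced: %f, sin(r): %f, sin(x): %f\n", r, sin(r), sin(x));
    return 0;
}
```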