tl;dr: `double b = a - (size_t)a;` is faster than `double b = a - trunc(a);`

I am implementing a rotation function for an image and I noticed that the `trunc` function seems to be awfully slow.

Here is the looping code for the image; the actual assignment of the pixels is commented out for the performance test, so I don't even access the pixels.

```cpp
double sina(sin(angle)), cosa(cos(angle));
int h = (int) (_in->h*cosa + _in->w*sina);
int w = (int) (_in->w*cosa + _in->h*sina);
int offsetx = (int) (_in->h*sina);
SDL_Surface* out = SDL_CreateARGBSurface(w, h); // wrapper over SDL_CreateRGBSurface
SDL_FillRect(out, NULL, 0x0); // transparent black
for (int y = 0; y < _in->h; y++)
    for (int x = 0; x < _in->w; x++)
    {
        // calculate the new position
        const double destY = y*cosa + x*sina;
        const double destX = x*cosa - y*sina + offsetx;
```

So here is the code using `trunc`

```cpp
size_t tDestX = (size_t) trunc(destX);
size_t tDestY = (size_t) trunc(destY);
double left = destX - trunc(destX);
double top = destY - trunc(destY);
```

And here is the faster equivalent

```cpp
size_t tDestX = (size_t)(destX);
size_t tDestY = (size_t)(destY);
double left = destX - tDestX;
double top = destY - tDestY;
```

The answers suggest not using `trunc` when converting to an integral type, so I also tried that case:

```cpp
size_t tDestX = (size_t) (destX);
size_t tDestY = (size_t) (destY);
double left = destX - trunc(destX);
double top = destY - trunc(destY);
```

The fast version seems to take an average of 30ms to go through the full image (2048x1200) while the slow version using `trunc` takes about 135ms for the same image. The version with only two calls to `trunc` is still much slower than the one without (about 100ms).

As far as I understand the C++ rules, both expressions should always return the same thing. Am I missing something here? `destX` and `destY` are declared `const`, so only one call should be made to the `trunc` function, and even that wouldn't explain the more-than-threefold slowdown by itself.

I'm compiling with Visual Studio 2013 with optimizations (/O2). Is there any reason to use the `trunc` function at all? Even for getting the fractional part, using an integer cast seems to be faster.

On modern x86 CPUs, int <-> float conversions are quite fast - typically inline SSE code is generated for the conversion, and the cost is of the order of a few instruction cycles.¹

For `trunc`, however, a function call is required, and the function call overhead alone is almost certainly greater than the cost of an inline float -> int conversion. Furthermore, the `trunc` function itself may be relatively costly: it has to be fully IEEE-754 compliant, so the full range of floating-point values has to be dealt with correctly, as do edge cases such as NaN, INF, denormals, and out-of-range values. So overall I would expect the cost of `trunc` to be of the order of tens of instruction cycles, i.e. an order of magnitude or so greater than the cost of an inline float -> int conversion.

¹ Note that float <-> int conversions are not always inexpensive: other CPU families, and even older x86 CPUs, may not have ISA support for such conversions, in which case a library function will normally be used, and its cost would be similar to that of `trunc`. Modern x86 CPUs are a special case in this regard.

The way you're using it, there's no reason for you to use the `trunc` function at all. It transforms a double into a double, which you then cast to an integral type and throw away. The fact that the alternative is faster is not that surprising.
