Problem description:

tl;dr: `double b = a - (size_t)a` is faster than `double b = a - trunc(a)`

I am implementing a rotation function for an image and I noticed that the trunc function seems to be awfully slow.

Here is the looping code for the image; the actual assignment of the pixels is commented out for the performance test, so I don't even access the pixels.

```cpp
double sina(sin(angle)), cosa(cos(angle));
int h = (int)(_in->h * cosa + _in->w * sina);
int w = (int)(_in->w * cosa + _in->h * sina);
int offsetx = (int)(_in->h * sina);
SDL_Surface* out = SDL_CreateARGBSurface(w, h); // wrapper over SDL_CreateRGBSurface
SDL_FillRect(out, NULL, 0x0); // transparent black
for (int y = 0; y < _in->h; y++)
    for (int x = 0; x < _in->w; x++) {
        // calculate the new position
        const double destY = y * cosa + x * sina;
        const double destX = x * cosa - y * sina + offsetx;
```

Here is the code using trunc:

```cpp
size_t tDestX = (size_t) trunc(destX);
size_t tDestY = (size_t) trunc(destY);
double left = destX - trunc(destX);
double top  = destY - trunc(destY);
```

And here is the faster equivalent:

```cpp
size_t tDestX = (size_t)(destX);
size_t tDestY = (size_t)(destY);
double left = destX - tDestX;
double top  = destY - tDestY;
```

The answers suggest not using trunc when converting back to an integral type, so I also tried that case:

```cpp
size_t tDestX = (size_t)(destX);
size_t tDestY = (size_t)(destY);
double left = destX - trunc(destX);
double top  = destY - trunc(destY);
```

The fast version seems to take an average of 30ms to go through the full image (2048x1200) while the slow version using trunc takes about 135ms for the same image. The version with only two calls to trunc is still much slower than the one without (about 100ms).

As far as I understand the C++ rules, both expressions should always return the same thing. Am I missing something here? destX and destY are declared const, so only one call should be made to trunc per variable, and even then that wouldn't explain the more-than-three-times slowdown by itself.

I'm compiling with Visual Studio 2013 with optimizations (/O2). Is there any reason to use the trunc function at all? Even for getting the fractional part, using an integer cast seems to be faster.

Answer:

On modern x86 CPUs, int <-> float conversions are quite fast - typically inline SSE code is generated for the conversion and the cost is on the order of a few instruction cycles.[1]

For trunc, however, a function call is required, and the function-call overhead alone is almost certainly greater than the cost of an inline float -> int conversion. Furthermore, the trunc function itself may be relatively costly - it has to be fully IEEE-754 compliant, so the full range of floating-point values has to be dealt with correctly, as do edge cases such as NaN, INF, denormals, and values which are out of range. So overall I would expect the cost of trunc to be on the order of tens of instruction cycles, i.e. an order of magnitude or so greater than the cost of an inline float -> int conversion.


[1] Note that float <-> int conversions are not always inexpensive - other CPU families, and even older x86 CPUs, may not have ISA support for such conversions, in which case a library function will normally be used, and the cost of this would be similar to that of trunc. Modern x86 CPUs are a special case in this regard.

Answer:

The way you're using it, there's no reason to use the trunc function at all. It transforms a double into a double, which you then cast to an integral type and throw away. The fact that the alternative is faster is not that surprising.
