Problem description:

So here's the situation.

I have a block of threads running a while loop, and I need the loop to continue if and only if some condition is met by any of these threads. To do that I use a shared variable as the continue flag: the flag is cleared by thread #0 at the beginning of each iteration, followed by a __syncthreads(), and it can be set by any thread during the iteration if the continue condition is met. Another call to __syncthreads() is then placed before the check point of the next iteration, to make sure the threads are synchronized. The kernel is basically like this:

__global__ void foo(void* data) {
    __shared__ int blockContinueFlag;
    do {
        if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
            blockContinueFlag = 0;   // thread #0 clears the flag
        }
        __syncthreads(); // synch1
        // some data manipulations...
        if (some predicate) {
            blockContinueFlag = true;   // any thread may set the flag
        }
        // some data manipulations...
        __syncthreads(); // synch2
    } while (blockContinueFlag);
}

The problem is that the barrier synch2 doesn't seem to work in my code: sometimes the kernel terminates even when the continue condition is met by some threads (I know this by checking the returned data on the host side). To examine this further, I set a breakpoint just after the do-while loop, as in the following code, where blockContinueFlag is sometimes reported as true (I can only assume the block exited the loop before some threads could set blockContinueFlag).

__global__ void foo(void* data) {
    __shared__ int blockContinueFlag;
    do {
        if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
            blockContinueFlag = 0;
        }
        __syncthreads(); // synch1
        // some data manipulations...
        if (some predicate) {
            blockContinueFlag = true;
        }
        // some data manipulations...
        __syncthreads(); // synch2
    } while (blockContinueFlag);
    // a breakpoint is set here
}

I remember reading in the CUDA manual that __syncthreads() is allowed inside conditional code as long as the condition evaluates identically across all threads of the block, which should be the case here.
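For reference, my understanding of that rule looks roughly like the sketch below. This is not my actual kernel; the names (uniformBranchBarrier, flag, in, out) are made up purely to illustrate when a barrier inside a branch is safe.

__global__ void uniformBranchBarrier(const int* in, int* out) {
    __shared__ int flag;
    if (threadIdx.x == 0)
        flag = (in[blockIdx.x] > 0);   // one thread decides; the value is block-uniform
    __syncthreads();

    if (flag) {
        // OK: flag has the same value for every thread in the block,
        // so either all threads reach this barrier or none do.
        __syncthreads();
        if (threadIdx.x == 0)
            out[blockIdx.x] = 1;
    }

    // NOT OK (undefined behavior): a __syncthreads() inside a branch on
    // threadIdx.x, because only part of the block would reach the barrier.
}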

Here is another, simplified version of the code, just as an illustration of the problem.

#include <cstdio>

__global__ void foo(int* data, int kernelSize, int threshold) {
    __shared__ int blockContinueFlag;
    do {
        if (threadIdx.x == 0) {
            blockContinueFlag = 0;
        }
        __syncthreads();
        if (threadIdx.x < kernelSize) {
            data[threadIdx.x]--;
            for (int i = 0; i < threadIdx.x; i++); // empty delay loop
            if (data[threadIdx.x] > threshold)
                blockContinueFlag = true;
        }
        __syncthreads();
    } while (blockContinueFlag);
}

int main()
{
    int hostData[1024], *deviceData;
    for (int i = 0; i < 1024; i++)
        hostData[i] = i;
    cudaMalloc(&deviceData, 1024 * sizeof(int));
    cudaMemcpy(deviceData, hostData, 1024 * sizeof(int), cudaMemcpyHostToDevice);
    foo<<<1, 1024>>>(deviceData, 512, 0);
    cudaDeviceSynchronize();
    cudaMemcpy(hostData, deviceData, 1024 * sizeof(int), cudaMemcpyDeviceToHost);
    fprintf(stderr, "%s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}

The expected value of hostData[] at the end of main() would be {-511, -510, -509, ..., 0, 512, 513, 514, ..., 1023}, which is sometimes the actual result. But in some cases it produces the following values (in VS 2013 debug mode):

[0]: -95
[1]: -94
...
[29]: -66
[30]: -65
[31]: -64
[32]: 31
[33]: 32
[34]: 33
...
[61]: 60
[62]: 61
[63]: 62
[64]: -31
[65]: -30
[66]: -29
...
[92]: -3
[93]: -2
[94]: -1
[95]: 0
[96]: 95
[97]: 96
[98]: 97
...

which suggests the warps are not actually synchronized.

So does anyone know the reason for this, and/or whether there is a way to make the thread barrier work correctly?

Any help would be appreciated. Thanks in advance.

Answer:

In your first example, you check the loop condition and then clear the flag in the same code fragment between two __syncthreads() calls. That's a write-after-read hazard. To better illustrate your problem, let me rewrite your example like this:

__global__ void foo(void* data) {
  __shared__ int blockContinueFlag;
  blockContinueFlag = true;
  while (true) {
    if (!blockContinueFlag)
        break;
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
        blockContinueFlag = 0;
    }
    __syncthreads(); //synch1
    //some data manipulations...
    if(some predicate) {
      blockContinueFlag = true;
    }
    //some data manipulations...
    __syncthreads(); //synch2
  }
}

In this example, the check of the flag and the loop break are more verbose, but it is essentially the same code (plus the redundant check at the very beginning).

In this example, as well as in your code, thread 0 may check the loop condition and then clear the flag before thread 33 (in another warp) performs its own check. This causes divergence, and all the evil gets loose.

To fix it, you need to add yet another __syncthreads() before clearing the flag.
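A minimal sketch of that fix, keeping the schematic structure of the kernel from the question (synch0 is the extra barrier; it guarantees that every thread has finished reading the flag from the previous iteration before thread #0 clears it):

__global__ void foo(void* data) {
    __shared__ int blockContinueFlag;
    do {
        __syncthreads(); // synch0: all threads have read the flag of the previous iteration
        if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
            blockContinueFlag = 0;
        }
        __syncthreads(); // synch1: the cleared flag is visible to all threads
        // some data manipulations...
        if (some predicate) {
            blockContinueFlag = true;
        }
        // some data manipulations...
        __syncthreads(); // synch2: all sets of the flag happen before the flag is read
    } while (blockContinueFlag);
}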

Answer:

So here's my solution, with one __syncthreads_or() instead of three __syncthreads(), as requested.

__global__ void foo(void* data) {
    int blockContinueFlag;   // now a per-thread local, not shared memory
    do {
        blockContinueFlag = 0;
        //some data manipulations...
        if(some predicate) {
            blockContinueFlag = true;
        }
        //some data manipulations...
    } while (__syncthreads_or(blockContinueFlag)); // barrier plus block-wide OR of all threads' flags
}

In practice this is slightly faster than the version with three __syncthreads() calls.

Thanks again for your posts.
