I decided to continue Fast corners optimisation and stucked at

`_mm_movemask_epi8` SSE instruction. How can i rewrite it for ARM Neon with `uint8x16_t` input?

I know this post is quite outdated but I found it useful to give my (validated) solution. It assumes all ones/all zeroes in every lane of the Input argument.

``````const uint8_t __attribute__ ((aligned (16))) _Powers=
{ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };

// Set the powers of 2 (do it once for all, if applicable)
uint8x16_t Powers= vld1q_u8(_Powers);

// Compute the mask from the input
uint64x2_t Mask= vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(Input, Powers))));

// Get the resulting bytes
uint16_t Output;
vst1q_lane_u8((uint8_t*)&Output + 0, (uint8x16_t)Mask, 0);
vst1q_lane_u8((uint8_t*)&Output + 1, (uint8x16_t)Mask, 8);
``````

(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)

Similarly to Michael, the trick is to form the powers of the indexes of the non-null entries, and to sum them pairwise three times. This must be done with increasing data size to double the stride on every addition. You reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of these two numbers gives the solution. I don't think there is an easy way to pack them together to form a single short value using NEON.

Takes 6 NEON instructions if the input is in the suitable form and the powers can be preloaded.

Note that I haven't tested any of this, but something like this might work:

``````X := the vector that you want to create the mask from
A := 0x808080808080...
B := 0x00FFFEFDFCFB...  (i.e. 0,-1,-2,-3,...)

X = vand_u8(X, A);  // Keep d7 of each byte in X
X = vshl_u8(X, B);  // X>>=0; X>>=1; X>>=2; ...
// Each byte of X now contains its msb shifted 7-N bits to the right, where N
// is the byte index.
// Do 3 pairwise adds in order to pack all these into X
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
// X should now contain the mask. Clear the remaining bytes if necessary
``````

This would need to be repeated once to process a 128-bit vector, since `vpadd` only works on 64-bit vectors.

after some tests it looks like following code works correct:

``````int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
const int8_t __attribute__ ((aligned (16))) xr = {-7,-6,-5,-4,-3,-2,-1,0};
uint8x8_t mask_and = vdup_n_u8(0x80);
int8x8_t mask_shift = vld1_s8(xr);

uint8x8_t lo = vget_low_u8(input);
uint8x8_t hi = vget_high_u8(input);

lo = vand_u8(lo, mask_and);
lo = vshl_u8(lo, mask_shift);

hi = vand_u8(hi, mask_and);
hi = vshl_u8(hi, mask_shift);

lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);

hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);

return ((hi << 8) | (lo & 0xFF));
}
``````

Top