If you read back to my comment in that thread, I explicitly touched on the GPU vs CPU issue. They were benchmarking and ranking algorithms based on CPU performance while supposedly investigating the problem for GPU implementation. Pretty silly of them, but that's beside the point. CUDA and OpenCL, and even shader languages like HLSL, GLSL and Cg, all compile down to branchless conditional move (select) instructions. GPUs have had them going back to the days before GPU cores had branching at all. That's what you should use here, not some homegrown bit-bashing crap. For the higher-level languages like CUDA and OpenCL, the compilers have no problem generating branchless conditional moves from C code that uses branches in a simple and transparent way.
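To make that concrete, here's a rough sketch in plain C. The min-select and the function names are just my stand-ins for illustration, not the actual code from that thread. The first function is written the obvious way with a branch; the second is the kind of bit-bashing I'm arguing against. With optimization turned on, mainstream compilers will typically turn the first one into a conditional move or select (cmov on x86, csel on ARM64, selp in PTX), though you should check the generated assembly for your own target rather than take my word for it.

```c
/* Hypothetical example: pick the smaller of two ints.
 * Written with a plain branch; optimizing compilers usually lower
 * this to a conditional move / select instead of a real jump. */
static int select_min(int a, int b)
{
    if (a < b)
        return a;
    return b;
}

/* The "homegrown bit-bashing" alternative, for comparison.
 * (a < b) is 0 or 1, so -(a < b) is a mask of all zeros or all ones.
 * Same result as select_min, but harder to read and no faster
 * when a conditional move is available. */
static int select_min_bits(int a, int b)
{
    int mask = -(a < b);
    return b ^ ((a ^ b) & mask);
}
```

Both return the same value for every input; the only difference is that the first one tells the compiler (and the next reader) what you actually mean.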
The point is that the 'simple and optimal instruction sequence' I mentioned in my top post has a 1:1 equivalent on every modern CPU and GPU.