The Steam Hardware & Software Survey as of September 2013 shows that 68% of users’ CPUs support SSE4.1, leaving 32% still on SSE3 or less. It also shows that 99.7% of users’ CPUs support SSE3. So what this tells me is that SSE3 is where it’s at: if you want your game to run on most CPUs, you won’t be able to use SSE4.1 intrinsic functions.

Problem is, SSE4.1 is awesome. It’s awesome for many reasons, but let’s focus on one of them: **_mm_round_ps()**. You can use this intrinsic, which yields a single ‘**roundps**‘ instruction, for computing math **Floor**, **Ceil** and – as its name suggests – **Round**.

Well, this is great and all, but what can we do without SSE4.1 instructions? You can always use the standard C math.h **floor()**, **ceil()** and your own flavor of **round()**, but if you want to do some intensive and fast vector math in your game, SSE is your savior. If you do a search on Google for something like ‘sse floor’, you’ll probably get a lot of wrong algorithms. Most of them won’t work for negative integer values, flooring -10 to -11 for instance. So I took some time to figure out an algorithm using only SSE3 instructions.

```cpp
inline __m128 _mm_floor_ps2(const __m128& x)
{
    __m128i v0  = _mm_setzero_si128();
    __m128i v1  = _mm_cmpeq_epi32(v0, v0);                  // all bits set
    __m128i ji  = _mm_srli_epi32(v1, 25);
    __m128  j   = _mm_castsi128_ps(_mm_slli_epi32(ji, 23)); // create vector 1.0f
    __m128i i   = _mm_cvttps_epi32(x);
    __m128  fi  = _mm_cvtepi32_ps(i);                       // truncate toward zero
    __m128  igx = _mm_cmpgt_ps(fi, x);                      // negative non-integers overshoot...
    j = _mm_and_ps(igx, j);
    return _mm_sub_ps(fi, j);                               // ...so subtract 1.0f from those lanes
}

inline __m128 _mm_ceil_ps2(const __m128& x)
{
    __m128i v0  = _mm_setzero_si128();
    __m128i v1  = _mm_cmpeq_epi32(v0, v0);
    __m128i ji  = _mm_srli_epi32(v1, 25);
    __m128  j   = _mm_castsi128_ps(_mm_slli_epi32(ji, 23)); // create vector 1.0f
    __m128i i   = _mm_cvttps_epi32(x);
    __m128  fi  = _mm_cvtepi32_ps(i);                       // truncate toward zero
    __m128  igx = _mm_cmplt_ps(fi, x);                      // positive non-integers undershoot...
    j = _mm_and_ps(igx, j);
    return _mm_add_ps(fi, j);                               // ...so add 1.0f to those lanes
}

inline __m128 _mm_round_ps2(const __m128& a)
{
    __m128 v0 = _mm_setzero_ps();
    __m128 v1 = _mm_cmpeq_ps(v0, v0);
    // generate the highest value < 2
    __m128 vNearest2  = _mm_castsi128_ps(_mm_srli_epi32(_mm_castps_si128(v1), 2));
    __m128i i         = _mm_cvttps_epi32(a);
    __m128  aTrunc    = _mm_cvtepi32_ps(i);         // truncate a
    __m128  rmd       = _mm_sub_ps(a, aTrunc);      // get remainder
    __m128  rmd2      = _mm_mul_ps(rmd, vNearest2); // mul remainder by near 2 will yield the needed offset
    __m128i rmd2i     = _mm_cvttps_epi32(rmd2);     // after being truncated of course
    __m128  rmd2Trunc = _mm_cvtepi32_ps(rmd2i);
    return _mm_add_ps(aTrunc, rmd2Trunc);
}
```

**Edit:** Special thanks to obyzouth, who worked out better SSE code for the floor and ceil functions. 🙂 It does not handle NaNs and infinities, but those don’t have to be handled, they have to be eradicated. A good NaN is a non-existent NaN. You should use functions that handle them in your debug build though.

As you can see, I use a conversion to **int** and back to **float** to round the value. This will not work for values that cannot be represented by an **int**. If you need these functions to handle those kinds of values, you might have to reconsider using **float** to begin with, as it loses quite a lot of precision in those ranges. But for the sake of absolute safety, here’s what you can do:

```cpp
template< __m128 (FuncT)(const __m128&) >
inline __m128 _mm_safeInt_ps(const __m128& a)
{
    __m128 v8388608 = _mm_castsi128_ps(_mm_set1_epi32(0x4b000000)); // vector with value 8388608.0f
    __m128 aAbs     = _mm_and_ps(a, _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff))); // Abs(a)
    __m128 aMask    = _mm_cmpgt_ps(aAbs, v8388608); // if Abs(a) > 8388608
    // select a if greater than 8388608.0f, otherwise select the result of FuncT
    return _mm_or_ps(_mm_and_ps(aMask, a), _mm_andnot_ps(aMask, FuncT(a)));
}

...

// then call your functions like so:
_mm_safeInt_ps<_mm_floor_ps2>( ... );
_mm_safeInt_ps<_mm_ceil_ps2 >( ... );
_mm_safeInt_ps<_mm_round_ps2>( ... );
```

8388608 (2^23) is the lowest positive float value from which the spacing between consecutive floats reaches 1, imprecision increasing with the value, so no float at or above it can have a fractional part. **floor()**/**ceil()**/**round()** will therefore return the same value they receive for numbers greater than or equal to that.

**Bonus!** Vector equivalent of **fmod()**:

```cpp
inline __m128 _mm_mod_ps2(const __m128& a, const __m128& aDiv)
{
    __m128  c      = _mm_div_ps(a, aDiv);
    __m128i i      = _mm_cvttps_epi32(c);
    __m128  cTrunc = _mm_cvtepi32_ps(i);
    __m128  base   = _mm_mul_ps(cTrunc, aDiv);
    __m128  r      = _mm_sub_ps(a, base);
    return r;
}
```
