The Nullman

The SIMD guide for C and C++ programmers who have no idea what they're doing
By The Nullman
The basics
SIMD stands for Single Instruction, Multiple Data. It's a beyond-scalar, high-performance register technology given to us by The Lord! praise be! The following code illustrates the scalar approach:
// Scalar
float nums[4] = {1.f, 2.f, 3.f, 4.f};  // example values
float nums2[4] = {5.f, 6.f, 7.f, 8.f}; // example values
float nums3[4]; // Empty
for (int i = 0; i < 4; i++)
{
    nums3[i] = nums[i] + nums2[i];
}
The problem in the above code is the instruction count: one add, plus loop overhead, per element. Across larger arrays (e.g., 6 billion nums) we're dealing with an insane number of cycles. But what if we could process in batches?
The solution (from The Lord)
Your first thought might be to unroll the loop and process 4 nums at once as 4 lines of code. This can be faster and help the CPU handle things better, but it won't provide a high enough speedup. Hence on the 6th day, God said, "Let there be 4 floats at once" praise be! With The Lord as Intel's shepherd praise be!, they would go on to make SSE. Standing for Streaming SIMD Extensions, this is the core SIMD technology you'll be using as a result of its wide support and ease of use.
Getting started with SSE in your C/C++ project.
To begin, include headers based on platform. Usually you'll do something like:
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
The header x86intrin.h is portable across most compilers, except for VC++. On VC++ you'll want to include intrin.h, which pulls in all of the intrinsics (including the ARM intrinsics if you're targeting ARM).
The part where I explain them
The core of every SIMD technology is the vector register: a wide CPU register like xmm0, xmm1, etc... (x86 SSE specific)
If you're a fan of linear algebra you'll be a fan of these vectors.
But if you're not a fan, you can think of these vector registers like arrays that have already been allocated, have a static size, and must be accessed through dedicated instructions.
These are wide registers, beyond the normal size of 32 or 64 bits that your platform normally handles.
With SSE you're dealing with 128-bit wide registers; AVX gives you 256-bit registers, and AVX-512 gives you 512-bit registers.
Some older SIMD solutions (e.g., Intel MMX, AMD 3DNow!) used 64-bit registers instead, but most solutions on x86 nowadays are entirely wide-register based.
At their core, these registers are just a wide data store. But they're designed to hold packed data in a vector of set size (e.g., a 128-bit wide register holds 4 32-bit single-precision floats or 2 64-bit double-precision floats).
What's special about them are the available instructions, which operate upon data that's packed into the register.
The part where you use them
Now that you have intrinsics, you can get a vector. The datatypes available to you are:
- __m128, the SSE 128-bit wide register type that holds single-precision floats
- __m128i, the SSE 128-bit wide register type that holds integers
- __m128d, the SSE 128-bit wide register type that holds double-precision floats
__m128 v0, v1, v2;
v0 = _mm_set_ps(1.f, 4.f, 1024.f, 2.f); // <- The multiple data
v1 = _mm_set_ps(217834.f, 3.f, 6.f, 9.f);
v2 = _mm_setzero_ps();
v2 = _mm_add_ps(v0, v1); // <- The single instruction.
float fILoveHungarianNotation = 0.f;
int bits = _mm_extract_ps(v2, 0); // Returns the lane's raw bits as an int
memcpy(&fILoveHungarianNotation, &bits, sizeof(float)); // memcpy from <string.h>
In this case we define 3 vectors with their intrinsic types, corresponding to registers in the CPU.
The first vector is set using _mm_set_ps to contain [1.f, 4.f, 1024.f, 2.f].
The second vector is set to contain [217834.f, 3.f, 6.f, 9.f].
The third vector, our destination, is set to be entirely zero using _mm_setzero_ps.
We then add these two using _mm_add_ps, which adds two float vectors.
Then we use _mm_extract_ps on lane 0 (the second argument) to get a value out of our destination vector. Note that _mm_extract_ps hands back the lane's raw bits as an int, so those bits have to be reinterpreted, not value-cast, to recover the float.
Gotcha!
If you've run the above code in your own situation, you might notice that fILoveHungarianNotation contains 11.f.
This is because Intel and The Lord praise be! love little-endianness.
They had to choose an order, and as long as it's consistent, it's okay. Therefore, what you think of as the order is actually the reverse: with a vector built as _mm_set_ps(1.f, 2.f, 3.f, 4.f), extracting lane 0 will get you 4.f.
If you're not a fan of this, or require a specific ordering, it's better to use _mm_setr_ps, which sets the lanes in reversed order (the first argument lands in lane 0) to let your big-endian brain more easily understand and preserve order.
_mm_set_ps and _mm_setr_ps are both equally fast, so you're fine.
The part where I explain the naming scheme
If you're literate, you might've noticed that the naming scheme looks weird. But fortunately, it's highly descriptive, in a Hungarian-notation sort of way.
_mm whaaaa
Things in SSE and AVX are prefaced with _mm because of MMX, Intel's old 64-bit-register SIMD extension.
The other stuff
PS stands for packed single-precision, meaning 32-bit floats.
PD stands for packed double-precision, meaning 64-bit doubles.
SS stands for scalar single-precision float. Usually this means the lower lane.
E.g., _mm_add_ss adds the lower (lane 0) single-precision floats of its two __m128 arguments, and copies the remaining lanes from the first argument.
EPI means signed integer, so the int family of data types.
The number that follows is the size in bits: epi8 is an 8-bit integer (int8_t), epi16 is a 16-bit integer (int16_t), and so on.
EPU is the unsigned version of EPI, so the unsigned int family.
The number that follows is also the size in bits here: epu8 is an 8-bit unsigned integer (uint8_t), epu16 is a 16-bit unsigned integer (uint16_t).
CMP stands for compare. So _mm_cmpeq_ps compares (checking for equality) two packed single-precision float vectors (__m128).
CVT stands for convert, but also copy.
E.g., _mm_cvtps_epi32 converts the packed single-precision floats in a vector to epi32 signed 32-bit integers (__m128i).
Meanwhile, _mm_cvtss_f32 returns the lower single-precision float from its __m128 argument. (In assembly this is a copy operation; the intrinsic just makes it a return value.)
These are the core names you'll find, the rest are highly intuitive and easy to understand once you use one of them.
Resources from Intel
The Intel Intrinsics Guide lets you search every intrinsic by name or by instruction set. Find it here