The Nullman

The SIMD guide for C and C++ programmers who have no idea what they're doing
By The Nullman
The basics
SIMD stands for Single Instruction, Multiple Data. It's a beyond-scalar, high-performance register technology given to us by The Lord! praise be! The following code illustrates the scalar approach:
// Scalar
float nums[4] = {1.f, 2.f, 3.f, 4.f};  // example values
float nums2[4] = {5.f, 6.f, 7.f, 8.f}; // example values
float nums3[4]; // Empty
for (int i = 0; i < 4; i++)
{
    nums3[i] = nums[i] + nums2[i];
}
The problem in the above code is the instruction count: one add, plus loop overhead, per element. Across larger arrays (e.g., 6 billion nums) we're dealing with an insane number of cycles. But what if we could process in batches?
The solution (from The Lord)
Your first thought might be to unroll the loop and process 4 nums at once as 4 lines of code. This can be faster and help the CPU handle things better, but it won't provide a high enough speedup. Hence on the 6th day, God said, "Let there be 4 floats at once" praise be! With The Lord as Intel's shepherd praise be!, they would go on to make SSE. Standing for Streaming SIMD Extensions, this is the core SIMD technology you'll be using as a result of its wide support and ease of use.
Getting started with SSE in your C/C++ project.
To begin, include headers based on platform. Usually you'll do something like:
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
The header x86intrin.h is portable across most compilers, except for VC++. On VC++ you'll want to include intrin.h, which pulls in all of the intrinsics (including the ARM intrinsics if you're targeting ARM).
The part where I explain them
The core of every SIMD technology is the vector register: a wide CPU register like xmm0, xmm1, etc... (x86 SSE specific)
If you're a fan of linear algebra you'll be a fan of these vectors.
But if you're not a fan, you can think of these vector registers like arrays that have already been allocated, have a static size, and must be accessed through dedicated instructions.
These are wide registers, beyond the normal size of 32 or 64 bits that your platform normally handles.
With SSE you're dealing with 128-bit wide registers; AVX gives you 256-bit registers, and AVX-512 gives you 512-bit registers.
Some older SIMD solutions (e.g., Intel MMX, AMD 3DNow!) used 64-bit registers instead, but most solutions on x86 nowadays are entirely wide-register based.
At their core, these registers are just a wide data store. But they're designed to hold packed data in a vector of set size (e.g., a 128-bit wide register holds 4 32-bit single-precision floats or 2 64-bit double-precision floats).
What's special about them are the available instructions, which operate upon data that's packed into the register.
The part where you use them
Now that you have intrinsics, you can get a vector. The datatypes available to you are:
- __m128, the SSE 128-bit wide register type that holds single-precision floats
- __m128i, the SSE 128-bit wide register type that holds integers
- __m128d, the SSE 128-bit wide register type that holds double-precision floats
__m128 v0, v1, v2;
v0 = _mm_set_ps(1.f, 4.f, 1024.f, 2.f); // <- The multiple data
v1 = _mm_set_ps(217834.f, 3.f, 6.f, 9.f);
v2 = _mm_setzero_ps();
v2 = _mm_add_ps(v0, v1); // <- The single instruction.
float fILoveHungarianNotation = 0.f;
int bits = _mm_extract_ps(v2, 0); // Returns the lane's raw bits as an int
memcpy(&fILoveHungarianNotation, &bits, sizeof(float)); // memcpy from <string.h>
In this case we define 3 vectors with their intrinsic types, corresponding to registers in the CPU.
The first vector is set using _mm_set_ps to contain [1.f, 4.f, 1024.f, 2.f].
The second vector is set to contain [217834.f, 3.f, 6.f, 9.f].
The third vector, our destination, is set to be entirely zero using _mm_setzero_ps.
We then add these two using _mm_add_ps, which adds two float vectors.
Then we use _mm_extract_ps on lane 0 (the second argument) to get a value out of our destination vector. Note that _mm_extract_ps hands back the lane's raw bits as an int, so those bits have to be reinterpreted, not value-cast, to recover the float.
Gotcha!
If you've run the above code in your own situation, you might notice that fILoveHungarianNotation contains 11.f.
This is because Intel and The Lord praise be! love little-endianness.
They had to choose an order, and as long as it's consistent, it's okay. Therefore, what you think of as the order is actually the reverse: with a vector built as _mm_set_ps(1.f, 2.f, 3.f, 4.f), extracting lane 0 will get you 4.f.
If you're not a fan of this, or require a specific ordering, it's better to use _mm_setr_ps, which sets the lanes in reversed order (the first argument lands in lane 0) to let your big-endian brain more easily understand and preserve order.
_mm_set_ps and _mm_setr_ps are both equally fast, so you're fine.
The part where I explain the naming scheme
If you're literate, you might've noticed that the naming scheme looks weird. But fortunately, it's highly descriptive, in a Hungarian-notation sort of way.
_mm whaaaa
Things in SSE and AVX are prefaced with _mm because of MMX, Intel's old 64-bit-register SIMD extension.
The other stuff
PS stands for packed single-precision, meaning 32-bit floats.
PD stands for packed double-precision, meaning 64-bit doubles.
SS stands for scalar single-precision float. Usually this means the lower lane.
E.g., _mm_add_ss adds the lower (lane 0) single-precision floats of its two __m128 arguments, and copies the remaining lanes from the first argument.
EPI means signed integer, so the int family of data types.
The number that follows is the size in bits: epi8 is an 8-bit integer (int8_t), epi16 is a 16-bit integer (int16_t), and so on.
EPU is the unsigned version of EPI, so the unsigned int family.
The number that follows is also the size in bits here: epu8 is an 8-bit unsigned integer (uint8_t), epu16 is a 16-bit unsigned integer (uint16_t).
CMP stands for compare. So _mm_cmpeq_ps compares (checking for equality) two packed single-precision float vectors (__m128).
CVT stands for convert, but also copy.
E.g., _mm_cvtps_epi32 converts the packed single-precision floats in a vector to epi32 signed 32-bit integers (__m128i).
Meanwhile, _mm_cvtss_f32 returns the lower single-precision float from its __m128 argument. (In assembly this is a copy operation; the intrinsic just makes it a return value.)
These are the core names you'll find, the rest are highly intuitive and easy to understand once you use one of them.
Resources from Intel
The Intel Intrinsics Guide lets you search every intrinsic by name or by instruction set. Find it here