Is SIMD always faster?


All modern CPUs carry SIMD hardware units that software can exploit to run code even faster. But does using SIMD mean our code is always faster?

In this blog post, I write about my experience trying to optimize an algorithm to use faster hardware, and report the benchmarks and my conclusions.

TL;DR: it really depends on the algorithm, the data size AND other subsystems of the CPU.

Problem context

I was tasked with coming up with a benchmark for TF-IDF calculation.

TF-IDF stands for “Term Frequency-Inverse Document Frequency”, a statistical measure for calculating the relevance of words in a given piece of text. The measure is the product of two quantities:

  1. Term Frequency - The number of times a term (or a word) appears in a given piece of text
  2. Inverse Document Frequency - a number that is inversely related to how often the term appears across the documents of a given corpus

A corpus is a dataset of text (say, a selected set of books or other text that we use as our knowledge base). The idea behind the Inverse Document Frequency is that words that appear frequently in a text carry less information (words like “a”, “the”, “for”) than words that appear rarely. We can use the logarithm of the inverse of a word's document frequency in the corpus to calculate the inverse document frequency. After reading through SciPy’s source code, this is the implementation of IDF I came up with:

fn calculate_idf(&mut self) {
    // each unique word (token) in our corpus has a term_id
    for term_id in 0..self.tokens.len() {
        let mut df = 0;
        for document in self.documents.iter() {
            if document[term_id] > 0.0 {
                df += 1;
            }
        }
        let n_documents = self.documents.len();
        let present = df as f64;
        let idf = (((n_documents + 1) as f64) / (present + 1.0)).ln() + 1.0;
        self.idfs.insert(term_id, idf);
    }
}
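In formula form, the code above computes the smoothed IDF

idf(t) = ln((N + 1) / (df(t) + 1)) + 1

where N is the number of documents in the corpus and df(t) is the number of documents containing term t. As a quick sanity check with made-up numbers: in a corpus of 4 documents where a term appears in 2 of them, idf = ln(5/3) + 1 ≈ 1.51.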

Given the mapping from a term to its IDF value, we can calculate the TF-IDF of any input text as a vector (or an array; both terms are used interchangeably in this post):

    fn tfidf_representation(&self, input: &str) -> Vec<f64> {
        let mut row_vector = vec![0.0; self.tokens.len()];
        let tokens = self.tokenize(input);
        for token in tokens.iter() {
            let term_id = self.tokens_index.get(token).unwrap();
            let term_idf = self.idfs.get(term_id).unwrap();
            row_vector[*term_id] += *term_idf;
        }
        l2_normalize(&mut row_vector);
        return row_vector;
    }
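The l2_normalize helper isn't shown in this post; here is a minimal sketch of what it could look like, assuming plain Euclidean normalization:

fn l2_normalize(v: &mut [f64]) {
    // Euclidean (L2) norm: square root of the sum of squares
    let norm: f64 = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}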

The representation of any input text is a vector whose length equals the number of terms in our corpus.

Now that we have a vector representation of our document, we can use many measures to perform a “search”. For example, we can use cosine similarity (among many others) to search for documents that might match our input query.
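Since tfidf_representation L2-normalizes its output, cosine similarity between two representations reduces to a plain dot product. A minimal dense sketch (a hypothetical helper, not from the original code):

fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    // For L2-normalized vectors, cos(a, b) = a · b
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}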

Problem

While it is convenient to represent a piece of text as an array of numbers, the array is largely empty for most input queries, and even for most documents.

As an example, assume that our corpus has 65000 unique words. In our TF-IDF representation, the array for each document will have 65000 elements, one per unique word, holding that word's tf-idf weight in the document. An input query like “all our queries are belongs to us” contains only a handful of words, so except for maybe 8 locations, the array will be filled with 0s.

vector_for_input = [0.0, 0.0, 0.0, 0.45, 0.12,...... 0.0, 0.0]

Assuming that we are using 4-byte floating point numbers to represent a tf-idf element, we are wasting 64992*4 = 259968 bytes (~254KB!) per document just to represent the lack of a value. Even with 2- or 1-byte floating points, that is still ~128KB or ~64KB per document.

Solutions

A common solution to this problem is to use Sparse Vectors. A Sparse vector is a data structure that stores only the non-zero elements and their indices. This way, we can avoid tracking indices that do not store any value.

#[derive(Clone, Debug)]
struct SparseVector {
    indices: Vec<u16>,
    values: Vec<f32>,
}

Since our corpus has only 65000 unique words, we can represent each term (and thus each index in our vectors) using a 2-byte u16. With a 4-byte floating point for each tf-idf value, our example document representation now has a size of indices: (8x2) bytes + values: (8x4) bytes = 48 bytes! Compare that to ~64, ~128 or ~254KB!
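As an illustration (a sketch, not from the post's code base), a dense tf-idf row can be converted into this representation by keeping only its non-zero entries; scanning the row in order also guarantees the indices come out sorted, which matters for the dot product below. This assumes the weights have already been narrowed to f32 to match the struct:

fn to_sparse(dense: &[f32]) -> SparseVector {
    let mut indices = Vec::new();
    let mut values = Vec::new();
    for (i, &v) in dense.iter().enumerate() {
        if v != 0.0 {
            indices.push(i as u16); // safe while the vocabulary stays under 65536 terms
            values.push(v);
        }
    }
    SparseVector { indices, values }
}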

We can picture our SparseVector laid out in memory like this

indices (starting at 0xfffecd):  | 1 | 5 | 17 | 120 | 2013 | 4016 | 5123 | 61234 |
values (starting at 0xeeffecd):  | 0.1 | 0.3 | 0.0112 | 0.0004 | 0.0076 | 0.0134 | 0.098 | 0.012 |

The hexadecimal values are the starting memory addresses of the two arrays. Note that the indices of the SparseVector are sorted. This is crucial, otherwise our algorithm won’t work.

With a SparseVector representation using sorted indices, it is easy to calculate the dot product against another SparseVector: we simply walk through the indices of both vectors and multiply the values at matching indices. Since the indices are sorted, we only ever advance two pointers, without having to search for a matching index from the start of the other indices array, which gives us the dot product in linear time (O(n + m), where n and m are the number of non-zero entries in the two vectors).

fn sparse_dot_product(&self, other: &SparseVector) -> (u16, f64) {
    let mut result = 0.0;
    let mut i = 0;
    let mut j = 0;
    let mut matches: u16 = 0;
    while i < self.indices.len() && j < other.indices.len() {
        if self.indices[i] == other.indices[j] {
            matches += 1;
            result += f64::from(self.values[i] * other.values[j]);
            i += 1;
            j += 1;
        } else if self.indices[i] < other.indices[j] {
            i += 1;
        } else {
            j += 1;
        }
    }

    (matches, result)
}
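A tiny worked example (hypothetical values, assuming sparse_dot_product lives in an impl SparseVector block): indices 3 and 9 are shared, so the result is 0.5·0.2 + 0.25·0.4 = 0.2 with 2 matches.

let a = SparseVector { indices: vec![3, 7, 9],  values: vec![0.5, 0.1, 0.25] };
let b = SparseVector { indices: vec![3, 9, 12], values: vec![0.2, 0.4, 0.3]  };
let (matches, dot) = a.sparse_dot_product(&b);
assert_eq!(matches, 2);
assert!((dot - 0.2).abs() < 1e-6);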

SIMD-ification

This algorithm looks ripe for using SIMD instructions to accelerate the task of matching the indices. I will keep this post restricted to just the NEON instruction set, as it is the most common SIMD instruction set on ARM and is available on pretty much everything from a Raspberry Pi to Apple’s M4.

ARM has newer instruction sets like SVE and SVE2, but those are rare to find (outside of expensive server chips like Graviton4), and early benchmarks of SVE on the M4 Macs indicated that the throughput of SVE is often the same as, or lower than, NEON's.

A quick introduction to NEON for readers who aren't familiar with it: NEON registers are 128 bits wide. Each register can be split into lanes that are 8/16/32/64 bits wide, depending on the instruction used and the data. We can load 8 2-byte indices into a single NEON register like so:

The following code is written in C using NEON intrinsics

uint16x8_t a_vec, b_vec;
a_vec = vld1q_u16(a); // loads 8 u16 values starting at the address pointed to by a
b_vec = vld1q_u16(b);

To load the corresponding weights, we need 2 NEON registers/variables per vector: we are using 4-byte floating point values, and storing 8 of them takes 4*8 = 32 bytes (256 bits), which means 2 128-bit NEON registers.

float32x4_t a_weights_vec_low = vld1q_f32(a_weights);
float32x4_t a_weights_vec_high = vld1q_f32(a_weights + 4);

How can we attempt to speed up our algorithm using SIMD? Taking a page from Ash Vardanian’s absolutely fantastic intro on set intersections, I tried to implement set intersections with weights for my SparseVector dot product implementation.

The idea is as follows:

  1. In batches of 8, load indices from vectors a and b.
  2. In each batch, attempt to find matching indices
  3. If found, compute the product of the corresponding weights from the weights_vector

Since there are no native set intersection instructions in NEON, we have to emulate this ourselves.

a : | 3 | 7 | 9 | 15 | 18 | 22 | 34 | 57 |
b : | 0 | 1 | 2 | 3  | 9  | 23 | 32 | 36 | 

In this batch, indices 3 and 9 appear in both vectors. We thus multiply a_weights[0] * b_weights[3] and a_weights[2] * b_weights[4].

My code runs a loop for each batch, where each element of b is compared against all the elements of a in every iteration. If we find a match somewhere, the corresponding weights are multiplied.

float32x4_t zeroes = vdupq_n_f32(0.0); // all lanes set to 0.0, used by the bit-select below

for (int lane = 0; lane < register_size; lane++) {
    uint16_t elem = EXT_LANE_U16(b_vec, lane);
    uint16x8_t compare_vec = vceqq_u16(a_vec, vdupq_n_u16(elem));
    uint16_t has_matches = vaddvq_u16(compare_vec);
    if (has_matches == 0) {
        continue;
    }
    uint16x4_t matches_low = vget_low_u16(compare_vec);
    uint16x4_t matches_high = vget_high_u16(compare_vec);
    uint32x4_t masks_low = vreinterpretq_u32_s32(vmovl_s16(vreinterpret_s16_u16(matches_low)));
    uint32x4_t masks_high = vreinterpretq_u32_s32(vmovl_s16(vreinterpret_s16_u16(matches_high)));
    float b_weight = lane < 4 ?
        EXTRACT_LANE_F32(b_weights_vec_low, lane) :
        EXTRACT_LANE_F32(b_weights_vec_high, lane - 4);
    float32x4_t b_weights = vdupq_n_f32(b_weight);
    float32x4_t bl = vbslq_f32(masks_low, b_weights, zeroes);
    float32x4_t bh = vbslq_f32(masks_high, b_weights, zeroes);
    float32x4_t weights_product_low = vmulq_f32(bl, a_weights_vec_low);
    float32x4_t weights_product_high = vmulq_f32(bh, a_weights_vec_high);
    total_product += vaddvq_f32(weights_product_low);
    total_product += vaddvq_f32(weights_product_high);
    uint16_t lower_intersections = vaddv_u16(vshr_n_u16(matches_low, 15));
    uint16_t upper_intersections = vaddv_u16(vshr_n_u16(matches_high, 15));
    intersection_size += lower_intersections;
    intersection_size += upper_intersections;
}

The above loop can be quite a bit to grok all at once, so let's break it down into smaller chunks.

for (int lane = 0; lane < register_size; lane++) {
  uint16_t elem = EXT_LANE_U16(b_vec, lane);
  uint16x8_t compare_vec = vceqq_u16(a_vec, vdupq_n_u16(elem));

register_size is 8 (8 16-bit indices). We iterate over each lane of our SIMD register and load the element in that lane into elem. (EXT_LANE_U16 is not a NEON intrinsic; it stands in for a helper macro, since vgetq_lane_u16 requires the lane index to be a compile-time constant.) vdupq_n_u16(elem) results in a SIMD value of type uint16x8_t, where every lane holds the same value. We then compare it lane-wise with our a_vec using vceqq_u16, resulting in a SIMD value where each lane is set to all ones (0xFFFF) on a match and 0 on a mismatch (drawn below as 1 and 0). So if we compare 2 vectors like the following using vceqq_u16

a : | 3 | 7 | 9 | 15 | 18 | 22 | 34 | 57 |
b : | 3 | 3 | 3 | 3  | 3  | 3 |  3  |  3 | 

we will get a result like

res: | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

I used a quick shortcut to check whether there are any matches at all: add up all the lanes of res using vaddvq_u16. If the sum of all lanes in res is 0, then none of the lanes contained elem, so we can quickly move on to the next iteration of the loop.

uint16_t has_matches = vaddvq_u16(compare_vec);
if(has_matches == 0) {
  continue;
}
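In scalar terms (a model of the check, not the NEON code itself), this early exit is just:

// compare_vec lanes are 0xFFFF on a match and 0 otherwise, so a wrapping
// horizontal add is non-zero exactly when at least one of the 8 lanes matched.
fn has_any_match(compare_lanes: [u16; 8]) -> bool {
    compare_lanes.iter().copied().fold(0u16, u16::wrapping_add) != 0
}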

If the sum is non-zero, then we do have a match in one of the lanes. How do we figure out which lane has the match, without running yet another loop? That's the core trick in my code!

Since our indices are 16 bits, but our weights are 32-bit floating points, we can only fit 4 weights per SIMD register. So we need 2 float32x4_t SIMD variables to represent the weights of each vector in every batch:

    float32x4_t a_weights_vec_low = vld1q_f32(a_weights);
    float32x4_t a_weights_vec_high = vld1q_f32(a_weights + 4);

To make it easier to work with 2 split weight variables, we also split compare_vec into 2, each consisting of the upper and lower 4 lanes respectively:

uint16x4_t matches_low = vget_low_u16(compare_vec); // lower 4 lanes
uint16x4_t matches_high = vget_high_u16(compare_vec); // upper 4 lanes
uint32x4_t masks_low = vreinterpretq_u32_s32(vmovl_s16(vreinterpret_s16_u16(matches_low))); // sign extend each lane from 16 bits to 32 bits
uint32x4_t masks_high = vreinterpretq_u32_s32(vmovl_s16(vreinterpret_s16_u16(matches_high))); // sign extend

Splitting compare_vec this way results in 2 mask vectors of 4 32-bit integers each. The vmovl_s16 instruction sign-extends each lane, converting each 16-bit integer into a 32-bit integer (the vreinterpret calls are only there to keep the signed/unsigned vector types consistent).

compare_vec:  | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
matches_low:  | 1 | 0 | 0 | 0 |
matches_high: | 0 | 0 | 0 | 0 |
masks_low:  | ...1111 | 0 | 0 | 0 | // 4 32-bit integers
masks_high: | 0 | 0 | 0 | 0 |

a_weights_vec_low and a_weights_vec_high contain the lower and upper weights of our a_vec. masks_low and masks_high indicate if elem of b_vec has a match in the lower and upper halves of a_vec.
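Sign extension (rather than zero extension) is what keeps the mask usable: a matched lane is 0xFFFF, and sign-extending it yields 0xFFFFFFFF, an all-ones 32-bit mask that can select an entire f32 lane in the bit-select step below. In scalar terms:

// Widening one 16-bit mask lane to 32 bits: 0xFFFF -> 0xFFFFFFFF, 0 -> 0.
fn widen_mask(m: u16) -> u32 {
    (m as i16) as i32 as u32
}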

Using the mask variables, we can calculate the dot product of the matching elements without having to iterate through the lanes of res, masks_low, or masks_high one by one.

float b_weight = lane < 4 ?
    EXTRACT_LANE_F32(b_weights_vec_low, lane) :
    EXTRACT_LANE_F32(b_weights_vec_high, lane - 4); // extract the weight corresponding to elem from b
    
float32x4_t zeroes = vdupq_n_f32(0.0); // a vector with just 0.0 in all lanes
float32x4_t b_weights = vdupq_n_f32(b_weight); // vector with all lanes duplicated to b_weight
float32x4_t bl = vbslq_f32(masks_low, b_weights, zeroes); // where a mask lane is all 1s, select b_weight, otherwise 0.0
float32x4_t bh = vbslq_f32(masks_high, b_weights, zeroes);
float32x4_t weights_product_low = vmulq_f32(bl, a_weights_vec_low);
float32x4_t weights_product_high = vmulq_f32(bh, a_weights_vec_high);

We use the masks to zero out the lanes whose mask is 0, using the vbslq_f32 (bit select) intrinsic. Essentially it works like this (w stands for b_weight):

masks_high: | ...1111 | ...0 | ...0 | ...0 |     masks_low: | ...0 | ...0 | ...0 | ...0 |
b_weights:  |    w    |  w   |  w   |  w   |     b_weights: |  w   |  w   |  w   |  w   |
bh:         |    w    |  0   |  0   |  0   |     bl:        |  0   |  0   |  0   |  0   |

If there is a match between lane 7 of a_vec and elem of b_vec, masks_high will have all 1s in lane 3. After the bit select, only that lane of bh will have the corresponding weight set; all the other lanes will have a zero weight.

When we multiply this with the weights of a_vec split into a_weights_vec_low and a_weights_vec_high, we will ensure that only the values of the matching lanes are multiplied.

bh:       |  w  | 0 | 0 | 0 |     bl:      | 0 | 0 | 0 | 0 |
               X                               X
a_w_high: |  x  | y | z | u |     a_w_low: | p | q | r | s |
result:   | w*x | 0 | 0 | 0 |     result:  | 0 | 0 | 0 | 0 |

Since all the indices in a SparseVector are unique, we can be sure that at most one lane of the resulting vectors (weights_product_high and weights_product_low) will hold a non-zero value. We can therefore safely sum up both vectors to get the product of the weights of the matching lanes.

total_product += vaddvq_f32(weights_product_low);
total_product += vaddvq_f32(weights_product_high);
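Putting the select, multiply, and horizontal add together, here is a scalar model of what happens to one 4-lane half per iteration (a sketch for illustration, not the NEON code):

// mask[lane] mirrors masks_low/masks_high: true where a's index equals elem.
// Select b_weight where the mask is set, multiply lane-wise with a's weights,
// then sum the lanes -- only the matching lane contributes to the total.
fn masked_dot_half(mask: [bool; 4], b_weight: f32, a_weights: [f32; 4]) -> f32 {
    let mut sum = 0.0f32;
    for lane in 0..4 {
        let selected = if mask[lane] { b_weight } else { 0.0 };
        sum += selected * a_weights[lane];
    }
    sum
}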

Similarly, to track the total number of matched indices, we can sum across the match vectors (matches_low and matches_high):

// vshr_n_u16 is a shift-right operation. If a mask lane is all 1s (0xFFFF), shifting right by 15 leaves just a 1 in the LSB,
// which is the count of matches in that lane (1)
uint16_t lower_intersections = vaddv_u16(vshr_n_u16(matches_low, 15));
uint16_t upper_intersections = vaddv_u16(vshr_n_u16(matches_high, 15));
intersection_size += lower_intersections;
intersection_size += upper_intersections;
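In scalar terms, the counting step is (a model, not the intrinsics):

// Each mask lane is 0xFFFF or 0; shifting right by 15 turns that into 1 or 0,
// and summing the lanes counts the matches in that half.
fn count_matches_half(mask_lanes: [u16; 4]) -> u16 {
    mask_lanes.iter().map(|&m| m >> 15).sum()
}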

In summary, we did the following:

  1. Batch-process our SparseVectors, 8 elements per batch
  2. Load the weights of each vector in the batch into 2 SIMD variables
  3. Loop over each lane in b_vec and compare it against all lanes of a_vec
  4. Use the mask vectors to multiply the weights of only the matching lanes

Once we are done processing a batch (end of our initial for loop), we need to figure out how to advance our array pointers to load the next batch.

Recall our initial batch:

a : | 3 | 7 | 9 | 15 | 18 | 22 | 34 | 57 |
b : | 0 | 1 | 2 | 3  | 9  | 23 | 32 | 36 | 

After comparing all the elements of b against a, how should we advance *a and *b? max(b) is 36 and max(a) is 57. Since our vectors are sorted, we can be sure that none of the current elements of b will appear in subsequent batches of a. We can thus safely advance b by 8 elements (b + 8).

For a, since max(b) < max(a), we advance a so that it points to the first element > max(b). In this case that element is 57, so our next load of 8 elements will start at a + 7.

The lines after the for loop advance the pointers using this logic:

uint8_t askips = skips(a_vec, b_max);
a += 8 - askips;
uint8_t bskips = skips(b_vec, a_max);
b += 8 - bskips;
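In scalar terms (a sketch with a hypothetical helper name, not the author's code), skips counts how many indices in a batch are greater than the other batch's maximum, so 8 - skips is how far it is safe to advance:

// For a batch of 8 sorted indices, count how many exceed `max`; the pointer
// then advances past every element that is <= max.
fn skips_scalar(batch: [u16; 8], max: u16) -> usize {
    batch.iter().filter(|&&x| x > max).count()
}

// Example from the post: a advances by 8 - 1 = 7, b advances by 8 - 0 = 8.
// skips_scalar([3, 7, 9, 15, 18, 22, 34, 57], 36) == 1
// skips_scalar([0, 1, 2, 3, 9, 23, 32, 36], 57)   == 0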

I implemented the skips fn using intrinsics:

inline uint8_t skips(uint16x8_t indexes, uint16_t max) {
     uint16x8_t smaller_elements = vcleq_u16(indexes, vdupq_n_u16(max)); // lane = all 1s if indexes[lane] <= max
     uint8x8_t  low_bits = vshrn_n_u16(smaller_elements, 8);
     uint64_t results = vreinterpret_u64_u8(low_bits);
     int leading_zeroes = __builtin_clzl(results);
     return leading_zeroes / 8;
}

We compare max with each index in indexes. A lane in smaller_elements is set to all 1s if indexes[lane] <= max, else 0. vshrn_n_u16 (shift right and narrow) then shifts each 16-bit lane right by 8 and keeps one byte per lane (0xff for a lane that is <= max, 0x00 otherwise), and vreinterpret_u64_u8 consolidates the 8 8-bit lanes into a single 64-bit results value. We then count the number of leading zero bits using the builtin function __builtin_clzl. Note that higher indices are loaded into the upper lanes, so the most significant byte of results corresponds to the largest index in the batch, i.e. to comparing max(a) with max(b).

I took a snapshot of the debugger to better visualize this :

* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000100003eac my`skips(indexes=([0] = 3, [1] = 7, [2] = 9, [3] = 15, [4] = 18, [5] = 22, [6] = 34, [7] = 57), max=36) at my.c:10:13
   7   	     uint8x8_t  low_bits = vshrn_n_u16(smaller_elements, 8);
   8   	     uint64_t results = vreinterpret_u64_u8(low_bits);
   9   	     int leading_zeroes = __builtin_clzl(results);
-> 10  	     return leading_zeroes / 8;
   11  	}
   12
   13  	int main() {
(lldb) frame variable
(uint16x8_t) indexes = (3, 7, 9, 15, 18, 22, 34, 57)
(uint16_t) max = 36
(uint16x8_t) smaller_elements = (65535, 65535, 65535, 65535, 65535, 65535, 65535, 0)
(uint8x8_t) low_bits = (0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00)
(uint64_t) results = 0x00ffffffffffffff
(int) leading_zeroes = 8

We divide by 8 in the last line because each element's comparison result occupies 8 bits in results, so the leading-zero bit count divided by 8 gives the number of elements greater than max.

Since only one element in a is > max(b), we skip a forward by 8 - 1 = 7 indices.

Results

I just described, rather arduously, the long way towards SIMD-ifying the algorithm. Did it speed up my code tremendously? Sadly, no. The algorithm doesn't seem to be much faster than the straightforward 20 lines of Rust code. Running some benchmarks on my M2 Mac Pro, the SIMD code is sometimes faster, but more often slower, or only a little bit better than the plain version:

plain dot product:  64 against 8, avg elapsed_time ns: 166
plain dot product:  64 against 8, avg elapsed_time ns: 187
plain dot product:  64 against 8, avg elapsed_time ns: 141
plain dot product:  64 against 16, avg elapsed_time ns: 195
plain dot product:  64 against 16, avg elapsed_time ns: 183
plain dot product:  64 against 16, avg elapsed_time ns: 174
plain dot product:  64 against 32, avg elapsed_time ns: 254
plain dot product:  64 against 32, avg elapsed_time ns: 216
plain dot product:  64 against 32, avg elapsed_time ns: 137
plain dot product:  128 against 8, avg elapsed_time ns: 346
plain dot product:  128 against 8, avg elapsed_time ns: 345
plain dot product:  128 against 8, avg elapsed_time ns: 191
plain dot product:  128 against 16, avg elapsed_time ns: 491
plain dot product:  128 against 16, avg elapsed_time ns: 375
plain dot product:  128 against 16, avg elapsed_time ns: 370
plain dot product:  128 against 32, avg elapsed_time ns: 429
plain dot product:  128 against 32, avg elapsed_time ns: 399
plain dot product:  128 against 32, avg elapsed_time ns: 295
plain dot product:  512 against 8, avg elapsed_time ns: 1354
plain dot product:  512 against 8, avg elapsed_time ns: 1346
plain dot product:  512 against 8, avg elapsed_time ns: 1404
plain dot product:  512 against 16, avg elapsed_time ns: 1299
plain dot product:  512 against 16, avg elapsed_time ns: 1470
plain dot product:  512 against 16, avg elapsed_time ns: 1454
plain dot product:  512 against 32, avg elapsed_time ns: 1408
plain dot product:  512 against 32, avg elapsed_time ns: 1504
plain dot product:  512 against 32, avg elapsed_time ns: 1395
plain dot product:  1024 against 8, avg elapsed_time ns: 2795
plain dot product:  1024 against 8, avg elapsed_time ns: 2766
plain dot product:  1024 against 8, avg elapsed_time ns: 2754
plain dot product:  1024 against 16, avg elapsed_time ns: 2858
plain dot product:  1024 against 16, avg elapsed_time ns: 2650
plain dot product:  1024 against 16, avg elapsed_time ns: 2783
plain dot product:  1024 against 32, avg elapsed_time ns: 2929
plain dot product:  1024 against 32, avg elapsed_time ns: 2904
plain dot product:  1024 against 32, avg elapsed_time ns: 2916
plain dot product:  2048 against 8, avg elapsed_time ns: 5433
plain dot product:  2048 against 8, avg elapsed_time ns: 5487
plain dot product:  2048 against 8, avg elapsed_time ns: 5337
plain dot product:  2048 against 16, avg elapsed_time ns: 5591
plain dot product:  2048 against 16, avg elapsed_time ns: 5549
plain dot product:  2048 against 16, avg elapsed_time ns: 5479
plain dot product:  2048 against 32, avg elapsed_time ns: 5729
plain dot product:  2048 against 32, avg elapsed_time ns: 5666
plain dot product:  2048 against 32, avg elapsed_time ns: 5921
NEON: 64 vs 8 avg elapsed_time ns: 6654
NEON: 64 vs 8 avg elapsed_time ns: 187
NEON: 64 vs 8 avg elapsed_time ns: 362
NEON: 64 vs 16 avg elapsed_time ns: 325
NEON: 64 vs 16 avg elapsed_time ns: 308
NEON: 64 vs 16 avg elapsed_time ns: 287
NEON: 64 vs 32 avg elapsed_time ns: 279
NEON: 64 vs 32 avg elapsed_time ns: 258
NEON: 64 vs 32 avg elapsed_time ns: 291
NEON: 128 vs 8 avg elapsed_time ns: 800
NEON: 128 vs 8 avg elapsed_time ns: 833
NEON: 128 vs 8 avg elapsed_time ns: 854
NEON: 128 vs 16 avg elapsed_time ns: 737
NEON: 128 vs 16 avg elapsed_time ns: 471
NEON: 128 vs 16 avg elapsed_time ns: 600
NEON: 128 vs 32 avg elapsed_time ns: 445
NEON: 128 vs 32 avg elapsed_time ns: 412
NEON: 128 vs 32 avg elapsed_time ns: 479
NEON: 512 vs 8 avg elapsed_time ns: 3000
NEON: 512 vs 8 avg elapsed_time ns: 2708
NEON: 512 vs 8 avg elapsed_time ns: 2470
NEON: 512 vs 16 avg elapsed_time ns: 1700
NEON: 512 vs 16 avg elapsed_time ns: 2000
NEON: 512 vs 16 avg elapsed_time ns: 1616
NEON: 512 vs 32 avg elapsed_time ns: 1008
NEON: 512 vs 32 avg elapsed_time ns: 1362
NEON: 512 vs 32 avg elapsed_time ns: 1191
NEON: 1024 vs 8 avg elapsed_time ns: 6492
NEON: 1024 vs 8 avg elapsed_time ns: 6179
NEON: 1024 vs 8 avg elapsed_time ns: 3333
NEON: 1024 vs 16 avg elapsed_time ns: 3675
NEON: 1024 vs 16 avg elapsed_time ns: 5908
NEON: 1024 vs 16 avg elapsed_time ns: 5704
NEON: 1024 vs 32 avg elapsed_time ns: 2587
NEON: 1024 vs 32 avg elapsed_time ns: 2166
NEON: 1024 vs 32 avg elapsed_time ns: 1820
NEON: 2048 vs 8 avg elapsed_time ns: 10437
NEON: 2048 vs 8 avg elapsed_time ns: 6620
NEON: 2048 vs 8 avg elapsed_time ns: 8816
NEON: 2048 vs 16 avg elapsed_time ns: 10650
NEON: 2048 vs 16 avg elapsed_time ns: 9462
NEON: 2048 vs 16 avg elapsed_time ns: 11429
NEON: 2048 vs 32 avg elapsed_time ns: 6304
NEON: 2048 vs 32 avg elapsed_time ns: 3392
NEON: 2048 vs 32 avg elapsed_time ns: 4395

One of the reasons could be that storing the indices and the values as contiguous arrays makes the serial algorithm extremely cache friendly, and all the lane duplication and masking instructions are simply too much overhead compared to just reading the values from cache. SIMD instructions are also not free, and certain instructions may have much higher latency than their scalar counterparts. I don't yet have a good way to benchmark the latency of NEON instructions on the Mac, so until then I can only speculate about why the serial version is faster in this case.

Conclusions

Trying to SIMD-ify an algorithm is an extremely worthwhile exercise and a lesson in computers and programming. I still believe there is a lot of scope for improving the algorithm or optimizing the instructions, as I am by no means an expert on this subject. The result was a bit disappointing, but not really surprising. It reinforced the notion that just because your code is running on SIMD hardware doesn't mean that your code is running faster. As always: write, measure and observe instead of just making assumptions 😀.

You can find the code for this post here on my GitHub.