How to copy fast?

How to Copy Many Doubles in Single-Threaded Code? #

Code: bit-bcast/copy

The simple answer is:

memcpy(y, x, n * sizeof(double));

We measure 34 GB/s copied, which is now our baseline. Can we easily match or beat this using SIMD? That question will drive the document. The code is available here.

We compile with GCC using:

-g -O3 -march=native

The machine we’re running on has an Intel i7-11800H with AVX-512. Its RAM is dual channel with two memory banks, running at (presumably) 3200 MT/s rather than the advertised 3200 MHz. Two 64-bit channels at 3200 MT/s give 2 × 8 B × 3.2 GT/s ≈ 51.2 GB/s, so the theoretical peak bandwidth is roughly 50 GB/s.

The benchmark is to copy 8 GB of doubles.

Attempt 1: Simple for loop. #

void copy_v1(double * const y, double const * const x, size_t n) {
    for(size_t i = 0; i < n; ++i) {
        y[i] = x[i];
    }
}

We measure 18 GB/s copied. Clearly, something isn’t right.

Attempt 2: Try __restrict__. #

Let’s try again but use __restrict__ to help the compiler understand that it doesn’t need to worry about aliasing:

void copy_v2(double * const __restrict__ y,
             double const * const __restrict__ x,
             size_t n) {

The body is the same as in Attempt 1. Now the performance is 34 GB/s. So let’s look at what GCC generated, and hopefully we’ll understand exactly which instructions we need. Here goes:

1379:    e9 b2 fc ff ff      jmp    1030 <memcpy@plt>

Which is interesting for all the wrong reasons.

Attempt 3: Intrinsics #

We can use intrinsics to keep the compiler from replacing our loop with a call to memcpy:

#include <immintrin.h>

// Assumes n is a multiple of 8 and that x and y are 64-byte aligned,
// as _mm512_load_pd/_mm512_store_pd require.
void copy_avx(double * y, double const * x, size_t n) {
  while(n != 0) {
    __m512d xi = _mm512_load_pd(x);  // load 8 doubles (one 64-byte line)
    _mm512_store_pd(y, xi);          // store them to y

    x += 8;
    y += 8;
    n -= 8;
  }
}

This results in 18 GB/s. Not great. We can try unrolling, but that doesn’t change the performance.

Attempt 4: Non-temporal Store #

On x86_64 we can perform non-temporal writes. A regular write, such as the _mm512_store_pd above, on a cold memory location first reads the cache line from memory before writing it to L3, from where it is later evicted back to RAM. This is called a write-allocate. Non-temporal stores avoid this additional read. The disadvantage is that y will not be in cache after the copy; hence, this only pays off for large arrays.

This slight change to the loop body is enough to use non-temporal stores:

__m512d xi = _mm512_load_pd(x);
_mm512_stream_pd(y, xi);

We now measure 28 GB/s. Markedly better, but still slower than memcpy.

We could again try manually unrolling and reordering the loop, since there is no aliasing threat and plenty of SIMD registers to use. We could also try other SIMD widths, SSE and AVX2. None of these ideas change the result much, or at all.

Attempt 5: Disassembling memcpy #

We can use GDB to step through the code. We set a breakpoint on copy_memcpy and run to barrel past the setup code. The command layout asm opens a panel with the assembly code. Using si we can step one instruction at a time. We keep Enter pressed and watch a litany of mov and je instructions run across our screen, until eventually we enter a steady state. What we find is the following:

prefetcht0 0x80(%rsi)
prefetcht0 0xc0(%rsi)

prefetcht0 0x1080(%rsi)
prefetcht0 0x10c0(%rsi)

prefetcht0 0x2080(%rsi)
prefetcht0 0x20c0(%rsi)

prefetcht0 0x3080(%rsi)
prefetcht0 0x30c0(%rsi)

vmovdqu64 (%rsi),%ymm16
vmovdqu64 0x20(%rsi),%ymm17
vmovdqu64 0x40(%rsi),%ymm18
vmovdqu64 0x60(%rsi),%ymm19

vmovdqu64 0x1000(%rsi),%ymm20
vmovdqu64 0x1020(%rsi),%ymm21
vmovdqu64 0x1040(%rsi),%ymm22
vmovdqu64 0x1060(%rsi),%ymm23

vmovdqu64 0x2000(%rsi),%ymm24
vmovdqu64 0x2020(%rsi),%ymm25
vmovdqu64 0x2040(%rsi),%ymm26
vmovdqu64 0x2060(%rsi),%ymm27

vmovdqu64 0x3000(%rsi),%ymm28
vmovdqu64 0x3020(%rsi),%ymm29
vmovdqu64 0x3040(%rsi),%ymm30
vmovdqu64 0x3060(%rsi),%ymm31

vmovntdq %ymm16,(%rdi)
vmovntdq %ymm17,0x20(%rdi)
vmovntdq %ymm18,0x40(%rdi)
vmovntdq %ymm19,0x60(%rdi)

vmovntdq %ymm20,0x1000(%rdi)
vmovntdq %ymm21,0x1020(%rdi)
vmovntdq %ymm22,0x1040(%rdi)
vmovntdq %ymm23,0x1060(%rdi)

vmovntdq %ymm24,0x2000(%rdi)
vmovntdq %ymm25,0x2020(%rdi)
vmovntdq %ymm26,0x2040(%rdi)
vmovntdq %ymm27,0x2060(%rdi)

vmovntdq %ymm28,0x3000(%rdi)
vmovntdq %ymm29,0x3020(%rdi)
vmovntdq %ymm30,0x3040(%rdi)
vmovntdq %ymm31,0x3060(%rdi)

followed by some code to decrement a counter, increment the pointers %rsi and %rdi, and jump back to the top to make the loop work. Superficially, this looks uninteresting: prefetching, loop unrolling and non-temporal stores. At second glance, it’s prefetching some pretty odd addresses. Less odd once we recognize 0x1000 as 4096, i.e. one page. The important difference seems to be that we need to read from multiple pages at once.

We can try mimicking this with intrinsics:

// Assumes x and y are 64-byte aligned and that n is a multiple of
// four pages of doubles (4 * 512 = 2048).
void copy_pages(double * y, double const * x, size_t n) {
  size_t page_size = 0x1000 / sizeof(double);  // 512 doubles per 4 KiB page

  while(n != 0) {
    // Walk through one page while reading from four consecutive pages.
    for(size_t i = 0; i < page_size; i += 8) {
      __m512d x0i = _mm512_load_pd(x + 0 * page_size);
      __m512d x1i = _mm512_load_pd(x + 1 * page_size);
      __m512d x2i = _mm512_load_pd(x + 2 * page_size);
      __m512d x3i = _mm512_load_pd(x + 3 * page_size);

      _mm512_stream_pd(y + 0 * page_size, x0i);
      _mm512_stream_pd(y + 1 * page_size, x1i);
      _mm512_stream_pd(y + 2 * page_size, x2i);
      _mm512_stream_pd(y + 3 * page_size, x3i);

      x += 8;
      y += 8;
    }

    // The inner loop advanced past the first page; skip the other three.
    x += 3 * page_size;
    y += 3 * page_size;
    n -= 4 * page_size;
  }
}

and we now measure the same bandwidth of 34 GB/s as memcpy.
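One caveat: copy_pages only handles sizes that are a multiple of four pages of doubles, with 64-byte-aligned pointers. A general-purpose routine needs a tail path. A minimal sketch of such a wrapper (copy_any and bulk_copy_fn are our names, not the article's; bulk_memcpy stands in for copy_pages so the sketch is self-contained):

```c
#include <stddef.h>
#include <string.h>

// Same signature as the article's copy_pages.
typedef void (*bulk_copy_fn)(double *, double const *, size_t);

// Stand-in bulk copier; real code would pass copy_pages here instead.
static void bulk_memcpy(double *y, double const *x, size_t n) {
  memcpy(y, x, n * sizeof(double));
}

// Copy n doubles: run bulk_copy on the largest prefix that is a whole
// number of four-page blocks, then memcpy the remainder.
static void copy_any(double *y, double const *x, size_t n,
                     bulk_copy_fn bulk_copy) {
  size_t const block = 4 * (0x1000 / sizeof(double));  // 2048 doubles
  size_t const bulk = n - n % block;
  if (bulk != 0)
    bulk_copy(y, x, bulk);
  memcpy(y + bulk, x + bulk, (n - bulk) * sizeof(double));
}
```

For the tail, falling back to memcpy is fine: it is at most a few pages, so its cost is negligible next to the bulk copy.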