How to Copy Many Doubles in Single Threaded Code? #
Code: bit-bcast/copy
The simple answer is:
memcpy(y, x, n);
We measure 34 GB/s copied, which is now our baseline. Can we easily match or beat this using SIMD? This question will drive the document. The code is available here.
We compile with GCC using:
-g -O3 -march=native
The machine we’re running on has an Intel i7 11800H with AVX-512. Its RAM is dual channel, with two memory banks, running at (presumably) 3200 MT/s rather than the advertised 3200 MHz. This results in a theoretical peak bandwidth of about 50 GB/s (3200 MT/s × 8 bytes per transfer × 2 channels ≈ 51.2 GB/s).
The benchmark is to copy 8 GB of doubles.
Attempt 1: Simple for loop. #
void copy_v1(double * const y, double const * const x, size_t n) {
    for(size_t i = 0; i < n; ++i) {
        y[i] = x[i];
    }
}
We measure 18 GB/s copied. Clearly, something isn’t right.
Attempt 2: Try __restrict__. #
Let’s try again, but use __restrict__ to help the compiler understand that it doesn’t need to worry about aliasing:
void copy_v2(double * const __restrict__ y,
             double const * const __restrict__ x,
             size_t n) {
The body is the same as in Attempt 1. Now the performance is 34 GB/s. So let’s look at what GCC generated, hoping to learn exactly which instructions we need. Here goes:
1379: e9 b2 fc ff ff jmp 1030 <memcpy@plt>
Which is interesting for all the wrong reasons.
Attempt 3: Intrinsics #
We can use intrinsics to convince the compiler to not optimize everything away:
#include <immintrin.h>

// Assumes n is a multiple of 8 and both pointers are 64-byte aligned.
void copy_avx(double * y, double const * x, size_t n) {
    while(n != 0) {
        __m512d xi = _mm512_load_pd(x);
        _mm512_store_pd(y, xi);
        x += 8;
        y += 8;
        n -= 8;
    }
}
This results in 18 GB/s. Not great. We can try unrolling, but that doesn’t change the performance.
Attempt 4: Non-temporal Store #
On x86_64 we can perform non-temporal writes. A regular write, such as _mm512_store_pd above, on a cold memory location will first read the cache line into L3 before writing to it; the line is later evicted. This is called a write-allocate. Using non-temporal stores we can avoid this additional read. The disadvantage is that y will not be in cache after the copy. Hence, this only works well for large arrays.
This slight change to the loop body is enough to use non-temporal stores:
__m512d xi = _mm512_load_pd(x);
_mm512_stream_pd(y, xi);
We now measure 28 GB/s. Markedly better, but still slower than memcpy.
We could again try manually unrolling and reordering the loop, since there is no aliasing threat and more SIMD registers are available. We could also try other SIMD widths, SSE and AVX2. None of these ideas seem to matter much, or at all.
Attempt 5: Disassembling memcpy #
We can use GDB to step through the code. We set a breakpoint on copy_memcpy and run to barrel past the setup code. Calling layout asm opens a panel with the assembly code. Using si we can step one assembly instruction at a time. We keep Enter pressed and watch a litany of movs and jes run across our screen, until eventually we enter a steady state. What we find is the following:
prefetcht0 0x80(%rsi)
prefetcht0 0xc0(%rsi)
prefetcht0 0x1080(%rsi)
prefetcht0 0x10c0(%rsi)
prefetcht0 0x2080(%rsi)
prefetcht0 0x20c0(%rsi)
prefetcht0 0x3080(%rsi)
prefetcht0 0x30c0(%rsi)
vmovdqu64 (%rsi),%ymm16
vmovdqu64 0x20(%rsi),%ymm17
vmovdqu64 0x40(%rsi),%ymm18
vmovdqu64 0x60(%rsi),%ymm19
vmovdqu64 0x1000(%rsi),%ymm20
vmovdqu64 0x1020(%rsi),%ymm21
vmovdqu64 0x1040(%rsi),%ymm22
vmovdqu64 0x1060(%rsi),%ymm23
vmovdqu64 0x2000(%rsi),%ymm24
vmovdqu64 0x2020(%rsi),%ymm25
vmovdqu64 0x2040(%rsi),%ymm26
vmovdqu64 0x2060(%rsi),%ymm27
vmovdqu64 0x3000(%rsi),%ymm28
vmovdqu64 0x3020(%rsi),%ymm29
vmovdqu64 0x3040(%rsi),%ymm30
vmovdqu64 0x3060(%rsi),%ymm31
vmovntdq %ymm16,(%rdi)
vmovntdq %ymm17,0x20(%rdi)
vmovntdq %ymm18,0x40(%rdi)
vmovntdq %ymm19,0x60(%rdi)
vmovntdq %ymm20,0x1000(%rdi)
vmovntdq %ymm21,0x1020(%rdi)
vmovntdq %ymm22,0x1040(%rdi)
vmovntdq %ymm23,0x1060(%rdi)
vmovntdq %ymm24,0x2000(%rdi)
vmovntdq %ymm25,0x2020(%rdi)
vmovntdq %ymm26,0x2040(%rdi)
vmovntdq %ymm27,0x2060(%rdi)
vmovntdq %ymm28,0x3000(%rdi)
vmovntdq %ymm29,0x3020(%rdi)
vmovntdq %ymm30,0x3040(%rdi)
vmovntdq %ymm31,0x3060(%rdi)
followed by some code to decrement a counter, increment the pointers %rsi and %rdi, and jump back to the top to make the loop work.
Superficially, this looks uninteresting: prefetching, loop unrolling, and a non-temporal store. At second glance, it’s prefetching some pretty odd addresses. Less odd once we recognize 0x1000 as 4096, i.e. one page. The important difference seems to be that we need to read from multiple pages at once.
We can try mimicking this with intrinsics:
// Assumes n is a multiple of 4 * page_size and 64-byte alignment.
void copy_pages(double * y, double const * x, size_t n) {
    size_t page_size = 0x1000 / sizeof(double);   // 512 doubles per 4 KiB page
    while(n != 0) {
        for(size_t i = 0; i < page_size; i += 8) {
            __m512d x0i = _mm512_load_pd(x + 0 * page_size);
            __m512d x1i = _mm512_load_pd(x + 1 * page_size);
            __m512d x2i = _mm512_load_pd(x + 2 * page_size);
            __m512d x3i = _mm512_load_pd(x + 3 * page_size);
            _mm512_stream_pd(y + 0 * page_size, x0i);
            _mm512_stream_pd(y + 1 * page_size, x1i);
            _mm512_stream_pd(y + 2 * page_size, x2i);
            _mm512_stream_pd(y + 3 * page_size, x3i);
            x += 8;
            y += 8;
        }
        x += 3 * page_size;
        y += 3 * page_size;
        n -= 4 * page_size;
    }
}
and we now measure the same bandwidth of 34 GB/s as memcpy.