Exploring the limits of multicore acceleration

As CPU clock frequencies approach what seems to be an asymptotic limit of 3-5 GHz, and as architectural advances bring the number of cycles per instruction closer to one, parallel processing is the inevitable approach to acceleration. However, parallel and multicore are not the same thing. The cores in a multicore CPU generally share resources such as memory buses, FPU or vector units, or caches. Furthermore, I/O performance is also a bottleneck for processing large data. Similarly to hard disk drives but at a much faster scale, memory buses, including internal cache buses, are optimized for streaming access, so that they have a "seek time" and truly random memory access incurs a significant penalty.

All this means that even if your algorithm is perfectly parallelizable, having a 20-core CPU won't necessarily make it run 20 times faster.

The question is then, what kinds of programs will run 20 times faster?

First, let's talk about "HyperThreading". This is an Intel technology whereby a chip has, say, twice as many registers and program counters than the number of execution units. I didn't keep up with CPU architecture terminology, but this basically means that the uses sees, say, 20 CPUs and can use them as such, except that these 20 "surface CPUs" share 10 "real CPUs". Of course, how much individual hardware each surface CPU will vary according to CPU model and is probably part of Intel's trade secrets, but can probably be inferred easily by knowledgeable people from documentation and tests.

Nevertheless, my guess is that the surface CPUs must have at least their own x86 instruction decoders, and can probably execute at least branches and basic arithmetic individually. Is that a marketing gimmick, or does it make sense? Well most of the instructions a CPU executes in a general-purpose program are about setting up and tearing down stack frames, shifting things between registers, moving pointers and the occasional test, and lots of waiting for things to happen. In other words, computers are actually bureaucrats most of the time, doing administrative work. Hyperthreading is probably tuned to allow this kind of bureaucratic workload to be executed in parallel using the two surface CPUs.

However, since we're talking about serious computation that won't cut the mustard, and we will consider a hyperthreaded Intel CPU to have half the number of cores.

Note that if you are interested in code-breaking, the surface CPUs may be sufficient to run crypto algorithms. This brings the simples and most easily parallelizable workload. Identical, small and tight loops with no memory access and no synchronization at all. Each loop computes the next key, executes the encryption (or decryption) algorithm, performs a comparison and cycles since it hasn't found the key. If the crypto step is a typical symmetrical cipher or hash this requires only basic arithmetic (by design) and no memory I/O, so that one would expect to get maximum parallelization.

On the other hand, the worst workload would be a large floating-point dataset (say 500 MB) where, say, each task has to read a significant, random portion of the dataset, do some math on it and write the result. (Since we assumed at the very beginning that the problem is embarassingly parallel, there is no need for synchronization or to combine the results.) As the dataset won't fit in caches, it will create a lot of memory traffic, and since the address patterns are random, this will also tie up a lot of bus cycles.

One step up the code-breaking algorithm is an array filling task. You have a large array to be filled. Let's say it's an array of 64-bit integers. Each entry can be independently computed just from its index, but the computation can be more or less involved. Then the parallelized version will have one task filling its portion of the array. If computing an entry is only uses, say, the ALU, then there will be no contention for possibly floating-point hardware. Furthermore, if the memory writes are sequential with no synchronization and no one else reading the results, burst writes can be used by the hardware, so that performance will be limited by the peak memory write bandwidth. The write bandwidth will simply be a function of the number of CPU cycles required to calculate an entry. The smaller the number of cycles required to compute an entry, the higher the rate at which enrties are written.

Here is a C program for testing multicore performance that implements this idea:

#include <stdio.h>
#include <unistd.h>
#include <alloca.h>
#include <stdlib.h>
#include <inttypes.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <math.h>
#include <stdint.h>
#include <sys/time.h>

static inline int64_t f(int64_t x)
{
  x+=0x0123456789abcdef;
  x=((x >> 54)|(x << 10))^x;
  x*=128939784432432;
  return x;
}

void worker(int64_t m,int64_t *a,int n,int job,int parallel)
{
  int64_t i,i0,i1,j;
  size_t c = m/parallel;
  int64_t x;

  i0 = job*c;
  i1 = i0+c-1;
  if (i1 > m - 1) i1 = m-1;
  
  printf("job=%d i0=%"PRIi64" i1=%"PRIi64"\n", job, i0, i1);
  for (i = i0; i <= i1; i ++) {
    x=i;
    for (j = 0; j <= n; j ++) {
      x=f(x);
    }
    a[i]=x;
  }
}

int main(int argc, char **argv)
{
  int parallel;
  pid_t *pids, pid;
  int i, wstatus;
  size_t count,size;
  void *map;
  int64_t *a;
  int fd = -1;
  int iters;
  struct timeval tv;
  double t0,t1,dt;

  count = atoll(argv[1]);
  parallel = atoi(argv[2]);
  iters = atoi(argv[3]);
  size = count * sizeof(*a);
  size += 4095;
  size >>= 12;
  size <<= 12;
  printf("Allocating %zd MiB\n", size >> 20);

  fd = open("/dev/shm/testmap123", O_RDWR|O_CREAT|O_TRUNC, 0700);
  if (fd < 0) exit(100);
  if (ftruncate(fd, size) < 0) exit(101);
  map = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, fd, 0);

  if (map == MAP_FAILED) exit(100);
  a = map;

  gettimeofday(&tv, NULL);
  t0=tv.tv_sec + tv.tv_usec*(double)1e-6;

  pids = alloca(parallel * sizeof(pids));
  for (i = 0; i < parallel; i ++) {
    pid = fork();
    if (pid == 0) {
      worker(count,a,iters,i,parallel);
      return 0;
    } else {
      pids[i] = pid;
    }
  }

  for (i = 0; i < parallel; i ++) {
    pid = waitpid(pids[i], &wstatus, 0);
    if (pid < 0) {
      exit(101);
    }
  }
  gettimeofday(&tv, NULL);
  t1=tv.tv_sec + tv.tv_usec*(double)1e-6;
  dt=t1-t0;

  printf("Δt=%g %g/sec\n", dt, count/dt);

  return 0;
}

It can be compiled using gcc -Wall -O9 mmapcache.c -o mmapcache -march=native and is executed using ./mmapcache DATASIZE JOBS ITERS where DATASIZE is the size of the array (in words), JOBS is the number of processes to launch and ITERS is the number of iterations per entry.

Each entry is computed by iterating a simple arithmetic function f on a starting value made of the entry index and the job number.

On an Intel Xeon E5-2687W v4 (1.2 GHz) server, with 10 cores, the results are as follows:

With 100 iterations (size=441000000 or 336 MiB):

5 Mentry/s with 1 job (1×)
42 Mentry/s with 10 jobs (9×, size bumped up by 10×)
80 Mentry/s with 20 jobs (20×, same size as for 10×)

With 1 iteration (size=4410000000 or 3364 MiB):

295 Mentry/s with 1 job (1×)
908 Mentry/s with 4 jobs (3×)
1158 Mentry/s with 6 jobs (4×)
1222 Mentry/s with 8 jobs (4×)
1237 Mentry/s with 10 jobs (4×)
1246 Mentry/s with 20 jobs (4×)

The numbers are within 10 %

Two conclusions:

As expected, memory bandwidth seems to be a bottleneck, the bandwidth being about 10 GB/s at 1200 Mentry/s. This is puzzling given the DDR4 expected write bandwidths of 40-50 GB/s. For each process, the write addresses are sequential, and the segments are well-separated, so there should generally be no page conflicts between processes. I'm expecting the writes to be accumulated in each cache and then to be flushed in nice, large chunks maximizing bandwidth.
Hyperthreading works for basic ALU operations, so that one gets 20× speed-up with 20 hyperthreaded cores when memory or FPU operations are not a bottleneck.
Note that the MAP_POPULATE flag causes Linux to prepare the page tables before the timing starts. Initially, I wasn't using this option and the results were quite scattered.

Update:

Running a version using pthread_create() instead of fork() gives identical results.