Friday, 29 August 2008

Multicore Multifiasco

Dave Patterson is one of the most respected names in computer architecture, and rightly so, having been one of the co-founders of the RISC movement, as well as being one of the prime movers behind RAID, among other things.

Nonetheless, when he posted an article on the potential and pitfalls of multicore designs on the Computing Community Consortion blog, I couldn't help but post the following, repeated here with a few additions.

It’s my experience that every decade or so, a packaging breakthrough allows some previously forgotten or abandoned approach to be resurrected at a lower price point, and all the previous lessons are forgotten.

Here are a few of those lessons:

  1. parallel programming is inherently hard, and tools and techniques claiming to avoid the problems never work as advertised
  2. heterogeneous and asymmetric architectures are much harder to program effectively than homogeneous symmetric architectures
  3. Programmer-managed memory is much harder to use than system-managed memory (whether by the operating system or by hardware)
  4. Specialist instruction sets are much harder to use effectively than general-purpose ones

I nearly fell out of my chair laughing when the Cell processor was launched. It contained all four of these errors in one design. Despite the commercial advantages in bringing out a game that fully exploited its features, as I recall, only one game available when the Playstation 3 was launched came close.

I hereby announce Machanick’s Corollory to Moore’s Law: any rate of improvement in the number of transistors you can buy for your money will be matched by erroneous expectations that programmers will become smarter.

Unfortunately there is no Moore’s Law for IQ.

The only real practical advantage of multicore over discrete chip multiprocessors (aside from the packaging and cost advantages) is a significant reduction of interprocess communication costs — provided IPC is core-to-core, i.e., if you communicate through shared memory, you’d better make sure that the data is cached before the communication occurs. That makes the programming problem harder, not easier (see Lesson 3 above).

Good luck with transactional memories and all the other cool new ideas. Ask yourself one question: do they make parallel programming easier, or do they add one more wrinkle for programmers to take care of — that may be different in the next generation or on a rival design?

Putting huge numbers of cores on-chip is a losing game. The more you add, the smaller the fraction of the problem space you are addressing, and the harder you make programming. I would much rather up the size of on-chip caches to the extent that they effectively become the main memory, with off-chip accesses becoming page faults (I call this notion RAMpage). Whether you go multicore or aggressive uniprocessor, off-chip memory is a major bottleneck.

As Seymour Cray taught us, the thing to aim for is not peak throughput, but average throughput. 100 cores each running at 1% of its full speed because or programming inefficiencies, inherent nonparallelism of the workload and bottlenecks in the memory system is hardly an advance on two to four cores each running at at least 50% of its full speed.

In any case all of this misses the real excitement in the computing world: turn Moore’s Law on its head, and contemplate when something that cost $1,000,000 will cost $1. That’s the point where you can do something really exciting on a small, almost free device.

Further reading

My PhD thesis, completed in 1996, but now relevant to a wider audience since multicore has become mainstream, is available on Amazon. See Other Links.

Some interesting stuff here: The Perils of Parallel: Vive la (Killer App) Révolution!