Scary dark silicon is here today

Mike Muller apparently delivered a bit of Halloween scare material at his keynote at ARM Techcon in Santa Clara yesterday. Rick Merritt of EETimes summed up the warning:

In a decade, 11nm process technology could deliver devices with 16 times more transistors running 2.4 times as fast as today's parts, said Mike Muller in a keynote address. But those devices will only use a third as much energy as today's parts, leaving engineers with a power budget so pinched they may be able to activate only nine percent of those transistors, Muller said.

Muller's claim sounds scary but, in reality, dark silicon is already with us today, not ten years away.
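The arithmetic behind the quoted figure can be roughed out as follows. This is only a sketch under my own assumptions: that the "third as much energy" is energy per switching event, that the power budget stays where it is today, and that every active transistor switches at the full new clock rate. The function name and parameters are made up for illustration. On that naive reading you land at about 8 per cent, the same ballpark as the nine per cent reported; the exact assumptions behind Muller's number are not given in the report.

```python
def active_fraction(transistor_gain, speed_gain, energy_per_switch_scale):
    """Share of transistors that can run flat out inside an unchanged power budget.

    All three arguments are relative to today's part (assumed interpretation):
      transistor_gain         - how many times more transistors
      speed_gain              - how many times faster they switch
      energy_per_switch_scale - energy per switching event, relative to today
    """
    # Power needed if every transistor switched at full speed, relative to today.
    full_power = transistor_gain * speed_gain * energy_per_switch_scale
    # With a fixed budget of 1.0, only this share can be active at once.
    return min(1.0, 1.0 / full_power)


# Figures quoted from the keynote: 16x transistors, 2.4x speed, one third the energy.
fraction = active_fraction(16, 2.4, 1 / 3)
print(f"Active share of the die: {fraction:.1%}")
```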

If you put a temperature sensor over a dense chip designed for a portable device today, only a fraction of it would seem to be in use at any one time. At the conference, ARM launched its Cortex-A5, and one thing that vice president of marketing Eric Schorn stressed was the way you could deploy multiple cores and run them slowly, or not at all, to save power.

It's a technique that Apple seems likely to use in its own-design ARM processor. As Francis Sideco of iSuppli pointed out last year using the example of a seemingly over-powered V8 engine: "The car companies have done the same thing. If you are cruising down the highway, they shut down four out of the eight cylinders to save fuel."

Schorn's example was a browser running four threads that might take up 80 per cent of the cycles on a single core. Pass those out to four cores and you could wind down the clock frequency. Not only that, you can reduce the voltage that feeds each core, which is where you get the real energy saving. On a 40nm process, this technique should yield a 50 per cent energy reduction. There will be losses through leakage so, as voltage scaling gets harder - at this level, you are talking about changes of 0.2V or so - you may take the alternative route of running threads as fast as they will go and then simply shutting down until there is something else to do.
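To see why the voltage matters more than the clock, here is a minimal sketch using the textbook approximation that dynamic energy per cycle scales with the square of the supply voltage. Spreading the threads over four slower cores leaves the total number of cycles roughly unchanged, so the dynamic energy for the job scales with the voltage squared. Leakage is ignored, and the 1.1V and 0.9V figures are hypothetical values picked to show a 0.2V change; they are not ARM's numbers, and the result is not meant to reproduce the 50 per cent quoted above.

```python
import math


def dynamic_energy_ratio(v_old, v_new):
    """Dynamic energy for the same number of cycles, new voltage versus old.

    Assumes energy per cycle is proportional to V squared and ignores leakage.
    """
    return (v_new / v_old) ** 2


def voltage_for_saving(v_old, saving):
    """Supply voltage needed to cut dynamic energy by `saving` (0.5 means 50 per cent)."""
    return v_old * math.sqrt(1.0 - saving)


# Hypothetical supply voltages: one fast core at 1.1V versus four slow cores at 0.9V.
ratio = dynamic_energy_ratio(1.1, 0.9)
print(f"Dynamic energy after scaling: {ratio:.0%} of the original")

# Under the same simple model, the voltage a 50 per cent saving would call for.
print(f"Voltage for a 50 per cent saving from 1.1V: {voltage_for_saving(1.1, 0.5):.2f}V")
```

In this simple model a 0.2V drop on its own buys roughly a third, which is a reminder of how little voltage headroom there is left to play with.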

The upshot of all this is that, even now, designers assume that large chunks of the die will be doing sod all at any given point in time. The idea of designing chips that are only 10 per cent active at any one time does not seem that unusual or even undesirable. Dedicated hardware is, in general, way more efficient than software running on a general-purpose processor, although it is wasteful of die area. But if you know that transistors are cheap and getting cheaper, and you can never run everything at once without melting the chip, why not throw hardware at the problem?

In practice, you will see a compromise between dedicated hardware and general-purpose processing. But the principle will be that, most of the time, that silicon will indeed be dark. On top of that, there will be circuit-level tricks that will eke more useful work out of the watts available. ARM is working on its own implementations of the Razor system, and it's possible that, with lots of transistors available, sub-threshold switching, used today in some medical devices, could be practical for a larger variety of systems.

2 Comments

This is definitely a growing trend, but is not really that new. It has always been true of most chips that only a small fraction of flip-flops are doing real work at any given time. From a functional point of view, chips have long been significantly 'dark'.
You can save a lot of power by exploiting this fact and gating the clock to flip-flops when they aren't switching. This is called "activity-driven clock gating" and can only be done at the gate level. It is quite distinct from traditional RTL-level clock gating, which is not activity-aware - the RTL tool simply swaps out a bank of recirculation muxes for a clock gate and neither knows nor cares what the activity on those muxes is.
RTL clock gating is a good thing - it saves so much area it almost always ends up saving power. But activity data at the gate-level reveals up to 3X more opportunities for clock gating that cannot be seen at the RTL level. This allows you to save a lot more power. Check out Azuro's PowerCentric CTS tool for info on this increasingly common technique.

Thanks Marc,

That's a good point. Without it, CMOS could never have provided the densities that it does (and people cheerfully traded something like a 2.5x drop in chip density when they swapped from NMOS to CMOS during the 1980s in order to get where we are now). I wonder what Mike Muller's percentage - adjusted for leakage of course - would look like if the actual work time versus static state time were taken into account.