Updates on the need to use error-correcting memory

In my previous post "On the need to use error-correcting memory" I applied per-bit upset rates given in Eugene Normand's paper to present-day memory sizes, and obtained a likely bit error every three days for 4 gigabytes of RAM. My aim was to make a rational decision about whether to use ECC memory or not. It has been pointed out that this computation is invalid, since it applies to old memory chips which had much lower densities. Strangely, it is quite difficult to find reliable figures for modern DRAM chips, so I redid some calculations. Also, partial data released by Google indicates that you have about one chance in three of getting a computer that produces far more memory errors than my initial computation predicts - not because of atmospheric neutrons this time, but because of bad hardware. Still, ECC helps even in that case.

Previous claims

Based on sea-level bit upset probabilities given in Eugene Normand's SEU at ground level paper, I computed that if you have 4 GiB of memory, you have a 96% chance of getting a bit flip in three days because of cosmic rays. SECDED ECC would reduce that to a negligible one chance in six billion.
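
For reference, here is a sketch of the arithmetic behind that figure, using a Poisson model. The per-bit rate of 1.3e-12 upsets per hour is an assumption chosen to be of the order given in Normand's tables and to reproduce the 96% figure; the exact value may differ slightly.

    from math import expm1

    BITS = 4 * 2**30 * 8          # 4 GiB of RAM, in bits
    RATE = 1.3e-12                # assumed upsets per bit per hour (Normand-era order of magnitude)
    HOURS = 3 * 24                # three days

    expected = BITS * RATE * HOURS        # expected number of upsets in three days (~3.2)
    p_at_least_one = -expm1(-expected)    # Poisson: 1 - exp(-expected)
    print(p_at_least_one)                 # ~0.96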

Criticisms

The post has been discussed on Reddit and at Y Combinator, and a number of criticisms have been raised.

The Toronto paper

The paper DRAM Errors in the Wild: A Large-Scale Field Study gives interesting statistics about error rates during a timespan of two and a half years on the "majority" of Google's deployed servers. As its lead author, Bianca Schroeder, is from the University of Toronto, I'll call that the "Toronto paper".

One major drawback of the Toronto paper is that important data, such as the number of servers, their exact RAM capacity, and their RAM usage, is not disclosed.

The most interesting thing that we learn from that paper is that cosmic-ray-induced errors are negligible. Indeed, if we are to believe them, two-thirds of their machines (67.8%) detected absolutely no memory errors in two and a half years, in spite of having gigabytes of memory. The exact number of machines is undisclosed, but is believed to lie between 10 000 and 500 000; the average amount of RAM per machine is also undisclosed, as is the average RAM utilization.

I don't think Google runs a major portion of their servers in deep underground vaults beneath meters of rock (that would shield them from atmospheric neutrons). Nor do I think that Google uses only a few hundred megabytes of memory on 60% of its servers (that would reduce the memory footprint and thus the overall susceptibility).

Therefore, the only conclusion is that the hourly bit upset rate due to neutrons has diminished a thousandfold since Normand's study. This can be due to two factors: size and sensitivity. The neutron-driven bit-error rate is determined by the effective cross-section of your RAM cells with respect to the neutron flux, and by their sensitivity. The per-bit cross-section depends chiefly on cell area, and thus on bit density. While chip area might have increased by a factor of 10 in 15 years, capacity has increased a thousandfold. Let's try to correctly extrapolate bit upset rates by taking that into account.
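
To see how strong that conclusion is, here is a back-of-the-envelope check using a Poisson model. The 4 GiB of RAM per machine is purely an assumption, since the paper does not disclose it:

    from math import exp, log

    BITS = 4 * 2**30 * 8              # assumed 4 GiB of RAM per machine (not disclosed in the paper)
    HOURS = 2.5 * 365 * 24            # the two-and-a-half-year observation window

    # Probability that one machine sees zero upsets, given a per-bit hourly upset rate
    def p_error_free(rate):
        return exp(-rate * BITS * HOURS)

    print(p_error_free(1.3e-12))      # essentially 0: at Normand-era rates, no machine stays clean

    # Largest per-bit rate compatible with two-thirds of the machines staying error-free
    max_rate = -log(2 / 3) / (BITS * HOURS)
    print(max_rate)                   # ~5e-16 upsets per bit per hour, over 1000x below Normand's figure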

Atmospheric neutron flux at sea level

Apparently we receive, at sea level, around 20 atmospheric neutrons per square centimeter per hour. They mostly come vertically. So it might be best to place your DIMMs so that the chips are vertical (to minimize cross-section), but not on top of each other (to reduce the number of chips affected by a single neutron).

Susceptibility per unit fluence

In table 5 of Normand's paper, we see that the CRAY YMP-8 computer uses DRAM similar to the TMS44100, which is a 4 Mibit (4 194 304 bit) DRAM.

From DRAM data at IC Knowledge we see that the die size for such a chip is probably one of 23, 44 or 105 square millimeters.

For a chip of 105 mm² (resp. 44, 23 mm²), assuming that bits take up most of the area of the chip, we have areas of approximately 25 µm² (resp. 10, 5.5 µm²) per bit. The upset rate is given as 1.8e-12 upsets per bit per hour. At 20 neutrons per square centimeter per hour, this gives 5e-6, 2.1e-6 and 1.1e-6 neutrons per bit per hour, and thus susceptibilities of 3.6e-7, 8.6e-7 and 1.6e-6 upsets per neutron.

In other words, you have to bombard a single cell with an average of 2.7 million, 1.2 million or 625 thousand neutrons to flip a bit on a chip similar to a TMS44100.
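
For the curious, here is the same arithmetic as a few lines of Python; the figures it prints match the ones above up to rounding:

    FLUX = 20 / 1e8                   # 20 neutrons/cm²/h, expressed per µm² per hour
    BITS = 4 * 2**20                  # TMS44100: 4 Mibit
    UPSET_RATE = 1.8e-12              # upsets per bit per hour, from Normand's table 5

    for die_mm2 in (105, 44, 23):     # candidate die sizes from IC Knowledge
        cell_um2 = die_mm2 * 1e6 / BITS              # µm² per bit, if cells fill the die
        hits_per_bit = FLUX * cell_um2               # neutrons per bit per hour
        susceptibility = UPSET_RATE / hits_per_bit   # upsets per incident neutron
        print(cell_um2, hits_per_bit, susceptibility, 1 / susceptibility)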

Now let's take a modern computer with 2 GiB of memory, organized as two DIMMs of 8 chips each, for a total of sixteen 1-Gibit chips. From the same page on DRAM sizes we see that our chips may have dies ranging from 103.7 to 400 mm². As we can see, dies have gotten bigger. But cell sizes went down to 0.096 or 0.372 µm², respectively. Thus we have 1.92e-8 to 7.44e-8 neutrons per cell per hour. If we assume a similar susceptibility, we get from 6.9e-15 to 1.19e-13 upsets per bit per hour.

With 16 Gibit of memory, that gives 1.18e-4 to 2e-3 upsets per hour, i.e. a bit error every 500 to 8500 hours, that is, every 21 to 354 days.

Under these assumptions, and going back to the 4 GiB of RAM of the original claim, you'll have to wait about 33 to 600 days to get a 96% chance of seeing a bit error.
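
Here is the extrapolation as a small Python sketch; it reproduces the ranges above up to rounding (the 96% figures are computed for 4 GiB, as in the original claim):

    from math import log

    FLUX = 20 / 1e8                       # neutrons per µm² per hour
    SUSCEPTIBILITY = (3.6e-7, 1.6e-6)     # upsets per neutron, range derived above
    CELL_UM2 = (0.096, 0.372)             # cell sizes implied by the two 1-Gibit die estimates

    rates = sorted(FLUX * c * s for c in CELL_UM2 for s in SUSCEPTIBILITY)
    low, high = rates[0], rates[-1]       # ~6.9e-15 to ~1.19e-13 upsets per bit per hour

    bits_2gib = 16 * 2**30                # 2 GiB = 16 Gibit
    print(1 / (high * bits_2gib) / 24, 1 / (low * bits_2gib) / 24)   # ~20 to ~351 days per error

    bits_4gib = 32 * 2**30                # the original claim was about 4 GiB
    need = -log(1 - 0.96)                 # expected upsets needed for a 96% chance of at least one
    print(need / (high * bits_4gib) / 24, need / (low * bits_4gib) / 24)  # ~33 to ~565 days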

The low-end rate of 6.9e-15 neutron-induced upsets per bit per hour is thus compatible with the data reported in the Toronto paper.

Conclusions and recommendations

The Toronto paper reports an average of 3751 correctable errors per DIMM per year, and 22 692 errors per machine per year.

We don't know the size distribution of the DIMMs they use, but we know they used 1, 2 and 4 GiB DIMMs. That gives rates of 1.38e-12 to 5.54e-12 upsets/bit/hour, which is, interestingly, similar to Normand's reported cosmic-ray error rates.

The difference is that a large fraction of their DIMMs experience no errors at all for years. This seems to rule out cosmic rays as the cause of most of the errors. Indeed, the coefficient of variation of the error rates across machines is huge, meaning that there is a strong per-machine effect. Maybe some of their machines are contaminated by radioactive materials, subject to abnormally strong electromagnetic fields, or simply substandard.

Nevertheless, this means that when you buy a computer, you are playing DIMM (and motherboard) roulette. You have something like one chance in three to one chance in ten of getting a computer that will experience memory errors at the frightening frequencies (one every few days) that I talked about but attributed to cosmic rays, and that AMD talks about in their whitepaper.

Fortunately, ECC memory also protects against single-bit errors that aren't due to cosmic rays. And if you get lots of bit errors, SECDED ECC won't be able to correct them all, but it will at least detect them.
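
To make the SECDED behaviour concrete, here is a toy sketch: a Hamming(7,4) code extended with an overall parity bit, which corrects any single flipped bit and detects, but cannot correct, any two flipped bits. Real ECC DIMMs use a wider (72,64) code, and this is not the exact code of any particular memory controller, but the principle is the same.

    def encode(nibble):
        """Encode 4 data bits (an int 0..15) into an 8-bit SEC-DED codeword."""
        d = [(nibble >> i) & 1 for i in range(4)]       # data bits d0..d3
        p1 = d[0] ^ d[1] ^ d[3]                         # parity over codeword positions 3, 5, 7
        p2 = d[0] ^ d[2] ^ d[3]                         # parity over positions 3, 6, 7
        p3 = d[1] ^ d[2] ^ d[3]                         # parity over positions 5, 6, 7
        bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]     # Hamming(7,4) codeword, positions 1..7
        p0 = 0
        for b in bits:
            p0 ^= b                                     # overall parity bit, position 0
        return [p0] + bits

    def decode(word):
        """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
        bits = list(word[1:])                           # positions 1..7
        s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]      # check over positions 1, 3, 5, 7
        s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]      # check over positions 2, 3, 6, 7
        s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]      # check over positions 4, 5, 6, 7
        syndrome = s1 | (s2 << 1) | (s3 << 2)           # position of a single flipped bit, if any
        overall = 0
        for b in word:
            overall ^= b                                # 0 if an even number of bits flipped
        if syndrome and overall:                        # one bit flipped: correct it
            bits[syndrome - 1] ^= 1
            status = 'corrected'
        elif syndrome:                                  # two bits flipped: detect only
            return 'uncorrectable', None
        elif overall:                                   # only the overall parity bit flipped
            status = 'corrected'
        else:
            status = 'ok'
        data = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
        return status, data

    word = encode(0b1011)
    word[5] ^= 1                                        # one flipped bit is corrected
    print(decode(word))                                 # ('corrected', 11)
    word[2] ^= 1                                        # a second flipped bit is only detected
    print(decode(word))                                 # ('uncorrectable', None)

Typical ECC DIMMs apply the same idea to 64 data bits with 8 check bits per memory word.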

The conclusion is the same: you ABSOLUTELY need to use ECC memory if you intend to put your computer to any kind of remotely serious use. Testing your computer for a few weeks isn't even sufficient, since errors may manifest themselves later in its life, when you have much more data to lose. In fact, Microsoft recommends ECC for Vista; in other words, ECC is no longer just for servers and aerospace.

If you insist on not using ECC memory, here are a few tips:

2010-02-07