Soft error rates in modern computers

This post supersedes my previous two posts on memory soft errors.

Manufacturers of computer memory modules (typically DIMM SDRAMs) do not provide reliability data to the general public, especially data on the susceptibility of their memory modules to radiation effects. I therefore try to estimate this data from what is available on the open web.

This is the third post on the subject; see the previous two posts, the first and the second ones.

Cosmic rays and atmospheric neutrons

Cosmic rays are particles, such as protons and electrons, that travel through space at great velocities. Their origin is not well understood. Most of them are stopped by our atmosphere, where they produce so-called secondary neutrons; only the most energetic reach the terrestrial surface. These neutrons can reach the ground. While the proportion of neutrons of a given energy level is roughly constant regardless of longitude, latitude or altitude within the atmosphere, their total count varies. According to Ziegler (1998), at ground level 96% of the particles that reach the ground are neutrons, 3% are pions and 3% are protons; at an altitude of 10 km, pions account for 36% and protons for 12% of the total particles.

These neutrons are called atmospheric neutrons. If they have a sufficient energy level, atmospheric neutrons can affect electronics.

At sea level, an average of 10 neutrons of energies above 1 MeV hit each square centimeter each hour. This rate goes to 3000 neutrons per square centimeter per hour at an altitude of 10 km. Neutrons of energies less than 1 MeV do not affect electronics, unless they are converted to something else, for instance by neutron capture by the boron isotope ¹⁰B.

What kind of effects can neutrons create in electronics?

Neutrons (as well as other kinds of particles such as protons and alpha particles) can create transient faults as well as permanent damage. Permanent damage is unlikely with atmospheric neutrons.

Neutrons can change the contents of memory cells, i.e. cause a bitflip. Such events are called single event upsets (SEU). One neutron can affect one or more neighbouring cells. The affected cell continues to behave normally, but with altered content. When the content of the cell is rewritten, everything goes back to normal.

Neutrons can also create single event latchups (SEL). In these, a sequential element such as a latch locks up and starts consuming an abnormally high amount of current. Power-cycling the device is usually needed to reset the element, unless the element ends up destroying itself.

How much do atmospheric neutrons affect electronics?

SEU cross-section

The susceptibility of a device is the probability that it will fault per unit fluence. A fluence is expressed in particles per square centimeter, where the energy distribution of the particles is agreed upon. Since a probability has no unit, it follows that susceptibility should have the inverse unit of fluence, that is square centimeters per particle. As "particle" doesn't have a unit either, the susceptibility is expressed just in square centimeters. In other words, it is an area. That area is called the cross-section. For SEUs, it is called the SEU cross-section.

If a device has a SEU cross-section of 0.001 cm² for neutrons of atmospheric energy distribution, and if it is subject to a flux of 20 neutrons per square centimeter per hour, then it will experience an average of 20 × 0.001 = 0.02 upsets per hour.
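The calculation above can be sketched in a few lines of Python. The function name is mine, chosen for illustration only; the numbers are the ones from the example.

```python
def upsets_per_hour(cross_section_cm2: float, flux_per_cm2_per_hour: float) -> float:
    """Average upset rate: SEU cross-section (cm²) times particle flux (per cm² per hour)."""
    return cross_section_cm2 * flux_per_cm2_per_hour


# Values from the example: 0.001 cm² cross-section, 20 neutrons/cm²/hour.
rate = upsets_per_hour(0.001, 20.0)
print(f"{rate:.2f} upsets per hour")
print(f"about {1.0 / rate:.0f} hours between upsets on average")
```

The inverse of the rate gives the average time between upsets, which is often the more intuitive figure.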

SRAM, DRAM and technology

Random-access memories (RAM) come in SRAM (Static RAM) and DRAM (Dynamic RAM) varieties. SRAM basically stores a bit in a flip-flop made of a few transistors, while DRAM stores a bit in a capacitor driven by a single transistor. The electric charge stored in the capacitor defines the bit it contains: if it is less than half-full, it stores a 0, otherwise it stores a 1. Since the capacitor inevitably leaks, DRAM cells need to be refreshed regularly (usually every few ms) by refilling them if they contain a 1.

Integrated circuits are made by etching silicon wafers; the basic precision of the etching is called the track width or technology and defines the density, that is the number of elements (transistors, capacitors) per unit area that can be placed.

As memory cells get smaller, the probability that a given cell will be hit by a neutron also gets smaller; but memory sizes tend to increase proportionally to cell density. We basically have the same number and area of memory chips as 10 or 20 years ago, but the capacities of the chips have increased a thousand- or million-fold. One memory chip has approximately one square centimeter of die area, while an average computer usually has 8 to 16 chips. So our typical computers have had somewhere around ten square centimeters of memory for two decades.

So the probability that a computer's memory will be hit by a neutron is basically the same today as it was ten or twenty years ago. But at 10 neutrons per square centimeter per hour, surely not every neutron causes an upset. Could smaller and newer memory cells be more sensitive?

Technology and sensitivity of memory cells

Here is the crux of the matter. Elements with no explicit capacitance, such as logic gates and thus SRAM memory cells, get more sensitive as sizes and voltages decrease. On the other hand, manufacturers have managed to keep the charge stored in DRAM cell capacitors in the 50-200 fC (femtocoulomb) range, in spite of order-of-magnitude reductions in size, thus maintaining their susceptibility.

Failure in time values for DRAMs and SRAMs

So what are the actual values?

Vendors generally do not provide data on DRAM soft error sensitivity to the public.

The SEU at ground level (199?) paper by Eugene Normand gives upset rates of about 2.1 upsets/bit/hour for 4Mbit DRAMs, which translates to 3 523 FIT per chip.

In 1996, J. F. Ziegler et al., in "Cosmic Ray Soft Error Rates of 16 Mb DRAM Memory Chips", obtained values of 8e-15, 200e-15 and 2000e-15 soft errors per bit per hour for three kinds of DRAM chips, which translates to FIT values of 134, 3 355 and 33 554 per chip.
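As a rough check, the conversion from a per-bit hourly rate to FIT per chip can be sketched as follows. FIT counts failures per billion device hours. The figures above only come out if a 16 Mbit chip is taken as 16 × 2^20 bits, so that is assumed here; the function and constant names are illustrative.

```python
HOURS_PER_BILLION = 1e9              # FIT = failures per 1e9 device hours
BITS_PER_16MBIT_CHIP = 16 * 2**20    # assumption: binary megabits


def fit_from_bit_rate(errors_per_bit_hour: float, bits_per_chip: int) -> float:
    """Convert a soft error rate per bit per hour into FIT per chip."""
    return errors_per_bit_hour * bits_per_chip * HOURS_PER_BILLION


# The three per-bit rates quoted from Ziegler et al.
for bit_rate in (8e-15, 200e-15, 2000e-15):
    print(round(fit_from_bit_rate(bit_rate, BITS_PER_16MBIT_CHIP)))
```

This prints 134, 3355 and 33554, matching the per-chip values quoted above.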

According to "Memory systems: cache, DRAM, disk" (2007), p. 33, 130 nm DRAM devices experience 1000 FIT, while 130 nm SRAMs experience 100 000 FIT due to soft errors. On page 34, they say that Mosys claimed that SRAMs had about 100 000 FIT per megabit at 250 nm and 100 000 000 FIT per megabit at 130 nm, an astonishing rate that is probably a typo. If not, your average desktop CPU with its 2.5 MB of cache might experience one bit flip every half an hour.
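The "half an hour" figure follows directly from the claimed rate. Here is a minimal sketch, assuming the 2.5 MB cache is counted as 2.5 × 8 = 20 megabits and that the claimed per-megabit FIT simply scales with size; the names are illustrative.

```python
HOURS_PER_BILLION = 1e9  # FIT = failures per 1e9 device hours


def hours_between_failures(total_fit: float) -> float:
    """Average number of hours between failures for a given total FIT."""
    return HOURS_PER_BILLION / total_fit


cache_megabits = 2.5 * 8                 # 2.5 MB of cache, in megabits
claimed_fit_per_megabit = 100_000_000    # the (probably mistyped) 130 nm figure
total_fit = claimed_fit_per_megabit * cache_megabits

print(hours_between_failures(total_fit), "hours between bit flips")
```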

Eishi Ibe, in "Terrestrial Neutron-induced Soft Errors In Advanced Memory Devices" (2008), gives SEU cross-sections ranging from 1e-9 to less than 3e-7 cm² for 256 Mbit DRAMs, which maps to 10 to 3000 FIT (failures in time, i.e. failures per billion device hours), with a decreasing trend. Even with the highest FIT value and 16 chips, the average computer will go about two years between SEUs. On the other hand, for SRAMs, the cross-section goes from 1e-7 to 5e-6 cm² for 8 Mbit devices, i.e. failure rates from 1000 to 50 000 FIT. But what little SRAM you have in your computer is mostly on your CPU and is mostly ECC-protected.
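The mapping from cross-section to FIT can be reproduced with a short sketch. It assumes the sea-level flux of 10 neutrons per square centimeter per hour mentioned earlier; that value makes the quoted numbers line up, but it is my assumption about how the mapping was done. The names are illustrative.

```python
SEA_LEVEL_FLUX = 10.0     # neutrons per cm² per hour (assumed, from earlier in this post)
HOURS_PER_BILLION = 1e9   # FIT = failures per 1e9 device hours


def fit_from_cross_section(cross_section_cm2: float, flux: float = SEA_LEVEL_FLUX) -> float:
    """FIT of one device: cross-section times flux, scaled to a billion hours."""
    return cross_section_cm2 * flux * HOURS_PER_BILLION


# DRAM bounds (1e-9 and 3e-7 cm²), then SRAM bounds (1e-7 and 5e-6 cm²).
for sigma in (1e-9, 3e-7, 1e-7, 5e-6):
    print(f"{sigma:g} cm² -> {fit_from_cross_section(sigma):.0f} FIT")
```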

According to R.C. Baumann in "Radiation-Induced Soft Errors in Advanced Semiconductor Technologies", the DRAM soft error rate per chip has remained fairly constant from 800 nm to 100 nm processes, while the SRAM soft error rate per bit has also remained constant. But he does not provide a base rate for DRAM. If we take Ziegler's values, then we expect a value between 100 FIT (for DRAM chips having trench cells with internal charge) and 30 000 FIT per chip.

An October 2009 AMD document, "The Value of Using ECC Memory In Embedded Applications", talks of one bitflip per 2-4 weeks per GB of RAM. Assuming 16 chips for a GB, this amounts to an astonishing 93 000 to 186 000 FIT per chip. However, the AMD document gives no source for that claim.
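The conversion from "one bitflip every N weeks per GB" to FIT per chip can be sketched like this, assuming 16 chips per GB as above and 168 hours per week. The function name is illustrative.

```python
HOURS_PER_BILLION = 1e9   # FIT = failures per 1e9 device hours
HOURS_PER_WEEK = 7 * 24


def fit_per_chip_from_interval(weeks_between_flips: float, chips_per_gb: int = 16) -> float:
    """FIT per chip, given one bitflip every `weeks_between_flips` per GB of RAM."""
    flips_per_hour_per_gb = 1.0 / (weeks_between_flips * HOURS_PER_WEEK)
    return flips_per_hour_per_gb / chips_per_gb * HOURS_PER_BILLION


for weeks in (4, 2):
    print(f"one flip per {weeks} weeks -> {fit_per_chip_from_interval(weeks):.0f} FIT per chip")
```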

It seems that a typical FIT value of 1500 per chip is reasonable based on the literature. For a computer with 16 memory chips, this would give one atmospheric neutron-induced soft error every 4 to 5 years.
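To see where this interval comes from, here is a small sketch that turns a per-chip FIT into an average number of years between errors for a whole computer. It assumes 16 memory chips and 8766 hours per year; both defaults and the function name are illustrative choices.

```python
HOURS_PER_BILLION = 1e9   # FIT = failures per 1e9 device hours
HOURS_PER_YEAR = 8766.0   # 365.25 days


def years_between_errors(fit_per_chip: float, chips: int = 16) -> float:
    """Average years between soft errors for a computer with `chips` memory chips."""
    total_fit = fit_per_chip * chips
    return HOURS_PER_BILLION / total_fit / HOURS_PER_YEAR


print(f"{years_between_errors(1500):.1f} years at 1500 FIT per chip")
print(f"{years_between_errors(3000):.1f} years at 3000 FIT per chip")
```

With fewer chips the interval grows in proportion: 8 chips at 1500 FIT would double it.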

Conclusion

2010-02-07