Via Microsoft Research
-----
Abstract:
We present the first large-scale analysis of hardware failure rates on a
million consumer PCs. We find that many failures are neither transient
nor independent. Instead, a large portion of hardware-induced failures
are recurrent: a machine that crashes from a fault in hardware is up to
two orders of magnitude more likely to crash a second time. For example,
machines with at least 30 days of accumulated CPU time over an 8-month
period had a 1 in 190 chance of crashing due to a CPU subsystem fault.
Further, machines that crashed once had a probability of 1 in 3.3 of
crashing a second time. Our study examines failures due to faults within
the CPU, DRAM and disk subsystems. Our analysis spans desktops and
laptops, CPU vendor, overclocking, underclocking, generic vs. brand
name, and characteristics such as machine speed and calendar age. Among
our many results, we find that CPU fault rates are correlated with the
number of cycles executed, underclocked machines are significantly more
reliable than machines running at their rated speed, and laptops are
more reliable than desktops.
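
As a rough check on the "up to two orders of magnitude" claim, the two CPU figures quoted in the abstract can be compared directly. The sketch below assumes the 1 in 190 and 1 in 3.3 probabilities are measured over comparable machines and time windows, which the abstract does not state; the names `p_first_crash`, `p_second_crash`, and `relative_risk` are illustrative and do not come from the paper.

```python
import math

# Figures quoted in the abstract for CPU subsystem faults.
p_first_crash = 1 / 190    # machines with 30+ days of accumulated CPU time
p_second_crash = 1 / 3.3   # machines that have already crashed once


def relative_risk(p_after: float, p_base: float) -> float:
    """Return how many times larger p_after is than p_base."""
    return p_after / p_base


# Assumption: the two probabilities are directly comparable.
ratio = relative_risk(p_second_crash, p_first_crash)
orders_of_magnitude = math.log10(ratio)

print(f"A second crash is about {ratio:.1f}x more likely than a first")
print(f"That is about {orders_of_magnitude:.2f} orders of magnitude")
```

Under that assumption the CPU ratio comes out between one and two orders of magnitude, which fits the abstract's "up to" wording.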
The full document @ Microsoft Research