We present and analyze several simple algorithms for accurately summing n floating point numbers, independent of how much cancellation occurs in the sum. Let f be the number of significant bits in the sum (si). We assume a register is available with F > f significant bits. Then assuming that (1) n <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, then simply summing the si in decreasing order of magnitude yields S rounded to within just over 1.5 units in its last place. If S = 0, then it is computed exactly. If we increase n slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (F >= 2f). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as n <= 33. If double extended is used n can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.




Download Full History