Go to main content

PDF

Description

We present and analyze several simple algorithms for accurately summing n floating point numbers, independent of how much cancellation occurs in the sum. Let f be the number of significant bits in the sum (si). We assume a register is available with F > f significant bits. Then assuming that (1) n <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, then simply summing the si in decreasing order of magnitude yields S rounded to within just over 1.5 units in its last place. If S = 0, then it is computed exactly. If we increase n slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (F >= 2f). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as n <= 33. If double extended is used n can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.

Details

Files

Statistics

from
to
Export
Download Full History
Formats
Format
BibTeX
MARCXML
TextMARC
MARC
DublinCore
EndNote
NLM
RefWorks
RIS