Problem description:

Yeah, I meant to say *80-bit*. That's not a typo...

My experience with floating point variables has always involved 4-byte multiples, like singles (32 bit), doubles (64 bit), and long doubles (which I've seen referred to as either 96-bit or 128-bit). That's why I was a bit confused when I came across an 80-bit extended precision data type while I was working on some code to read and write to AIFF (Audio Interchange File Format) files: an extended precision variable was chosen to store the sampling rate of the audio track.

When I skimmed through Wikipedia, I found the link above along with a brief mention of 80-bit formats in the IEEE 754-1985 standard summary (but not in the IEEE 754-2008 standard summary). It appears that on certain architectures "extended" and "long double" are synonymous.

One thing I haven't come across are specific applications that make use of extended precision data types (except for, of course, AIFF file sampling rates). This led me to wonder:

- Has anyone come across a situation where extended precision was necessary/beneficial for some programming application?
- What are the benefits of an 80-bit floating point number, other than the obvious "it's a little more precision than a double but fewer bytes than most implementations of a long double"?
- Is its applicability waning?
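
For anyone who runs into the same AIFF field, here is a minimal Python sketch of how such a 10-byte value can be packed and unpacked. It assumes the x87-style layout (1 sign bit, 15 exponent bits with a bias of 16383, and a 64-bit mantissa with an explicit leading bit) stored big-endian. The function names are made up for illustration, and infinities and NaNs are deliberately not handled.

```python
import math
import struct

EXPONENT_BIAS = 16383  # bias of the 15-bit exponent field


def decode_extended80(data: bytes) -> float:
    """Convert a 10-byte big-endian extended value to a Python float (a double)."""
    if len(data) != 10:
        raise ValueError("expected exactly 10 bytes")
    sign_and_exponent, mantissa = struct.unpack(">HQ", data)
    exponent = sign_and_exponent & 0x7FFF
    if exponent == 0x7FFF:
        raise ValueError("infinity/NaN encodings are not handled in this sketch")
    sign = -1.0 if sign_and_exponent & 0x8000 else 1.0
    # The mantissa is an integer whose binary point sits right after its top
    # bit, hence the extra shift by 63. Precision beyond 53 bits is lost here,
    # and values too small for a double underflow to zero.
    return sign * math.ldexp(mantissa, exponent - EXPONENT_BIAS - 63)


def encode_extended80(value: float) -> bytes:
    """Convert a finite Python float to the 10-byte big-endian extended form."""
    if not math.isfinite(value):
        raise ValueError("infinity/NaN are not handled in this sketch")
    sign = 0x8000 if math.copysign(1.0, value) < 0 else 0
    if value == 0:
        return struct.pack(">HQ", sign, 0)
    # abs(value) == fraction * 2**exponent with 0.5 <= fraction < 1
    fraction, exponent = math.frexp(abs(value))
    mantissa = int(math.ldexp(fraction, 64))  # top (explicit) bit is set
    return struct.pack(">HQ", sign | (exponent - 1 + EXPONENT_BIAS), mantissa)
```

With this sketch, `encode_extended80(44100.0).hex()` gives `400eac44000000000000`, and decoding those bytes returns `44100.0`.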

Intel's FPUs use the 80-bit format internally to get more precision for intermediate results.

That is, you may have 32-bit or 64-bit variables, but when they are loaded into the FPU registers, they are converted to 80 bits; the FPU then (by default) performs all calculations in 80 bits; after the calculation, the result is stored back into a 32-bit or 64-bit variable.

BTW - A somewhat unfortunate consequence of this is that debug and release builds may produce slightly different results: in the release build, the optimizer may keep an intermediate variable in an 80-bit FPU register, while in the debug build, it will be stored in a 64-bit variable, causing loss of precision. You can avoid this by using 80-bit variables, or by using an FPU switch (or compiler option) to perform all calculations in 64 bit.
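
The effect can be imitated without an x87 FPU. The sketch below is an emulation in Python using exact fractions, not the FPU itself: `round_to_bits` and `evaluate` are hypothetical helpers, and exponent range limits are ignored. It evaluates `(x + y) - x` once with every intermediate rounded to 53 significant bits (as if spilled to a double) and once with intermediates kept at 64 bits (as if held in an extended register), storing only the final result as a double.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round x to `bits` significant bits (ties to even), ignoring exponent limits."""
    if x == 0:
        return x
    # Choose e so that 2**(bits-1) <= |x| / 2**e < 2**bits.
    e = abs(x).numerator.bit_length() - x.denominator.bit_length() - bits
    while abs(x) >= Fraction(2) ** (e + bits):
        e += 1
    while abs(x) < Fraction(2) ** (e + bits - 1):
        e -= 1
    return round(x / Fraction(2) ** e) * Fraction(2) ** e


def evaluate(x: float, y: float, intermediate_bits: int) -> float:
    """Compute (x + y) - x with intermediates held at the given precision."""
    total = round_to_bits(Fraction(x) + Fraction(y), intermediate_bits)
    diff = round_to_bits(total - Fraction(x), intermediate_bits)
    return float(diff)  # the final store to a 64-bit double


x, y = 2.0**53, 1.0
print(evaluate(x, y, 53))  # 0.0: the 1.0 is lost when x + y is rounded to a double
print(evaluate(x, y, 64))  # 1.0: the extended intermediate holds 2**53 + 1 exactly
```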

For me the use of 80 bits is ESSENTIAL. This way I get high-order (30,000) eigenvalues and eigenvectors of symmetric matrices with four more figures when using the GOTO library for vector inner products, viz., 13 instead of 9 significant figures for the kind of matrices that I use in relativistic atomic calculations, which is necessary to avoid falling into the sea of negative-energy states. My other option is using quadruple-precision arithmetic that increases CPU time 60-70 times and also increases RAM requirements. Any calculation relying on inner products of large vectors will benefit. Of course, in order to keep partial inner product results within registers it is necessary to use assembler language, as in the GOTO libraries. This is how I came to love my old Opteron 850 processors, which I will be using as long as they last for that part of my calculations.
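
To illustrate why keeping the partial sums in extended registers helps, here is a small emulation in Python. This is not the GOTO library's code, only a sketch under stated assumptions: `round_to_bits` and `dot` are hypothetical helpers that round every product and partial sum to a chosen number of significant bits, exponent limits are ignored, and the input vectors are contrived so the effect is easy to see.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round x to `bits` significant bits (ties to even), ignoring exponent limits."""
    if x == 0:
        return x
    # Choose e so that 2**(bits-1) <= |x| / 2**e < 2**bits.
    e = abs(x).numerator.bit_length() - x.denominator.bit_length() - bits
    while abs(x) >= Fraction(2) ** (e + bits):
        e += 1
    while abs(x) < Fraction(2) ** (e + bits - 1):
        e -= 1
    return round(x / Fraction(2) ** e) * Fraction(2) ** e


def dot(a, b, accumulator_bits: int) -> float:
    """Inner product with products and partial sums held at the given precision."""
    acc = Fraction(0)
    for x, y in zip(a, b):
        product = round_to_bits(Fraction(x) * Fraction(y), accumulator_bits)
        acc = round_to_bits(acc + product, accumulator_bits)
    return float(acc)  # only the final result is stored as a double


# One large term followed by 1024 tiny ones; the exact result is 1 + 2**-50.
a = [1.0] + [2.0**-60] * 1024
b = [1.0] * len(a)

print(dot(a, b, 53) == 1.0)              # True: every tiny term is rounded away
print(dot(a, b, 64) == 1.0 + 2.0**-50)   # True: the 64-bit accumulator keeps them
```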

The reason 80 bits is fast, whereas greater precision is so much slower, is that the CPU's standard floating-point hardware has 80-bit registers. Therefore, if you want the extra 16 bits (11 extra bits of mantissa, four extra bits of exponent, and one extra bit, the explicitly stored leading mantissa bit, effectively unused), then it doesn't really cost you much to extend from 64 to 80 bits -- whereas to extend beyond 80 bits is extremely costly in terms of run time. So, you might as well use 80-bit precision if you want it. It is not cost-free to use, but it comes pretty cheap.
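
The bit accounting above can be checked with a few lines of Python. The `FloatLayout` class is a made-up helper for this sketch; the field widths are the usual ones for a 64-bit double and the x87 80-bit extended format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatLayout:
    exponent_bits: int
    stored_mantissa_bits: int
    explicit_leading_bit: bool

    @property
    def total_bits(self) -> int:
        # One sign bit plus the exponent and stored mantissa fields.
        return 1 + self.exponent_bits + self.stored_mantissa_bits

    @property
    def precision_bits(self) -> int:
        # With an implicit leading bit, precision is one more than what is stored.
        return self.stored_mantissa_bits + (0 if self.explicit_leading_bit else 1)


DOUBLE = FloatLayout(exponent_bits=11, stored_mantissa_bits=52, explicit_leading_bit=False)
EXTENDED = FloatLayout(exponent_bits=15, stored_mantissa_bits=64, explicit_leading_bit=True)

extra_precision = EXTENDED.precision_bits - DOUBLE.precision_bits  # 64 - 53
extra_exponent = EXTENDED.exponent_bits - DOUBLE.exponent_bits     # 15 - 11
# Whatever remains of the 16 extra bits is the explicitly stored leading bit.
leftover = (EXTENDED.total_bits - DOUBLE.total_bits) - extra_precision - extra_exponent

print(extra_precision, extra_exponent, leftover)  # 11 4 1
```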

Wikipedia explains that an 80-bit format can represent an entire 64-bit integer without losing information. Thus the floating-point unit of the CPU can be used to implement multiplication and division for integers.
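
A quick way to see the difference in Python: the hypothetical helper `fits_in_significand` below checks whether an integer can be stored exactly with a given number of mantissa bits (53 for a double, 64 for the extended format). It only looks at the mantissa and assumes the exponent range is large enough.

```python
def fits_in_significand(n: int, significand_bits: int) -> bool:
    """True if integer n is exactly representable with that many mantissa bits."""
    n = abs(n)  # the sign is stored separately
    if n == 0:
        return True
    # Trailing zero bits can be absorbed by the exponent, so strip them first.
    n >>= (n & -n).bit_length() - 1
    return n.bit_length() <= significand_bits


largest_u64 = 2**64 - 1
print(fits_in_significand(largest_u64, 64))  # True: fits the extended mantissa
print(fits_in_significand(largest_u64, 53))  # False: a double has to round it

# Python floats are doubles, so the rounding is directly observable:
print(float(2**53 + 1) == float(2**53))      # True: 2**53 + 1 is not representable
```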

I used 80-bit for some pure math research. I had to sum terms in an infinite series that grew quite large, outside the range of doubles. Convergence and accuracy weren't concerns, just the ability to handle large exponents like 1E1000. Perhaps some clever algebra could have simplified things, but it was way quicker and easier to just code an algorithm with extended precision, than to spend any time thinking about it.
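
To put numbers on the range difference, this sketch computes the largest finite value of each format as an exact Python integer. `max_finite` is a hypothetical helper, and the parameters are the usual precision and exponent widths (53 and 11 bits for a double, 64 and 15 bits for x87 extended).

```python
import sys


def max_finite(precision_bits: int, exponent_bits: int) -> int:
    """Largest finite value of a binary format, as an exact integer."""
    max_exponent = 2 ** (exponent_bits - 1) - 1
    # All mantissa bits set, scaled so the top bit has weight 2**max_exponent.
    return (2**precision_bits - 1) * 2 ** (max_exponent - precision_bits + 1)


DOUBLE_MAX = max_finite(53, 11)
EXTENDED_MAX = max_finite(64, 15)

print(DOUBLE_MAX == int(sys.float_info.max))   # True: matches Python's own double
print(DOUBLE_MAX < 10**1000 < EXTENDED_MAX)    # True: 1E1000 only fits in extended
print(10**4932 <= EXTENDED_MAX < 10**4933)     # True: extended tops out near 1E4932
```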

I have a friend who is working on that. He is writing a library to handle floating-point numbers that are gigabytes in size. Of course, it is related to scientific computing (plasma calculations), and probably only this kind of computing works with numbers this big...

Another advantage not yet mentioned for 80-bit types is that on 16-bit or 32-bit processors which don't have floating-point units but do have a "multiply" instruction which produces a result twice as long as the operands (16x16->32 or 32x32->64), arithmetic on a 64-bit mantissa subdivided into four or two 16-bit or 32-bit registers will be faster than arithmetic on a 53-bit mantissa which spans the same number of registers but has to share 12 register bits with the sign and exponent. For applications which don't need anything more precise than `float`, computations on a 48-bit "extended float" type could likewise be faster than computations on a 32-bit `float`.
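
A rough sketch of the register-alignment argument, in Python with integers standing in for registers. It assumes a hypothetical CPU whose only multiply is 32x32->64 and computes the upper 64 bits of the product of two 64-bit mantissas, which is the part a software floating-point multiply would keep before normalizing and rounding. `mul32x32` and `mul64_high` are illustrative names, not taken from any real soft-float library.

```python
MASK32 = 0xFFFFFFFF


def mul32x32(a: int, b: int) -> int:
    """Stand-in for a 32x32->64 hardware multiply."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    return a * b


def mul64_high(a: int, b: int) -> int:
    """Upper 64 bits of a 64x64 product, built from four 32x32->64 multiplies."""
    # A 64-bit mantissa splits cleanly into two 32-bit registers; no masking
    # of sign or exponent bits is needed before multiplying.
    a_hi, a_lo = a >> 32, a & MASK32
    b_hi, b_lo = b >> 32, b & MASK32

    lo_lo = mul32x32(a_lo, b_lo)
    hi_lo = mul32x32(a_hi, b_lo)
    lo_hi = mul32x32(a_lo, b_hi)
    hi_hi = mul32x32(a_hi, b_hi)

    # Sum the middle column (bits 32..63) to find the carry into the upper half.
    middle = (lo_lo >> 32) + (hi_lo & MASK32) + (lo_hi & MASK32)
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (middle >> 32)


a = 0xC90FDAA22168C235  # arbitrary 64-bit mantissas with the top bit set
b = 0xADF85458A2BB4A9A
print(mul64_high(a, b) == (a * b) >> 64)  # True
```

A double's 53-bit mantissa would need the same four multiplies on a 32-bit machine, plus extra work to separate it from the sign and exponent bits first.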

While some people might bemoan the double-rounding behavior of extended-precision types, that is realistically speaking only an issue in specialized applications requiring full bit-exact cross-platform reproducibility. From an *accuracy* standpoint, the difference between a rounding error of 64/128 ulp vs 65/128 ulp, or 1024/2048 ulp vs 1025/2048 ulp, is a non-issue; in languages with *extended-precision variable types* and *consistent extended-precision semantics*, use of extended types on many platforms without floating-point hardware (e.g. embedded systems) will offer both higher accuracy and better speed than single- or double-precision floating-point types.
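
For a concrete picture of double rounding, the sketch below emulates it in Python with exact fractions. `round_to_bits` is a hypothetical helper (ties to even, exponent limits ignored), and the input value is hand-picked to sit just above the midpoint between two adjacent doubles.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round x to `bits` significant bits (ties to even), ignoring exponent limits."""
    if x == 0:
        return x
    # Choose e so that 2**(bits-1) <= |x| / 2**e < 2**bits.
    e = abs(x).numerator.bit_length() - x.denominator.bit_length() - bits
    while abs(x) >= Fraction(2) ** (e + bits):
        e += 1
    while abs(x) < Fraction(2) ** (e + bits - 1):
        e -= 1
    return round(x / Fraction(2) ** e) * Fraction(2) ** e


# Slightly more than halfway between the doubles 1 and 1 + 2**-52.
x = 1 + Fraction(1, 2**53) + Fraction(1, 2**65)

rounded_once = round_to_bits(x, 53)                      # straight to double
rounded_twice = round_to_bits(round_to_bits(x, 64), 53)  # via extended, then double

ulp = Fraction(1, 2**52)  # spacing of doubles just above 1
print(float(rounded_once), abs(x - rounded_once) / ulp)    # rounds up, error just under 1/2 ulp
print(float(rounded_twice), abs(x - rounded_twice) / ulp)  # rounds down, error just over 1/2 ulp
```

The two results differ by one ulp, but the error of the double-rounded value exceeds half an ulp only by 1/8192 ulp here, which is within the 1025/2048 ulp figure quoted above.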