
## Floating point
In a floating-point number, it is the number of significant digits (the relative precision) that is limited, rather than the number of digits after the radix point (the absolute precision), as in fixed-point.
## Representation
A floating-point number a can be represented by two numbers m and e, such that a = m × bᵉ, given a number b (called the base of numeration, also the radix) and a precision p (how many digits to store).
m (which is called the significand or, informally, the mantissa) is a p-digit number of the form ±d.ddd...ddd (each digit being an integer between 0 and b−1 inclusive). If the leading digit of m is non-zero, the number is said to be normalized. Some descriptions use a separate sign bit (s, which represents −1 or +1) and require m to be positive.
e is called the exponent.

This scheme allows a large range of magnitudes to be represented within a given size of field, which is not possible in a fixed-point notation. As an example, a floating-point number with four decimal digits (b = 10, p = 4) and an exponent range of ±4 could be used to represent 43210, 4.321, or 0.0004321, but would not have enough precision to represent 432.123 and 43212.3 (which would have to be rounded to 432.1 and 43210). Of course, in practice, the number of digits is usually larger than four.

## Hidden bit

When using binary (b = 2), one bit can be saved if all numbers are required to be normalized. The leading digit of the significand of a normalized binary floating-point number is always non-zero, so it is always 1. This means that it does not need to be stored explicitly: for a normalized number it can be understood to be 1. The IEEE 754 standard exploits this fact. Requiring all numbers to be normalized means that 0 cannot be represented; typically some special representation of zero is chosen. In the IEEE standard this special code also encompasses denormal numbers, which allow for gradual underflow.

## Usage in computing

While the numbers in the examples above are represented in the decimal system (that is, the base of numeration is b = 10), computers usually use the binary system (b = 2). In computers, floating-point numbers are sized by the number of bits used to store them. This size is usually 32 bits or 64 bits, often called "single-precision" and "double-precision". A few machines offer larger sizes: Intel FPUs such as the Intel 8087 (and its descendants integrated into the x86 architecture) offer 80-bit floating-point numbers for intermediate results, and several systems offer 128-bit floating-point, generally implemented in software.

## Problems with floating-point

Floating-point numbers usually behave very similarly to the real numbers they are used to approximate.
However, this can easily lead programmers into over-confidently ignoring the need for numerical analysis. There are many cases where floating-point numbers do not model real numbers well, even in simple cases such as the decimal fraction 0.1, which cannot be represented exactly in any binary floating-point format. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/

Errors in floating-point computation can include:

- Rounding
  - Non-representable numbers: for example, the literal 0.1 cannot be represented exactly by a binary floating-point number
  - Rounding of arithmetic operations: for example, 2/3 might yield 0.6666667
- Absorption: 1×10¹⁵ + 1 = 1×10¹⁵ when the format holds fewer than 16 significant digits
- Cancellation: subtraction between nearly equal operands
- Overflow / Underflow
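
Several of the errors listed above can be reproduced directly. The following is a minimal sketch in Python, assuming that the `float` type is an IEEE 754 double, as it is on essentially all current platforms; the variable names are illustrative only. Because a double carries roughly 16 significant decimal digits, absorption is shown here with 1×10¹⁶ rather than 1×10¹⁵.

```python
from decimal import Decimal

# Non-representable number: Decimal(0.1) exposes the exact value of the
# binary double nearest to 0.1, which is not 0.1 itself.
stored_tenth = Decimal(0.1)
tenth_is_exact = stored_tenth == Decimal("0.1")

# Rounding of arithmetic operations: the rounded sum of the doubles for
# 0.1 and 0.2 is not the double nearest to 0.3.
sum_is_exact = (0.1 + 0.2) == 0.3

# Absorption: the small operand is lost entirely in the addition.
absorbed = (1e16 + 1.0) == 1e16

# Cancellation: subtracting nearly equal operands leaves mostly the
# rounding error that was introduced by the earlier addition.
difference = (1.0 + 1e-15) - 1.0
relative_error = abs(difference - 1e-15) / 1e-15

# Overflow and underflow: results outside the exponent range.
overflowed = 1e308 * 10.0
underflowed = 1e-308 * 1e-308

print("0.1 stored as:", stored_tenth)
print("0.1 exact?", tenth_is_exact, "| 0.1 + 0.2 == 0.3?", sum_is_exact)
print("1e16 + 1 absorbed?", absorbed)
print("cancellation relative error:", relative_error)
print("overflow:", overflowed, "| underflow:", underflowed)
```

The cancellation case is the one most easily overlooked: the subtraction itself is exact, but it promotes an earlier rounding error of about one part in 10¹⁶ to an error of roughly ten percent in the result.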
## IEEE standard

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM mainframes, which have both hexadecimal and IEEE 754 data types, and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format. The IEEE 754 standard is currently (2004) under revision. See: http://grouper.ieee.org/groups/754/

## Examples

- The value of pi, π = 3.1415926...₁₀ in decimal, is equivalent to binary 11.001001000011111...₂. When represented in a computer that allocates 17 bits for the significand, it becomes 0.11001001000011111 × 2². Hence the floating-point representation would start with the bits 011001001000011111 and end with the bits 10 (which represent the exponent 2 in the binary system). Note: the first zero indicates a positive number, and the ending 10₂ = 2₁₀.
- The value −0.375₁₀ = −0.011₂, or −0.11₂ × 2⁻¹. In two's complement notation, −1 is represented as 11111111 (assuming 8 bits are used for the exponent). In floating-point notation, the number will start with a 1 for the sign bit, followed by 110000... and then by 11111111 at the end, giving 1110...011111111 (where ... stands for zeros).
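
The decomposition used in these two examples can be checked mechanically. The sketch below relies on Python's `math.frexp`, which returns a fraction m with 0.5 ≤ |m| < 1 and an exponent e such that x = m × 2ᵉ, the same "0.1ddd... × 2ᵉ" form as above. The helper name `split_binary` and the digit counts are illustrative assumptions, and the fraction is truncated rather than rounded, as in the pi example.

```python
import math

def split_binary(x, digits=17):
    """Return (sign_bit, fraction_digits, exponent) such that
    |x| is approximately 0.<fraction_digits> (binary) * 2**exponent."""
    m, e = math.frexp(x)                     # x == m * 2**e, 0.5 <= |m| < 1
    sign_bit = 1 if m < 0 else 0
    scaled = int(abs(m) * (1 << digits))     # keep `digits` bits, truncating
    return sign_bit, format(scaled, "0{}b".format(digits)), e

pi_parts = split_binary(math.pi)             # 17-bit significand
neg_parts = split_binary(-0.375, digits=8)   # 8 bits is enough for 0.11

# Exponent -1 written as an 8-bit two's complement field.
neg_exponent_field = format(neg_parts[2] & 0xFF, "08b")

print(pi_parts)
print(neg_parts, neg_exponent_field)
```

Running it prints the sign bit, the significand digits and the exponent for each value, which can be compared against the bit strings given in the examples.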
Note that although the examples in this article use a consistent system of floating-point notation, that notation differs from the IEEE standard. For example, in IEEE 754 the exponent sits between the sign bit and the significand, not at the end of the number. The IEEE exponent is also stored as a biased integer instead of a two's complement number. The examples therefore illustrate how floating-point numbers could be represented, but the actual bits shown differ from those of an IEEE 754-compliant representation. The placement of the bits in the IEEE standard enables two floating-point numbers to be compared bitwise (sans sign bit) to yield a result without interpreting the actual values; the arbitrary system used in this article cannot do the same. The examples nonetheless highlight all the major components of a floating-point notation, and they show that a non-standard notation also works as long as it is consistent.
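
For comparison, the IEEE 754 layout of the second example can be inspected with Python's standard `struct` module. This is a sketch only: it assumes the 32-bit single-precision format (1 sign bit, 8 exponent bits with a bias of 127, 23 stored fraction bits), and the helper names are hypothetical.

```python
import struct

def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def float32_fields(x):
    """Split the pattern into (sign, biased_exponent, fraction)."""
    bits = float32_bits(x)
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF    # true exponent + 127
    fraction = bits & 0x7FFFFF               # leading 1 is hidden, not stored
    return sign, biased_exponent, fraction

# -0.375 = -1.1 (binary) * 2**-2, so the biased exponent is -2 + 127 = 125
# and only the ".1" after the hidden bit is stored in the fraction field.
sign, biased_exponent, fraction = float32_fields(-0.375)
print(sign, biased_exponent, format(fraction, "023b"))

# For positive values, ordering the raw bit patterns as integers gives the
# same order as comparing the floating-point values themselves.
values = [4.321, 0.0004321, 43210.0, 0.375]
same_order = sorted(values, key=float32_bits) == sorted(values)
print(same_order)
```

The last check illustrates the bitwise comparison property mentioned above: with the exponent placed ahead of the significand and stored in biased form, larger magnitudes always have larger bit patterns.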

copyright © 2004 FactsAbout.com