
# Double-precision floating-point format

Double-precision floating-point format is a computer number format that occupies 8 bytes (64 bits) in computer memory and represents a wide dynamic range of values by using a floating radix point.

Double-precision floating-point format usually refers to binary64, as specified by the IEEE 754 standard, not to the 64-bit decimal format decimal64. In older computers, different floating-point formats of 8 bytes were used; e.g., GW-BASIC's double-precision data type was the 64-bit MBF floating-point format.


The exponents 000₁₆ and 7FF₁₆ have a special meaning:

• 000₁₆ is used to represent a signed zero (if F = 0) and subnormals (if F ≠ 0); and
• 7FF₁₆ is used to represent ∞ (if F = 0) and NaNs (if F ≠ 0),

where F is the fractional part of the significand. All bit patterns are valid encodings.

Apart from the above exceptions, the entire double-precision number is described by:

(−1)^sign × 2^(e − 1023) × 1.fraction

In the case of subnormals (e = 0), the double-precision number is described by:

(−1)^sign × 2^(1 − 1023) × 0.fraction = (−1)^sign × 2^(−1022) × 0.fraction

### Endianness


Although the ubiquitous x86 processors of today use little-endian storage for all types of data (integer, floating point, BCD), there are a few historical machines where floating-point numbers are represented in big-endian form while integers are represented in little-endian form.[2] There are old ARM processors that have half little-endian, half big-endian floating-point representation for double-precision numbers: both 32-bit words are stored in little-endian like integer registers, but the most significant one first. Because there have been many floating-point formats with no "network" standard representation for them, the XDR standard uses big-endian IEEE 754 as its representation. It may therefore appear strange that the widespread IEEE 754 floating-point standard does not specify endianness.[3] Theoretically, this means that even standard IEEE floating-point data written by one machine might not be readable by another. However, on modern standard computers (i.e., implementing IEEE 754), one may in practice safely assume that the endianness is the same for floating-point numbers as for integers, making the conversion straightforward regardless of data type. Small embedded systems using special floating-point formats may be another matter, however.

### Double-precision examples

0 01111111111 0000000000000000000000000000000000000000000000000000₂ ≙ +2⁰ × 1 = 1
0 01111111111 0000000000000000000000000000000000000000000000000001₂ ≙ +2⁰ × (1 + 2⁻⁵²) ≈ 1.0000000000000002, the smallest number > 1
0 01111111111 0000000000000000000000000000000000000000000000000010₂ ≙ +2⁰ × (1 + 2⁻⁵¹) ≈ 1.0000000000000004
0 10000000000 0000000000000000000000000000000000000000000000000000₂ ≙ +2¹ × 1 = 2
1 10000000000 0000000000000000000000000000000000000000000000000000₂ ≙ −2¹ × 1 = −2

0 10000000000 1000000000000000000000000000000000000000000000000000₂ ≙ +2¹ × 1.1₂ = 11₂ = 3
0 10000000001 0000000000000000000000000000000000000000000000000000₂ ≙ +2² × 1 = 100₂ = 4
0 10000000001 0100000000000000000000000000000000000000000000000000₂ ≙ +2² × 1.01₂ = 101₂ = 5
0 10000000001 1000000000000000000000000000000000000000000000000000₂ ≙ +2² × 1.1₂ = 110₂ = 6
0 10000000011 0111000000000000000000000000000000000000000000000000₂ ≙ +2⁴ × 1.0111₂ = 10111₂ = 23

0 00000000000 0000000000000000000000000000000000000000000000000001₂ ≙ +2⁻¹⁰²² × 2⁻⁵² = 2⁻¹⁰⁷⁴ ≈ 4.9 × 10⁻³²⁴ (min. subnormal positive double)
0 00000000000 1111111111111111111111111111111111111111111111111111₂ ≙ +2⁻¹⁰²² × (1 − 2⁻⁵²) ≈ 2.2250738585072009 × 10⁻³⁰⁸ (max. subnormal double)
0 00000000001 0000000000000000000000000000000000000000000000000000₂ ≙ +2⁻¹⁰²² × 1 ≈ 2.2250738585072014 × 10⁻³⁰⁸ (min. normal positive double)
0 11111111110 1111111111111111111111111111111111111111111111111111₂ ≙ +2¹⁰²³ × (1 + (1 − 2⁻⁵²)) ≈ 1.7976931348623157 × 10³⁰⁸ (max. double)

0 00000000000 0000000000000000000000000000000000000000000000000000₂ ≙ +0
1 00000000000 0000000000000000000000000000000000000000000000000000₂ ≙ −0
0 11111111111 0000000000000000000000000000000000000000000000000000₂ ≙ +∞ (positive infinity)
1 11111111111 0000000000000000000000000000000000000000000000000000₂ ≙ −∞ (negative infinity)
0 11111111111 1000000000000000000000000000000000000000000000000000₂ ≙ NaN
0 11111111111 1111111111111111111111111111111111111111111111111111₂ ≙ NaN (an alternative encoding)

0 01111111101 0101010101010101010101010101010101010101010101010101₂ = 3FD5 5555 5555 5555₁₆ ≙ +2⁻² × (1 + 2⁻² + 2⁻⁴ + … + 2⁻⁵²) ≈ 1/3

By default, 1/3 rounds down instead of up as in single precision, because of the odd number of bits in the significand.

In more detail:

Given the hexadecimal representation 3FD5 5555 5555 5555₁₆:

Sign = 0
Exponent = 3FD₁₆ = 1021
Exponent bias = 1023 (constant value; see above)
Fraction = 5 5555 5555 5555₁₆

Value = 2^(Exponent − Exponent bias) × 1.Fraction (note that Fraction must not be converted to decimal here)
     = 2⁻² × (1 5555 5555 5555₁₆ × 2⁻⁵²)
     = 2⁻⁵⁴ × 1 5555 5555 5555₁₆
     = 0.333333333333333314829616256247390992939472198486328125
     ≈ 1/3

### Execution speed with double-precision arithmetic

Using double-precision floating-point variables and mathematical functions (e.g., sin, cos, atan2, log, exp and sqrt) is slower than working with their single-precision counterparts. One area of computing where this is a particular issue is parallel code running on GPUs. For example, when using NVIDIA's CUDA platform, on video cards designed for gaming, calculations with double precision take 3 to 24 times longer to complete than calculations using single precision.[4]

## Implementations

Doubles are implemented in many programming languages in different ways, such as the following. On processors with only dynamic precision, such as x86 without SSE2 (or when SSE2 is not used, for compatibility purposes) and with extended precision used by default, software may have difficulties fulfilling some requirements.

### C and C++

C and C++ offer a wide variety of arithmetic types. Double precision is not required by the standards (except by the optional annex F of C99, covering IEEE 754 arithmetic), but on most systems, the double type corresponds to double precision. However, on 32-bit x86 with extended precision by default, some compilers may not conform to the C standard and/or the arithmetic may suffer from double-rounding issues.[5]

### Common Lisp

Common Lisp provides the types SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT and LONG-FLOAT. Most implementations provide SINGLE-FLOATs and DOUBLE-FLOATs, with the other types as appropriate synonyms. Common Lisp provides exceptions for catching floating-point underflows and overflows, and the inexact floating-point exception, as per IEEE 754. No infinities and NaNs are described in the ANSI standard; however, several implementations do provide these as extensions.

### JavaScript

As specified by the ECMAScript standard, all arithmetic in JavaScript shall be done using double-precision floating-point arithmetic.[6]

### Lua

In Lua version 5.2[7] and earlier, all arithmetic is done using double-precision floating-point arithmetic. Also, automatic type conversions between doubles and strings are provided.

## See also

• IEEE 754, the IEEE standard for floating-point arithmetic

## Notes and references

1. ^ William Kahan (1 October 1997). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF).
2. ^ "Floating point formats".
3. ^ "pack – convert a list into a binary representation".
4. ^ http://www.tomshardware.com/reviews/geforce-gtx-titan-gk110-review,3438-3.html
5. ^ GCC Bug 323 – optimized code gives strange floating point results.
6. ^ ECMA-262 ECMAScript Language Specification (PDF) (5th ed.). Ecma International. p. 29, §8.5 The Number Type.
7. ^ http://www.lua.org/manual/5.2/manual.html
