Since Perl v5.22 added hexadecimal floating-point numbers, I investigated how floating point numbers are actually represented. These are specified in IEEE 754 and are something you’ve probably taken for granted. I can use Inline::C to play with these.
Perl stores floating point numbers as “doubles”, a particular sort of floating-point number. More correctly, it’s a double-precision floating-point number.
For a 64-bit floating-point number, I don’t get 64 bits to represent the number. I get 52 bits (not quite true, but true enough for now). The other 12 bits represent the sign and the exponent:
Bits | Total bits | Range | Use |
---|---|---|---|
63 | 1 | 0 to 1 | Sign bit |
62-52 | 11 | 0 to 2047 | Exponent |
51-0 | 52 | 0 to 251 | Mantissa |
And, I can’t (well, not me personally, but maybe someone else can) look at these bits and tell you want the number is. Before I show the rules to turn these bits into a number, I want to see the actual bits.
In my Intermediate Perl class, I briefly introduce the Inline::C module. This bit of magic turns C code, which I put right in my Perl source, into Perl functions. It does various type conversions too. For example, Perl stores its values in variables with lots of extra bookkeeping. I can pass a scalar variable to a C function without worrying about the Perl types:
use Inline C => <<'HERE'; int add ( int n, int m ) { return n + m; } HERE say "2 + 2 is ", add( 2, 2 );
I can use this to look at the representation of a double directly since C makes it very easy to play with its data types. I can create some C functions to look at the bits, even if my C is a bit rusty. The trick in each function is to extract the bits the represent each portion. C lets me look at a chunk of memory any way I like, so I treat a double as an array of unsigned chars on which I'll do bit operations:
use Inline C => <<'HERE'; unsigned short sign_bit ( double d ) { unsigned char *b = (unsigned char*) &d; unsigned int size = sizeof(double); unsigned short sign; sign = b[ size - 1 ] & 0b10000000 ? 1 : 0; return sign; } unsigned short exponent ( double d ) { unsigned char *b = (unsigned char*) &d; unsigned int size = sizeof(double); unsigned short high, low; high = ( b[ size - 1 ] & 0b01111111 ) << 4; low = ( b[ size - 2 ] & 0b11110000 ) >> 4; return high + low; } unsigned long fractional ( double d ) { unsigned char *b = (unsigned char*) &d; unsigned char size = sizeof(double); unsigned long value; unsigned char i; value = (unsigned long)( b[ size - 2 ] & 0b00001111 ) << 48; for( i = 3; i <= 8; i++ ) { value += (unsigned long)(b[ size - i ]) << ( 64 - 8 * i ); } return value; } HERE
With that, I can get the values back into Perl:
my $sign = sign_bit( $ARGV[0] ); my $exponent = exponent( $ARGV[0] ); my $fraction = fractional( $ARGV[0] ); say <<"HERE"; Sign: $sign Exponent: $exponent Fraction: $fraction HERE printf "%b %011b %052b\n", $sign, $exponent, $fraction;
Now I see the actual bit pattern separated into the interesting parts of the double format:
$ ./show_bits 137.035999 Sign: 0 Exponent: 1030 Fraction: 317925951010314 0 10000000110 0001001000010010011011100111010111111111011000001010
Now I have to connect this to the actual number. It's a bit tricky. I have to look at the exponent first.
The exponent is 2047
If all the bits in the exponent are set, then the value is not a number. The special not-a-number depends on the rest of the double. If the fractional portion has a bit set, the value is NaN
. If the factional portion has no set bits, then it's an infinity. With the sign bit set, it's -Inf
and Inf
otherwise. My program supports that:
$ ./one-shot.pl NaN Sign: 0 Exponent: 2047 Fraction: 2251799813685248 0 11111111111 1000000000000000000000000000000000000000000000000000 $ ./one-shot.pl Infinity Sign: 0 Exponent: 2047 Fraction: 0 0 11111111111 0000000000000000000000000000000000000000000000000000 $ ./one-shot.pl -Infinity Sign: 1 Exponent: 2047 Fraction: 0 1 11111111111 0000000000000000000000000000000000000000000000000000
Exponent is not 0 or 2047
The more common case is an exponent that is not 0
or 2047
. Ignore the 0
case for a moment.
In my 137.035999
, the exponent is 1030
. After subtracting the bias of 1023
, that's 7
.
To get the number, the fraction part is appended to 1.
. For example, for 137.035999
example, the fraction part is 0001001000010010011011100111010111111111011000001010
. In hex, that's 0x12126e75ff60a
. As a hexadecimal literal, the prefix and the factional portion is 0x1.12126e75ff60ap0
. Perl v5.22 understands this notation and I can turn it into a Perl number with an eval
:
use v5.22; my $F = eval '1x0." . sprintf( "%013X", $fraction ) . 'p0'; my $value = (-1)**$sign * 2 ** ($exponent-1023) * $F;
Notice there's some extra there. The double assumes a leading 1
, which I get for free. That's a bit more than the 52 bits I said I had to represent the number.
Exponent is 0
If the exponent is not 0, then the number is either "unnormalized" or 0. And, that zero can be -0
or 0
. That -0
might seem silly, but I'm used to it as a mathematical limit approached from the negative side of the number line. But also consider that there's a sign bit. As you add to a negative number, it approaches 0
but might not get close enough to cross 0
. It might not get to flip that sign bit. The two values -0
and 0
are equal, though.
To get the number, the fraction part is appended to 0x0.
. In the previous case, that was 0x1.
. This means that we get to use all the same bits again to represent even smaller numbers (starting around 2.225073858507202e-308
, or, as hex, 0x1.fffffffffffffp-1023
);
$ ./one-shot.pl 2.2250738585072e-308 Sign: 0 Exponent: 0 Fraction: 4503599627370495 0 00000000000 1111111111111111111111111111111111111111111111111111
For 2.225073858507202e-308
, the fraction part is 1111111111111111111111111111111111111111111111111111
. In hex, that's 0xfffffffffffff
. I can get the value as I did earlier, but with a different prefix:
use v5.22; my $F = eval "0x0." . sprintf( "%013X", $fraction ) . 'p0'; my $value = (-1)**$sign * 2**(-1022) * $F;
This accounts for underflow and some other technical details you can read about in the links at the end of this post. It also allows the double to represent different numbers with those 52 bits.
For what it's worth, I used Inline::C to find that smallest value. I initialized a double to 0
then set all the bits in the fractional section:
use v5.22; use Inline C => <<'HERE'; double unnormalized ( int a ) { double d = 0; unsigned char *b = (unsigned char*) &d; unsigned int size = sizeof(double); b[ size - 2 ] = 0b00001111; b[ size - 3 ] = 0b11111111; b[ size - 4 ] = 0b11111111; b[ size - 5 ] = 0b11111111; b[ size - 6 ] = 0b11111111; b[ size - 7 ] = 0b11111111; b[ size - 8 ] = 0b11111111; return d; } HERE my $value = unnormalized(1); say $value; printf "%a\n", $value;