Numbers on a computer

Author

Chris Paciorek

Published

October 13, 2022


References:

A quick note that, as we’ve already seen, R’s version of scientific notation is XeY, which means \(X\cdot10^{Y}\).

A second note is that the concepts developed here apply outside of R, but we’ll illustrate the principles of computer numbers using R. R makes use of the double and int types in C for the underlying representation of R’s numbers in C variables, so what we’ll really be seeing is how such types behave in C on most modern machines. The behavior of real-valued numbers in Python is essentially the same, but Python handles the integer type differently.

Videos (optional):

There are various videos from 2020 in the bCourses Media Gallery that you can use for reference if you want to.

1. Basic representations

Everything in computer memory or on disk is stored in terms of bits. A bit is essentially a switch that can be either on or off. Thus everything is encoded as numbers in base 2, i.e., 0s and 1s. 8 bits make up a byte. For information stored as plain text (ASCII), each byte is used to encode a single character (as previously discussed, only 7 of the 8 bits are actually used, hence there are \(2^{7}=128\) ASCII characters). One way to represent a byte is to write it in hexadecimal, rather than as 8 0/1 bits. Since there are \(2^{8}=256\) possible values in a byte, we can represent it more compactly as 2 base-16 digits, such as “3e” or “a0” or “ba”. A file format is nothing more than a way of interpreting the bytes in a file.
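For example, we can go back and forth between the decimal and hexadecimal representation of a byte using the base R functions as.hexmode and strtoi (a small illustration; the particular values are arbitrary):

as.hexmode(186)           # the byte 10111010 (decimal 186) written as two base-16 digits: "ba"
strtoi("ba", base = 16L)  # back to decimal: 186
strtoi("3e", base = 16L)  # 62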

Here we’ll use the bits function from pryr to look at the underlying binary representation. Note that ‘b’ is encoded as 1 more than ‘a’, and similarly for ‘0’, ‘1’, and ‘2’.

library(pryr)
bits('a')
[1] "01100001"
bits('b')
[1] "01100010"
bits('0')
[1] "00110000"
bits('1')
[1] "00110001"
bits('2')
[1] "00110010"
bits('@')
[1] "01000000"

We can think about how we’d store an integer in terms of bytes. With two bytes (16 bits), we could encode any value from \(0,\ldots,2^{16}-1=65535\). This is an unsigned integer representation. To store negative numbers as well, we can use one bit for the sign, giving us the ability to encode \(-32767\) to \(32767\) (\(\pm(2^{15}-1)\)).

R actually uses 4 bytes per integer, so it can encode \(-2147483647\) to \(2147483647\) (\(\pm(2^{31}-1)\)). Note that in general, rather than being stored simply as a sign bit and a number in base 2, integers (at least the negative ones) are actually stored in a different binary encoding (two’s complement) to facilitate arithmetic. Here we use the “L” to force R to store the number as an integer. More on that later in the Unit.

library(pryr)
bits(0L)
[1] "00000000 00000000 00000000 00000000"
bytes(0L)
[1] "00 00 00 00"
bits(1L)
[1] "00000000 00000000 00000000 00000001"
bytes(1L)
[1] "00 00 00 01"
bits(2L)
[1] "00000000 00000000 00000000 00000010"
bytes(2L)
[1] "00 00 00 02"
bits(-1L)
[1] "11111111 11111111 11111111 11111111"
bytes(-1L)
[1] "FF FF FF FF"

What do I mean about facilitating arithmetic? As an example, consider adding the binary representations of -1 and 1 above. Nice, right?
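To make that concrete, here’s a quick check using the same bits function: in two’s complement, ordinary binary addition handles the signs automatically.

bits(-1L)       # 11111111 11111111 11111111 11111111
bits(1L)        # 00000000 00000000 00000000 00000001
bits(-1L + 1L)  # all zeros: the carry off the leftmost bit is simply discarded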

Finally note that the set of computer integers is not closed under arithmetic, with R reporting an overflow (i.e., a result that is too large to be stored as an integer using 4 bytes):

a <- as.integer(3423333)  # 3423333L
a * a
Warning in a * a: NAs produced by integer overflow
[1] NA

Real numbers (or floating points) use a minimum of 4 bytes, which gives single precision. (GPU calculations often use single precision.) In general (including in R), 8 bytes are used to represent real numbers on a computer, and these are called double precision floating points or doubles.

Let’s see how this plays out in terms of memory use in R.

doubleVec <- rnorm(100000)
intVec <- 1:100000
object.size(doubleVec)
800048 bytes
object.size(intVec) # so how many bytes per integer in R?
400048 bytes

We can easily calculate the number of megabytes (MB) a vector of floating points (in double precision) will use as the number of elements times 8 (bytes per double) divided by \(10^{6}\) to convert from bytes to megabytes. (In some contexts a megabyte of computer memory is \(1,048,576=2^{20}=1024^{2}\) bytes, formally called a mebibyte, so slightly different than \(10^{6}\).) Finally, R has a special object, .Machine, that tells us about the characteristics of computer numbers on the machine that R is running on. For example, .Machine$integer.max is \(2147483647=2^{31}-1\), which confirms how many bytes R is using for each integer (and that R is using a bit for the sign of the integer). Since we have both negative and positive numbers, we have \(2\cdot2^{31}=2^{32}=(2^{8})^{4}\) values, i.e., 4 bytes, with each byte having 8 bits.
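For example, here is that arithmetic for a vector of one million doubles, along with a quick check of the integer limit (the numbers are just for illustration):

n <- 1e6
n * 8 / 1e6                        # memory used by n doubles, in MB (here 8 MB)
.Machine$integer.max == 2^31 - 1   # TRUE: 31 bits for the magnitude plus one sign bit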

bits(.Machine$integer.max)
[1] "01111111 11111111 11111111 11111111"
bits(-.Machine$integer.max)
[1] "10000000 00000000 00000000 00000001"
bits(-1L)
[1] "11111111 11111111 11111111 11111111"

2. Floating point basics

Representing real numbers

Reals (also called floating points) are stored on the computer as an approximation, albeit a very precise approximation. As an example, if we represent the distance from the earth to the sun using a double, the error is around a millimeter. However, we need to be very careful if we’re trying to do a calculation that produces a very small (or very large) number, and particularly when we want to see if numbers are equal to each other.

If you run the code here, the results may surprise you.

0.3 - 0.2 == 0.1
0.3
0.2
0.1 # Hmmm...

0.75 - 0.5 == 0.25
0.6 - 0.4 == 0.2
## any ideas what is different about those two comparisons?

Next, let’s consider the number of digits of accuracy we have for a variety of numbers.

a <- 0.3
b <- 0.2
formatC(b, 20, format = 'f')
[1] "0.20000000000000001110"
formatC(a, 20, format = 'f')
[1] "0.29999999999999998890"
formatC(a - b, 20, format = 'f')
[1] "0.09999999999999997780"
formatC(0.1, 20, format = 'f')
[1] "0.10000000000000000555"
formatC(1/3, 20, format = 'f')
[1] "0.33333333333333331483"

So empirically, it looks like we’re accurate up to the 16th decimal place.

But actually, the key is the number of digits, not decimal places.

formatC(1234.1234, 20, format = 'f')
[1] "1234.12339999999994688551"
formatC(1234.123412341234, 20, format = 'f')
[1] "1234.12341234123391586763"

Let’s return to our comparison, 0.75-0.5 == 0.25.

formatC(0.75, 20, format = 'f')
[1] "0.75000000000000000000"
formatC(0.50, 20, format = 'f')
[1] "0.50000000000000000000"

For our future explorations, let’s define a wrapper function for convenience:

dg <- function(x, digits = 20) formatC(x, digits, format = 'f')

## alternative to formatC:
sprintf("%0.20f", a)
[1] "0.29999999999999998890"

Notice that we can represent the result accurately only up to 16 significant digits. This suggests there is no need to show more than 16 significant digits, and no need to print out any more than that when writing to a file (except that if the number is bigger than \(10^{16}\), we need extra digits to correctly show the magnitude of the number when not using scientific notation). And of course, often we don’t need anywhere near that many.

Machine epsilon is the term used for indicating the (relative) accuracy of real numbers and it is defined as the smallest float, \(x\), such that \(1+x\ne1\):

dg(1e-16 + 1)
[1] "1.00000000000000000000"
dg(1e-15 + 1)
[1] "1.00000000000000111022"
dg(2e-16 + 1)
[1] "1.00000000000000022204"
dg(.Machine$double.eps)
[1] "0.00000000000000022204"
dg(.Machine$double.eps + 1)
[1] "1.00000000000000022204"

Floating point representation

Floating point refers to the decimal point (or radix point since we’ll be working with base 2 and decimal relates to 10). Consider Avogadro’s number in terms of scientific notation: \(+6.023\times10^{23}\). As a baseline for what is about to follow note that we can express a decimal number in the following expansion

\[6.03=6\times10^{0}+0\times10^{-1}+3\times10^{-2}\]

A real number on a computer is stored in what is basically scientific notation:

\[\pm d_{0}.d_{1}d_{2}\ldots d_{p}\times b^{e}\]

where \(b\) is the base, \(e\) is an integer and \(d_{i}\in\{0,\ldots,b-1\}\). \(e\) is called the exponent and \(d=d_{1}d_{2}\ldots d_{p}\) is called the mantissa.

Let’s consider the choices that the computer pioneers needed to make in using this system to represent numbers on a computer using base 2. First, we need to choose the number of bits to represent \(e\) so that we can represent sufficiently large and small numbers. Second, we need to choose the number of bits, \(p\), to allocate to \(d=d_{1}d_{2}\ldots d_{p}\), which determines the accuracy of any computer representation of a real.

The great thing about floating points is that we can represent numbers that range from incredibly small to very large while maintaining good precision. The floating point floats to adjust to the size of the number. Suppose we had only three digits to use and were in base 10. In floating point notation we can express \(0.12\times0.12=0.0144\) as \((1.20\times10^{-1})\times(1.20\times10^{-1})=1.44\times10^{-2}\), but if we had fixed the decimal point, we’d have \(0.120\times0.120=0.014\) and we’d have lost a digit of accuracy. (Furthermore, we wouldn’t be able to represent numbers bigger than \(0.999\).)

More specifically, the actual storage of a number on a computer these days is generally as a double in the form:

\[(-1)^{S}\times1.d\times2^{e-1023}=(-1)^{S}\times1.d_{1}d_{2}\ldots d_{52}\times2^{e-1023}\]

where the computer uses base 2, \(b=2\) (so \(d_{i}\in\{0,1\}\)), because base-2 arithmetic is faster than base-10 arithmetic. The leading 1 normalizes the number; i.e., it ensures there is a unique representation for a given computer number. This avoids representing any number in multiple ways, e.g., \(1=1.0\times2^{0}=0.1\times2^{1}=0.01\times2^{2}\). For a double, we have 8 bytes = 64 bits. Consider our representation as (\(S,d,e\)) where \(S\) is the sign. The leading 1 is the hidden bit and doesn’t need to be stored because it is always present. In general \(e\) is represented using 11 bits (\(2^{11}=2048\)), and the subtraction of 1023 takes the place of having a sign bit for the exponent. (Note that in our discussion we’ll just think of \(e\) in terms of its base 10 representation, although it is of course represented in base 2.) This leaves \(p=64-1-11=52\) bits for \(d\).

bits(2^(-1)) # 1/2
[1] "00111111 11100000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(2^0)  # 1
[1] "00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(2^1)  # 2
[1] "01000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(2^1 + 2^0)  # 3
[1] "01000000 00001000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(2^2)  # 4
[1] "01000000 00010000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(-2)
[1] "11000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000"

Question: Given a fixed number of bits for a number, what is the tradeoff between using bits for the \(d\) part vs. bits for the \(e\) part?

Let’s consider what can be represented exactly:

dg(.1)
[1] "0.10000000000000000555"
dg(.5)
[1] "0.50000000000000000000"
dg(.25)
[1] "0.25000000000000000000"
dg(.26)
[1] "0.26000000000000000888"
dg(1/32)
[1] "0.03125000000000000000"
dg(1/33)
[1] "0.03030303030303030387"

So why is 0.5 stored exactly and 0.1 not stored exactly? By analogy, consider the difficulty with representing 1/3 in base 10.
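We can see the difference in the bit patterns: 0.5 is a power of 2, while 0.1 has an infinitely repeating expansion in base 2 that has to be cut off after 52 bits.

bits(0.5)   # the mantissa is all zeros: 0.5 = 1.0 x 2^(-1) is stored exactly
bits(0.1)   # 0.1 = 1.10011001..._2 x 2^(-4): note the repeating pattern in the
            # mantissa, which has to be rounded off after 52 bits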

Overflow and underflow

The largest and smallest numbers we can represent are \(2^{e_{\max}}\) and \(2^{e_{\min}}\) where \(e_{\max}\) and \(e_{\min}\) are the largest and smallest possible values of the exponent. Let’s consider the exponent and what we can infer about the range of possible numbers. With 11 bits for \(e\), we can represent exponents ranging over roughly \(\pm2^{10}=\pm1024\) (see .Machine$double.max.exp; why is .Machine$double.min.exp only -1022?). So the largest number we could represent is \(2^{1024}\). What is this in base 10?

log10(2^1024) # whoops ... we've actually just barely overflowed
[1] Inf
log10(2^1023)
[1] 307.9537
.Machine$double.xmax
[1] 1.797693e+308
.Machine$double.xmin
[1] 2.225074e-308

We could have been smarter about that calculation: \(\log_{10}2^{1024}=\log_{2}2^{1024}/\log_{2}10=1024/3.32\approx308\). The result is analogous for the smallest number, so floating points can range between roughly \(1\times10^{-308}\) and \(1\times10^{308}\). Take a look at .Machine$double.xmax and .Machine$double.xmin. Producing something larger or smaller in magnitude than these values is called overflow and underflow, respectively. When we overflow, R gives back an Inf or -Inf (and in other cases we might get an error message). When we underflow, we get back 0, which in particular can be a problem if we try to divide by the value.
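That smarter calculation avoids ever forming \(2^{1024}\):

1024 * log10(2)   # about 308.25, computed without overflowing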

Integers or floats?

Values stored as integers should overflow if they exceed .Machine$integer.max.

Should \(2^{45}\) overflow?

x <- 2^45
z <- 25
class(x)
[1] "numeric"
class(z)
[1] "numeric"
as.integer(x)
Warning: NAs introduced by coercion to integer range
[1] NA
as.integer(z)
[1] 25
1e308
[1] 1e+308
1e309
[1] Inf
2^31
[1] 2147483648
x <- 2147483647L
x
[1] 2147483647
class(x)
[1] "integer"
x <- 2147483648L
class(x)
[1] "numeric"

In R, numbers are generally stored as doubles. We’ve basically already seen why: consider the maximum integer when using 4 bytes versus the maximum floating point value. Representing integers as floats isn’t generally a problem, in part because integers will be stored exactly in base two provided the absolute value is less than \(2^{53}\).

Challenge: Why \(2^{53}\)? Write out what integers can be stored exactly in our base 2 representation of floating point numbers.

However, you can force storage as integers in a few ways: values generated based on seq(), based on the : operator, specified with an “L”, or explicitly coerced:

x <- 3; typeof(x)
[1] "double"
x <- as.integer(3); typeof(x)
[1] "integer"
x <- 3L; typeof(x)
[1] "integer"
x <- 3:5; typeof(x)
[1] "integer"

Precision

Consider our representation as (\(S, d, e\)) where we have \(p=52\) bits for \(d\). Since we have \(2^{52}\approx0.5\times10^{16}\) possible values for the mantissa, we can accurately represent about 16 digits (in base 10). The result is that floats on a computer are actually discrete (we have a finite number of bits), and if we get a number that is in one of the gaps (there are uncountably many reals), it’s approximated by the nearest discrete value. The accuracy of our representation is to within 1/2 of the gap between the two discrete values bracketing the true number. Let’s consider the implications for accuracy in working with large and small numbers. By changing \(e\) we can change the magnitude of a number. So regardless of whether we have a very large or small number, we have about 16 digits of accuracy, since the absolute spacing depends on what value is represented by the least significant digit (the ulp, or unit in the last place) in \(d\), i.e., the \(p=52\)nd bit, or in terms of base 10, the 16th digit. Let’s explore this:

# large vs. small numbers
dg(.1234123412341234)
[1] "0.12341234123412339607"
dg(1234.1234123412341234) # not accurate to 16 decimal places 
[1] "1234.12341234123414324131"
dg(123412341234.123412341234) # only accurate to 4 places 
[1] "123412341234.12341308593750000000"
dg(1234123412341234.123412341234) # no places! 
[1] "1234123412341234.00000000000000000000"
dg(12341234123412341234) # fewer than no places! 
[1] "12341234123412340736.00000000000000000000"

We can see the implications of this in the context of calculations:

dg(1234567812345678 - 1234567812345677)
[1] "1.00000000000000000000"
dg(12345678123456788888 - 12345678123456788887)
[1] "0.00000000000000000000"
dg(12345678123456780000 - 12345678123456770000)
[1] "10240.00000000000000000000"

The spacing of possible computer numbers that have a magnitude of about 1 leads us to another definition of machine epsilon (an alternative, but essentially equivalent definition to that given previously in this Unit). Machine epsilon tells us also about the relative spacing of numbers. First let’s consider numbers of magnitude one. The difference between \(1=1.00...00\times2^{0}\) and \(1.000...01\times2^{0}\) is \(\epsilon=1\times2^{-52}\approx2.2\times10^{-16}\). Machine epsilon gives the absolute spacing for numbers near 1 and the relative spacing for numbers with a different order of magnitude and therefore a different absolute magnitude of the error in representing a real. The relative spacing at \(x\) is \[\frac{(1+\epsilon)x-x}{x}=\epsilon\] since the next largest number from \(x\) is given by \((1+\epsilon)x\).

Suppose \(x=1\times10^{6}\). Then the absolute error in representing a number of this magnitude is \(x\epsilon\approx2\times10^{-10}\). (Actually the error would be one-half of the spacing, but that’s a minor distinction.) We can see this by looking at the number in decimal form, where we are accurate to the order of \(10^{-10}\) but not \(10^{-11}\). This is equivalent to our discussion that we have only 16 digits of accuracy.

dg(1000000.1)
[1] "1000000.09999999997671693563"

Let’s see what arithmetic we can do exactly with integer-valued numbers stored as doubles and how that relates to the absolute spacing of numbers we’ve just seen:

dgi <- function(x) formatC(x, digits = 20, format = 'g')

dgi(2^52)
[1] "     4503599627370496"
dgi(2^52+1)
[1] "     4503599627370497"
dgi(2^53)
[1] "     9007199254740992"
dgi(2^53+1)
[1] "     9007199254740992"
dgi(2^53+2)
[1] "     9007199254740994"
dgi(2^54)
[1] "    18014398509481984"
dgi(2^54+2)
[1] "    18014398509481984"
dgi(2^54+4)
[1] "    18014398509481988"
bits(2^53)
[1] "01000011 01000000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(2^53+1)
[1] "01000011 01000000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(2^53+2)
[1] "01000011 01000000 00000000 00000000 00000000 00000000 00000000 00000001"
bits(2^54)
[1] "01000011 01010000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(2^54+2)
[1] "01000011 01010000 00000000 00000000 00000000 00000000 00000000 00000000"
bits(2^54+4)
[1] "01000011 01010000 00000000 00000000 00000000 00000000 00000000 00000001"

The absolute spacing is \(x\epsilon\), so we have spacings of \(2^{52}\times2^{-52}=1\), \(2^{53}\times2^{-52}=2\), \(2^{54}\times2^{-52}=4\) for numbers of magnitude \(2^{52}\), \(2^{53}\), and \(2^{54}\), respectively.

With a bit more work (e.g., using Mathematica), one can demonstrate that doubles in R are in general represented as the nearest number that can be stored with the 64-bit structure we have discussed and that the spacing is as we have discussed. The results below show the spacing that results, in base 10, for numbers around 0.1. The numbers R reports are spaced in increments of individual bits in the base 2 representation.

dg(0.1234567812345678)
[1] "0.12345678123456779729"
dg(0.12345678123456781)
[1] "0.12345678123456781117"
dg(0.12345678123456782)
[1] "0.12345678123456782505"
dg(0.12345678123456783)
[1] "0.12345678123456782505"
dg(0.12345678123456784)
[1] "0.12345678123456783892"
bits(0.1234567812345678)
[1] "00111111 10111111 10011010 11011101 00010101 11011111 00110100 10000110"
bits(0.12345678123456781)
[1] "00111111 10111111 10011010 11011101 00010101 11011111 00110100 10000111"
bits(0.12345678123456782)
[1] "00111111 10111111 10011010 11011101 00010101 11011111 00110100 10001000"
bits(0.12345678123456783)
[1] "00111111 10111111 10011010 11011101 00010101 11011111 00110100 10001000"
bits(0.12345678123456784)
[1] "00111111 10111111 10011010 11011101 00010101 11011111 00110100 10001001"

Working with higher precision numbers

The Rmpfr package allows us to work with numbers in higher precision. (This code is not working with knitr, so I’m just showing the code here, not the output.)

require(Rmpfr)
piLong <- Const("pi", prec = 260) # pi "computed" to correct 260-bit precision 
piLong # nicely prints 80 digits 
mpfr(".1234567812345678", 40)
mpfr(".1234567812345678", 80)
mpfr(".1234567812345678", 600)

In contrast to R, Python has arbitrary precision integers. So, e.g., pow(3423333, 15) returns an integer. But floating points are handled in similar fashion to R.

3. Implications for calculations and comparisons

Computer arithmetic is not mathematical arithmetic!

As mentioned for integers, computer number arithmetic is not closed, unlike real arithmetic. For example, if we multiply two computer floating points, we can overflow and not get back another computer floating point. One term that is used, which might pop up in an error message (though probably not in R) is that an “exception” is “thrown”.

Another mathematical concept we should consider here is that computer arithmetic does not obey the associative and distributive laws, i.e., \((a+b)+c\) may not equal \(a+(b+c)\) on a computer and \(a(b+c)\) may not be the same as \(ab+ac\). Here’s an example with multiplication:

val1 <- 1/10; val2 <- 0.31; val3 <- 0.57
res1 <- val1*val2*val3
res2 <- val3*val2*val1
identical(res1, res2)
[1] FALSE
dg(res1)
[1] "0.01766999999999999821"
dg(res2)
[1] "0.01767000000000000168"

Calculating with integers vs. floating points

It’s important to note that operations with integers are fast and exact (but can easily overflow) while operations with floating points are slower and approximate. Because of this slowness, floating point operations (flops) dominate calculation intensity and are used as the metric for the amount of work being done: a multiplication (or division) combined with an addition (or subtraction) counts as one flop. We’ll talk a lot about flops in the unit on linear algebra.

Comparisons

As we saw, we should never test a==b unless (1) a and b are represented as integers in R, (2) they are integer-valued but stored as doubles that are small enough that they can be stored exactly, or (3) they are decimal numbers that have been created in the same way (e.g., 0.4-0.3==0.4-0.3 returns TRUE but 0.1==0.4-0.3 does not). Similarly, we should be careful about testing a==0. And be careful of greater than/less than comparisons. For example, be careful of x[ x < 0 ] <- NA if what you are looking for is values that might be mathematically less than zero, rather than whatever is numerically less than zero.

4L - 3L == 1L
[1] TRUE
4.0 - 3.0 == 1.0
[1] TRUE
4.1 - 3.1 == 1.0
[1] FALSE

One nice approach to checking for approximate equality is to make use of machine epsilon. If the relative spacing of two numbers is less than machine epsilon, then for our computer approximation, we say they are the same. Here’s an implementation that relies on the absolute spacing being \(x\epsilon\) (see above).

a = 12345678123456781000
b = 12345678123456782000

approxEqual = function(a, b){
  if(abs(a - b) < .Machine$double.eps * abs(a + b))
    print("approximately equal") else print ("not equal")
}

approxEqual(a,b)
[1] "approximately equal"
a = 1234567812345678
b = 1234567812345677

approxEqual(a,b)   
[1] "not equal"

Actually, we probably want to use a number slightly larger than .Machine$double.eps to be safe. You can also take a look at the R function all.equal.numeric().
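For example, all.equal() compares using a relative tolerance (by default about 1.5e-8), so it behaves sensibly where == does not:

0.1 == 0.4 - 0.3                    # FALSE, as we saw
isTRUE(all.equal(0.1, 0.4 - 0.3))   # TRUE: the relative difference is far below the tolerance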

Finally, in computing, we often encounter the use of an unusual integer as a symbol for missing values. E.g., a datafile might store missing values as -9999. Testing for this using == in R should generally be ok: x[ x == -9999 ] <- NA, both because integers of this magnitude are stored exactly as doubles and because the -9999 values would presumably all have been created in the same way. To be really careful, you can read the data in as character type and do the assessment before converting to numeric.

Calculations

Given the limited precision of computer numbers, we need to be careful in the following two situations.

  1. Subtracting large numbers that are nearly equal (or adding negative and positive numbers of the same magnitude). You won’t have the precision in the answer that you would like. How many decimal places of accuracy do we have here?

    # catastrophic cancellation w/ large numbers
    dg(123456781234.56 - 123456781234.00)
    [1] "0.55999755859375000000"

    The absolute error in the original numbers here is of the order \(\epsilon x=2.2\times10^{-16}\cdot1\times10^{11}\approx1\times10^{-5}=.00001\). While we might think that the result is close to the value 1 and should have error of about machine epsilon, the relevant absolute error is in the original numbers, so we actually only have about five significant digits in our result because we cancel out the other digits.

    This is called catastrophic cancellation, because most of the digits that are left represent rounding error – many of the significant digits have cancelled with each other.
    Here’s catastrophic cancellation with small numbers. The right answer here is exactly 0.000000000000000000001234.

    # catastrophic cancellation w/ small numbers
    a = .000000000000123412341234
    b = .000000000000123412340000
    
    # so we know the right answer is .000000000000000000001234 EXACTLY  
    
    dg(a-b, 35)
    [1] "0.00000000000000000000123399999315140"
    ## [1] "0.00000000000000000000123399999315140"

    But the result is accurate only to 8 places + 20 = 28 decimal places, as expected from a machine precision-based calculation, since the “1” is in the 13th position, after 12 zeroes (12+16=28). Ideally, we would have accuracy to 36 places (16 digits + the 20 zeroes), but we’ve lost 8 digits to catastrophic cancellation.

    It’s best to do any subtraction on numbers that are not too large. For example, if we compute the sum of squares in a naive way, we can lose all of the information in the calculation because the information is in digits that are not computed or stored accurately:

    \[s^{2}=\sum x_{i}^{2}-n\bar{x}^{2}\]

    ## No problem here:
    x <- c(-1, 0, 1)
    n <- length(x)
    sum(x^2)-n*mean(x)^2 
    [1] 2
    sum((x - mean(x))^2)
    [1] 2
    ## Adding/subtracting a constant shouldn't change the result:
    x <- x + 1e8
    sum(x^2)-n*mean(x)^2  # the result of this is not good!
    [1] 0
    sum((x - mean(x))^2)
    [1] 2

    A good principle to take away is to subtract off a number similar in magnitude to the values (in this case \(\bar{x}\) is obviously ideal) and adjust your calculation accordingly. In general, you can sometimes rearrange your calculation to avoid catastrophic cancellation. Another example involves the quadratic formula for finding a root (p. 101 of Gentle).

  2. Adding or subtracting numbers that are very different in magnitude. The precision will be that of the large magnitude number, since we can only represent that number to a certain absolute accuracy, which is much less than the absolute accuracy of the smaller number:

    dg(123456781234.2)
    [1] "123456781234.19999694824218750000"
    dg(123456781234.2 - 0.1)        # truth: 123456781234.1
    [1] "123456781234.09999084472656250000"
    dg(123456781234.2 - 0.01)       # truth: 123456781234.19
    [1] "123456781234.19000244140625000000"
    dg(123456781234.2 - 0.001)      # truth: 123456781234.199
    [1] "123456781234.19898986816406250000"
    dg(123456781234.2 - 0.0001)     # truth: 123456781234.1999
    [1] "123456781234.19989013671875000000"
    dg(123456781234.2 - 0.00001)    # truth: 123456781234.19999
    [1] "123456781234.19998168945312500000"
    dg(123456781234.2 - 0.000001)   # truth: 123456781234.199999
    [1] "123456781234.19999694824218750000"
    123456781234.2 - 0.000001 == 123456781234.2
    [1] TRUE

    The larger number in the calculations above is of magnitude \(10^{11}\), so the absolute error in representing the larger number is around \(1\times10^{-5}\). Thus in the calculations above we can only expect the answers to be accurate to about \(1\times10^{-5}\). In the last calculation above, the smaller number is smaller than \(1\times10^{-5}\) and so doing the subtraction has had no effect. This is analogous to trying to do \(1+1\times10^{-16}\) and seeing that the result is still 1.
    A work-around when we are adding numbers of very different magnitudes is to add a set of numbers in increasing order. However, if the numbers are all of similar magnitude, then by the time you add ones later in the summation, the partial sum will be much larger than the new term. A (second) work-around to that problem is to add the numbers in a tree-like fashion, so that each addition involves a summation of numbers of similar size.
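Here’s a small illustration of why the order of summation matters (the numbers are chosen just to make the effect obvious):

big <- 1e16                           # the spacing of doubles near 1e16 is 2
ones <- rep(1, 1000)
total <- big
for (o in ones) total <- total + o    # each 1 is lost when added to the much larger total
total - big                           # 0
(big + sum(ones)) - big               # 1000: summing the small terms first preserves them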

Given the limited range of computer numbers, be careful when you are:

  • Multiplying or dividing many numbers, particularly large or small ones. Never take the product of many large or small numbers as this can cause over- or under-flow. Rather compute on the log scale and only at the end of your computations should you exponentiate. E.g.,

    \[\prod_{i}x_{i}/\prod_{j}y_{j}=\exp(\sum_{i}\log x_{i}-\sum_{j}\log y_{j})\]
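For instance (with arbitrary simulated numbers):

set.seed(1)
x <- runif(1000)   # 1000 values in (0,1)
prod(x)            # underflows to 0
sum(log(x))        # the log of the product (around -1000) is perfectly manageable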

Let’s consider some challenges that illustrate that last concern.

  • Challenge: consider multiclass logistic regression, where you have quantities like this:

    \[p_{j}=\text{Prob}(y=j)=\frac{\exp(x\beta_{j})}{\sum_{k=1}^{K}\exp(x\beta_{k})}=\frac{\exp(z_{j})}{\sum_{k=1}^{K}\exp(z_{k})}\]

    for \(z_{k}=x\beta_{k}\). What will happen if the \(z\) values are very large in magnitude (either positive or negative)? How can we reexpress the equation so as to be able to do the calculation? Hint: think about multiplying by \(\frac{c}{c}\) for a carefully chosen \(c\).

  • Second challenge: The same issue arises in the following calculation. Suppose I want to calculate a predictive density (e.g., in a model comparison in a Bayesian context): \[\begin{aligned} f(y^{*}|y,x) &= \int f(y^{*}|y,x,\theta)\pi(\theta|y,x)d\theta\\ &\approx \frac{1}{m}\sum_{j=1}^{m}\prod_{i=1}^{n}f(y_{i}^{*}|x,\theta_{j})\\ &= \frac{1}{m}\sum_{j=1}^{m}\exp\sum_{i=1}^{n}\log f(y_{i}^{*}|x,\theta_{j})\\ &\equiv \frac{1}{m}\sum_{j=1}^{m}\exp(v_{j})\end{aligned}\] First, why do I use the log conditional predictive density? Second, let’s work with an estimate of the unconditional predictive density on the log scale, \(\log f(y^{*}|y,x)\approx\log\frac{1}{m}\sum_{j=1}^{m}\exp(v_{j})\). Now note that \(e^{v_{j}}\) may be quite small as \(v_{j}\) is the sum of log likelihoods. So what happens if we have terms like \(e^{-1000}\)? So we can’t exponentiate each individual \(v_{j}\). This is what is known as the “log sum of exponentials” problem (and the solution as the “log-sum-exp trick”). Thoughts?

Numerical issues come up frequently in linear algebra. For example, they come up in working with positive definite and positive semi-definite matrices, such as covariance matrices. You can easily get negative numerical eigenvalues even if all the eigenvalues are positive or non-negative. Here’s an example where we use a squared exponential correlation as a function of time (or distance in 1-d), which is mathematically positive definite (i.e., all the eigenvalues are positive) but not numerically positive definite:

xs <- 1:100
dists <- fields::rdist(xs)
corMat <- exp(- (dists/10)^2) # this is a p.d. matrix (mathematically)
dg(eigen(corMat)$values[80:100])  # but not numerically
 [1] "-0.00000000000000005935" "-0.00000000000000006021"
 [3] "-0.00000000000000006401" "-0.00000000000000006473"
 [5] "-0.00000000000000007597" "-0.00000000000000008904"
 [7] "-0.00000000000000009388" "-0.00000000000000009581"
 [9] "-0.00000000000000011133" "-0.00000000000000011487"
[11] "-0.00000000000000012511" "-0.00000000000000013184"
[13] "-0.00000000000000014807" "-0.00000000000000014981"
[15] "-0.00000000000000016497" "-0.00000000000000016666"
[17] "-0.00000000000000016699" "-0.00000000000000018564"
[19] "-0.00000000000000018859" "-0.00000000000000021561"
[21] "-0.00000000000000073151"

Final note

How the computer actually does arithmetic with the floating point representation in base 2 gets pretty complicated, and we won’t go into the details. These rules of thumb should be enough for our practical purposes. Monahan and the URL reference have many of the gory details.