Numerical linear algebra
Overview
References:
- Gentle: Numerical Linear Algebra for Applications in Statistics (available via UC Library Search) (my notes here are based primarily on this source) [Gentle-NLA]
- Gentle: Matrix Algebra also has much of this material.
- Gentle: Computational Statistics [Gentle-CS]
- Lange: Numerical Analysis for Statisticians
- Monahan: Numerical Methods of Statistics
Videos (optional):
There are various videos from 2020 in the bCourses Media Gallery that you can use for reference if you want to.
- Video 1. Ill-conditioned problems, part 1
- Video 2. Ill-conditioned problems, part 2
- Video 3. Triangular systems of equations
- Video 4. Solving systems of equations via LU, part 1
- Video 5. Solving systems of equations via LU, part 2
- Video 6. Solving systems of equations via LU, part 3
- Video 7. Cholesky decomposition
In working through how to compute something or understanding an algorithm, it can be very helpful to depict the matrices and vectors graphically. We’ll see this on the board in class.
1. Preliminaries
Context
Many statistical and machine learning methods involve linear algebra of some sort - at the very least matrix multiplication and very often some sort of matrix decomposition to fit models and do analysis: linear regression, various more sophisticated forms of regression, deep neural networks, principal components analysis (PCA) and the wide varieties of generalizations and variations on PCA, etc., etc.
Goals
Here’s what I’d like you to get out of this unit:
- How to think about the computational order (number of computations involved) of a problem
- How to choose a computational approach to a given linear algebra calculation you need to do.
- An understanding of how issues with computer numbers (Unit 8) affect linear algebra calculations.
Key principle
The form of a mathematical expression and how it should be evaluated on a computer may be very different. Better computational approaches can increase speed and improve the numerical properties of the calculation.
- Example 1 (already seen in Unit 5): If $A$ and $B$ are matrices and $x$ is a vector, we should compute $A(Bx)$ rather than $(AB)x$; the former is much more computationally efficient.
- Example 2: We do not compute $(X^\top X)^{-1}X^\top Y$ by computing $X^\top X$ and finding its inverse. In fact, perhaps more surprisingly, we may never actually form $X^\top X$ in some implementations.
- Example 3: Suppose I have a matrix $A$, and I want to permute (switch) two rows. I can do this with a permutation matrix, $P$, which is mostly zeroes. On a computer, in general I wouldn't need to even change the values of $A$ in memory in some cases (e.g., if I were to calculate $PA$). Why not? (See the sketch just after this list.)
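Here is a minimal sketch of the point in Example 3 (the matrix, its size, and the particular permutation are arbitrary choices for illustration): the permuted result can be obtained by simply reindexing rows, without forming $P$ or touching the stored values of $A$.

```python
import numpy as np

A = np.arange(12.).reshape(3, 4)
perm = [1, 0, 2]                 # swap the first two rows
P = np.eye(3)[perm]              # elementary permutation matrix (mostly zeroes)

out1 = P @ A                     # does O(n^2 m) arithmetic, touches every element
out2 = A[perm, :]                # same result by reindexing; no arithmetic at all
print(np.allclose(out1, out2))   # True
```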
Computational complexity
We can assess the computational complexity of a linear algebra calculation by counting the number of multiplies/divides and the number of adds/subtracts. Sidenote: addition is a bit faster than multiplication, so some algorithms attempt to trade multiplication for addition.
In general we do not try to count the actual number of calculations, but just their order, though in some cases in this unit we'll actually get a more exact count. In general, we denote this as $O(f(n))$, which means that the number of calculations approaches $cf(n)$ as $n \to \infty$, i.e., the calculation is approximately proportional to $f(n)$.
For two symmetric, $n \times n$ matrices, matrix multiplication is $O(n^3)$: each of the $n^2$ elements of the product requires $n$ multiplies and $n-1$ adds.
When we have $O(n^p)$ for some power $p$, we say the calculation is polynomial in $n$.
In real calculations, it's possible to have the actual time ordering of two approaches differ from what the order approximations tell us. For example, an approach that is $O(n^2)$ but with a large constant of proportionality may be slower than an $O(n^3)$ approach with a small constant when $n$ is moderate.
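As a rough, hedged illustration of how these order arguments show up in practice (the size 3000 is an arbitrary choice), compare the two ways of computing $ABx$ from Example 1 above:

```python
import time
import numpy as np

rng = np.random.default_rng(seed=1)
n = 3000
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
x = rng.normal(size=n)

t0 = time.time()
res1 = (A @ B) @ x          # O(n^3): forms the n x n product AB first
t1 = time.time() - t0

t0 = time.time()
res2 = A @ (B @ x)          # O(n^2): two matrix-vector products
t2 = time.time() - t0

print(np.allclose(res1, res2), t1, t2)
```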
A note on terminology: flops stands for both floating point operations (the number of operations required) and floating point operations per second, the speed of calculation.
Notation and dimensions
I'll try to use capital letters for matrices, $A$, and lower-case letters for vectors, $x$.
Throughout, we'll need to be careful that the matrices involved in an operation are conformable: for $A + B$ the dimensions of $A$ and $B$ must match, while for $AB$ the number of columns of $A$ must equal the number of rows of $B$.
The inner product of two vectors is $\sum_i x_i y_i = x^\top y$.
The outer product is $xy^\top$, the matrix formed from all pairwise products of the elements.
When the indices of summation should be obvious, I’ll sometimes leave them implicit. Ask me if it’s not clear.
Norms
For a vector, the norm we'll use most is the Euclidean norm, $\|x\|_2 = \sqrt{\sum_i x_i^2} = \sqrt{x^\top x}$ (more generally, the $L_p$ norm is $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$).
One commonly used norm for a matrix is the Frobenius norm, $\|A\|_F = \left(\sum_{i,j} A_{ij}^2\right)^{1/2}$.
In this Unit, we'll often make use of the induced matrix norm, which is defined relative to a corresponding vector norm, $\|\cdot\|$, as $\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|}$.
We can interpret the norm of a matrix as the most that the matrix can stretch a vector when multiplying by the vector (relative to the length of the vector).
A property of any legitimate matrix norm (including the induced norm) is that $\|AB\| \le \|A\|\,\|B\|$.
A normalized vector is one with “length”, i.e., Euclidean norm, of one. We can easily normalize a vector: $\tilde{x} = x/\|x\|$.
The angle between two vectors, $x$ and $y$, is $\theta = \cos^{-1}\left(\frac{\langle x, y\rangle}{\sqrt{\langle x, x\rangle \langle y, y\rangle}}\right)$.
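A small numpy sketch of these quantities (the vectors are arbitrary):

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 0.0])

norm_x = np.sqrt(np.sum(x**2))     # Euclidean norm; same as np.linalg.norm(x)
x_unit = x / norm_x                # normalized (unit-length) version of x
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(cos_theta)       # angle between x and y, in radians
print(norm_x, x_unit, theta)
```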
Orthogonality
Two vectors are orthogonal if $x^\top y = 0$.
Permutations
Sometimes we make use of matrices that permute two rows (or two columns) of another matrix when multiplied. Such a matrix is known as an elementary permutation matrix and is an orthogonal matrix with a determinant of -1. You can multiply such matrices to get more general permutation matrices that are also orthogonal. If you premultiply by a permutation matrix, $P$, you permute rows; if you postmultiply by $P$, you permute columns.
Some vector and matrix properties
In Python, recall the syntax is `A @ B` for matrix multiplication (equivalently `np.matmul(A, B)`; `A.dot(B)` also works but is not recommended by the NumPy docs), `A * B` for the element-wise (Hadamard) product, and `A + B` for addition.
You don't need the spaces, but they're nice for code readability.
Trace and determinant of square matrices
The trace of a matrix is the sum of the diagonal elements, $\mathrm{tr}(A) = \sum_i A_{ii}$. For square matrices, $\mathrm{tr}(A+B) = \mathrm{tr}(A) + \mathrm{tr}(B)$ and $\mathrm{tr}(A) = \mathrm{tr}(A^\top)$.
We also have the cyclic property, $\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA)$. This is helpful in two ways:
- We can find the ordering that reduces computation the most if the individual matrices are not square.
- $x^\top A x = \mathrm{tr}(x^\top A x)$ since the quadratic form, $x^\top A x$, is a scalar, and this is equal to $\mathrm{tr}(Axx^\top)$, where $Axx^\top$ is a matrix. It can be helpful to be able to go back and forth between a scalar and a trace in some statistical calculations.
For square matrices, the determinant exists and we have $|AB| = |A||B|$ and therefore $|A^{-1}| = 1/|A|$, since $|AA^{-1}| = |I| = 1$.
Transposes and inverses
For square, invertible matrices, we have that $(A^{-1})^\top = (A^\top)^{-1}$.
For two invertible matrices, we have that $(AB)^{-1} = B^{-1}A^{-1}$, since $B^{-1}A^{-1}AB = I$.
Other matrix multiplications
The Hadamard or direct product is simply multiplication of the corresponding elements of two matrices by each other. In R this is simply `A * B` (and the same syntax works in Python).
Challenge: How can I find $\mathrm{tr}(AB)$ without explicitly calculating `A %*% B` (i.e., without doing the full matrix multiplication)?
The Kronecker product is the product of each element of one matrix with the entire other matrix:
$$A \otimes B = \begin{pmatrix} A_{11}B & \cdots & A_{1m}B \\ \vdots & \ddots & \vdots \\ A_{n1}B & \cdots & A_{nm}B \end{pmatrix}.$$
The inverse of a Kronecker product is the Kronecker product of the inverses, $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$,
which is obviously quite a bit faster because the inverse (i.e., solving a system of equations) in this special case is $O(n^3 + m^3)$ rather than $O((nm)^3)$ for the full $nm \times nm$ matrix.
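A quick numerical check of the Kronecker-inverse identity (a sketch; the small random matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(4, 4))

direct = np.linalg.inv(np.kron(A, B))                      # invert the 12 x 12 matrix
via_parts = np.kron(np.linalg.inv(A), np.linalg.inv(B))    # invert a 3 x 3 and a 4 x 4
print(np.allclose(direct, via_parts))                      # should agree up to rounding error
```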
Matrix decompositions
A matrix decomposition is a re-expression of a matrix, $A$, in terms of a product of two or three other, simpler matrices, where the simpler matrices have special properties, such as:
- having fewer rows or columns;
- being diagonal, triangular or sparse in some way,
- being orthogonal matrices.
In addition, once you have a decomposition, computation is generally easier, because of the special structure of the simpler matrices.
We’ll see this in great detail in Section 3.
2. Statistical interpretations of matrix invertibility, rank, etc.
Linear independence, rank, and basis vectors
A set of vectors, $x_1, \ldots, x_k$, is linearly independent (LIN) when none of the vectors can be represented as a linear combination, $\sum_i c_i x_i$, of the others.
Any set of linearly independent vectors (say $x_1, \ldots, x_k$) forms a basis for the space spanned by those vectors, and that space has dimension $k$.
Consider a regression context. We have $n$ observations and $p$ covariates, collected in an $n \times p$ design matrix, $X$; the rank of $X$ is the number of linearly independent columns.
When fitting a regression, if the columns of $X$ are not linearly independent (e.g., because of exact collinearity), then $X^\top X$ is not invertible and the regression coefficients are not uniquely identified.
Invertibility, singularity, rank, and positive definiteness
For square matrices, let’s consider how invertibility, singularity, rank and positive (or non-negative) definiteness relate.
Square matrices that are “regular” have an eigendecomposition, $A = \Gamma\Lambda\Gamma^{-1}$, where $\Gamma$ is a matrix with the eigenvectors as its columns and $\Lambda$ is a diagonal matrix of eigenvalues. Symmetric matrices are regular, with orthogonal $\Gamma$, so $A = \Gamma\Lambda\Gamma^\top$.
Let's focus on symmetric matrices. The symmetric matrices that tend to arise in statistics are either positive definite (p.d.) or non-negative definite (n.n.d.). If a matrix is positive definite, then by definition $x^\top A x > 0$ for any $x \ne 0$, and all of its eigenvalues are positive. In numpy, `numpy.linalg.eig(A)[1]`
is $\Gamma$, the matrix whose columns are the eigenvectors, and `numpy.linalg.eig(A)[0]`
contains the (unordered) eigenvalues.
To summarize, here are some of the various connections between mathematical and statistical properties of positive definite matrices:
And here are connections for positive semi-definite matrices:
Interpreting an eigendecomposition
Let's interpret the eigendecomposition in a generative context as a way of generating random vectors. We can generate $y$ with covariance matrix $A = \Gamma\Lambda\Gamma^\top$ as $y = \Gamma\Lambda^{1/2}z$, where $z$ is a vector of uncorrelated random variables with mean zero and variance one, since then $\mathrm{Cov}(y) = \Gamma\Lambda^{1/2}\mathrm{Cov}(z)\Lambda^{1/2}\Gamma^\top = A$.
To go the other direction, we can project a vector $y$ onto the space spanned by the eigenvectors: $z = \Gamma^\top y$ gives the coordinates of $y$ in that basis.
Some questions to consider:
- What does it mean when one or more eigenvalue (i.e., $\lambda_i = \Lambda_{ii}$) is zero?
- Suppose I have an eigenvalue that is very small and I set it to zero. What will be the impact upon $y$ and $\mathrm{Cov}(y)$?
- Now let's consider the inverse of a covariance matrix, known as the precision matrix, $A^{-1} = \Gamma\Lambda^{-1}\Gamma^\top$. What does it mean if a $(\Lambda^{-1})_{ii}$ is very large? What if it is very small?
Consider an arbitrary $n \times p$ matrix, $X$. Any crossproduct or sum-of-squares matrix, such as $X^\top X$, is non-negative definite.
Generalized inverses (optional)
Suppose I want to find $x$ such that $Ax = b$. Mathematically the answer (if $A$ is invertible) is $x = A^{-1}b$.
Generalized inverses arise in solving equations when $A$ is not invertible. A generalized inverse is a matrix, $A^-$, such that $AA^-A = A$; the Moore-Penrose inverse (the pseudo-inverse), $A^+$, is the unique generalized inverse that also satisfies some additional properties.
We can find the pseudo-inverse based on an eigendecomposition (or an SVD) as $A^+ = \Gamma\Lambda^+\Gamma^\top$, where $\Lambda^+$ replaces each non-zero eigenvalue by its reciprocal and keeps the zero eigenvalues at zero.
Let's consider a specific example. Autoregressive models are often used for smoothing (in time, in space, and in covariates). A first order autoregressive model for an evenly spaced sequence $y_1, \ldots, y_T$ says that each value should be close to a local average of its neighbors; the implied precision (inverse covariance) matrix is tridiagonal, like the one constructed in the code below.
If we look at the eigendecomposition of such matrices, we see that in the first order case, the eigenvalue corresponding to the constant eigenvector is zero.
import numpy as np
precMat = np.array([[1,-1,0,0,0],[-1,2,-1,0,0],[0,-1,2,-1,0],[0,0,-1,2,-1],[0,0,0,-1,1]])
e = np.linalg.eig(precMat)
e[0]  # 4th eigenvalue is numerically zero

array([3.61803399e+00, 2.61803399e+00, 1.38196601e+00, 4.97762256e-17,
       3.81966011e-01])

e[1][:,3]  # constant eigenvector

array([0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136])
This means we have no information about the overall level of $y$.
# Generate a realization.
evals = e[0]
evals = 1/evals  # variances
evals[3] = 0  # generalized inverse
rng = np.random.default_rng(seed=1)
y = e[1] @ ((evals ** 0.5) * rng.normal(size = 5))
y.sum()

-6.661338147750939e-16
In the second order case, we have two non-identifiabilities: for the sum and for the linear component of the variation in $y$ (i.e., variation that is linear in the indices of $y$).
I could parameterize a statistical model as $\mu + y$, where $y$ has covariance given by the generalized inverse discussed above, allowing a non-zero overall level via $\mu$ while the autoregressive structure governs the smooth variation.
Matrices arising in regression
In regression, we work with $X^\top X$, which is symmetric and non-negative definite (and positive definite when $X$ has full column rank).
Fitted values are $X\hat{\beta} = X(X^\top X)^{-1}X^\top Y = HY$, where the “hat” matrix, $H$, projects $Y$ onto the column space of $X$.
3. Computational issues
Storing matrices
We've discussed column-major and row-major storage of matrices. First, retrieval of matrix elements from memory is quickest when multiple elements are contiguous in memory. So in a column-major language (e.g., R, Fortran) it is best to work with values in a common column (or entire columns), while in a row-major language (e.g., Python, C) it is best to work with values in a common row.
In some cases, one can save space (and potentially speed) by overwriting the output from a matrix calculation into the space occupied by an input. This occurs in some clever implementations of matrix factorizations.
Algorithms
A good choice of algorithm can change the efficiency of a computation by one or more orders of magnitude, and many of the improvements in computational speed over recent decades have been in algorithms rather than in computer speed.
Most matrix algebra calculations can be done in multiple ways. For example, we could compute $b = Ax$ in either of the following ways, denoted here in pseudocode.
- Collect the inner products of the rows of $A$ with $x$:
# initialize b[1:n]=0
for(i=1:n) {
for(j=1:m){
b_i = b_i + a_{ij} x_j
}
}
- Take the linear combination (based on the elements of $x$) of the columns of $A$:
# initialize b[1:n]=0
for(j=1:m){
for(i = 1:n){
b_i = b_i + a_{ij} x_j
}
}
In this case the two approaches involve the same number of operations. But the first accesses the elements of $A$ row by row, while the second accesses them column by column, so which is faster depends on how the matrix is stored in memory (row-major vs. column-major).
Check whether the first approach is faster in Python with numpy’s default row-major ordering. (Write the code just doing the outer loop as a for loop and doing the inner loop using vectorized calculation.) Your answer will probably depend on how big the matrices are.
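Here's one possible way to set up that comparison (a sketch; the matrix size and variable names are arbitrary). The first version vectorizes over each row (contiguous in numpy's default row-major layout), the second over each column.

```python
import time
import numpy as np

rng = np.random.default_rng(seed=1)
n = m = 4000
A = rng.normal(size=(n, m))
x = rng.normal(size=m)

# Approach 1: inner products of the rows of A with x.
b1 = np.zeros(n)
t0 = time.time()
for i in range(n):
    b1[i] = np.sum(A[i, :] * x)
t_row = time.time() - t0

# Approach 2: linear combination of the columns of A.
b2 = np.zeros(n)
t0 = time.time()
for j in range(m):
    b2 += A[:, j] * x[j]
t_col = time.time() - t0

print(np.allclose(b1, b2), t_row, t_col)
```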
General computational issues
The same caveats we discussed in terms of computer arithmetic hold naturally for linear algebra, since this involves arithmetic with many elements. Good implementations of algorithms are aware of the danger of catastrophic cancellation and of the possibility of dividing by zero or by values that are near zero.
Ill-conditioned problems
Basics
A problem is ill-conditioned if small changes to values in the computation result in large changes in the result. This is quantified by something called the condition number of a calculation. For different operations there are different condition numbers.
Ill-conditionedness arises most often in terms of matrix inversion, so the standard condition number is the “condition number with respect to inversion”, which when using the Euclidean norm is the ratio of the absolute values of the largest and smallest eigenvalues of $A$ (more generally, of the largest and smallest singular values).
def norm2(x):
    return(np.sum(x**2) ** 0.5)

A = np.array([[10,7,8,7],[7,5,6,5],[8,6,10,9],[7,5,9,10]])
b = np.array([32,23,33,31])
x = np.linalg.solve(A, b)

bPerturbed = np.array([32.1, 22.9, 33.1, 30.9])
xPerturbed = np.linalg.solve(A, bPerturbed)

delta_b = bPerturbed - b
delta_x = xPerturbed - x
What's going on? Some manipulations with inequalities involving the induced matrix norm (for any chosen vector norm, but we might as well just think about the Euclidean norm) (see Gentle-CS Sec. 5.1 or the derivation in class) give
$$\frac{\|\delta x\|}{\|x\|} \le \mathrm{cond}(A)\,\frac{\|\delta b\|}{\|b\|},$$
where $\delta x$ and $\delta b$ are the perturbations of $x$ and $b$.
We see in the code below that the large disparity in eigenvalues of $A$ leads to an effect consistent with the bound above, with the condition number giving an upper bound on how much the relative perturbation can be magnified.
e = np.linalg.eig(A)
evals = e[0]
print(evals)

[3.02886853e+01 3.85805746e+00 1.01500484e-02 8.43107150e-01]

## relative perturbation in x much bigger than in b
norm2(delta_x) / norm2(x)

8.19847546803699

norm2(delta_b) / norm2(b)

0.0033319453118976702

## ratio of relative perturbations
(norm2(delta_x) / norm2(x)) / (norm2(delta_b) / norm2(b))

2460.567236431514

## ratio of largest and smallest magnitude eigenvalues
## confusingly evals[2] is the smallest, not evals[3]
(evals[0]/evals[2])

2984.092701676269
The main use of these ideas for our purposes is in thinking about the numerical accuracy of a linear system solution (Gentle-NLA Sec 3.4). On a computer we have the system $Ax = b$, but $A$ and $b$ are only represented to machine precision, so errors of relative size around machine epsilon ($\approx 10^{-16}$) in the inputs can be magnified by the condition number, giving relative errors of roughly $\mathrm{cond}(A) \times 10^{-16}$ in the computed solution.
Improving conditioning
Ill-conditioned problems in statistics often arise from collinearity of regressors. Often the best solution is not a numerical one, but re-thinking the modeling approach, as this generally indicates statistical issues beyond just the numerical difficulties.
A general comment on improving conditioning is that we want to avoid large differences in the magnitudes of numbers involved in a calculation. In some contexts such as regression, we can center and scale the columns to avoid such differences - this will improve the condition of the problem. E.g., in simple quadratic regression with the year as the covariate (as in the example below), the values of $\mathrm{year}^2$ are hugely larger than the values of $\mathrm{year}$ and of the intercept column, producing a badly conditioned $X^\top X$.
import statsmodels.api as sm
rng = np.random.default_rng(seed=1)

t1 = np.arange(1990, 2011) # naive covariate
n = len(t1)
X1 = np.column_stack((np.ones(n), t1, t1 ** 2))

beta = np.array([5, 0.1, 0.0001])
y = X1 @ beta + rng.normal(size = n)

e1 = np.linalg.eig(X1.T @ X1)
np.sort(e1[0])[::-1]

array([3.36018564e+14, 7.69949736e+02, 2.24079720e-08])

np.linalg.cond(X1.T @ X1) # built-in!

4.653329826789808e+21

sm.OLS(y, X1).fit().params

array([-1.14908044e+03, 1.28438902e+00, -2.03662592e-04])
The estimated coefficients are quite different from the true beta values of (5, 0.1, 0.0001), particularly the intercept and linear terms. This is primarily because of the ill-conditioning, rather than because of the usual sampling variability involved in estimating beta.
Now we’ll do a simple transformation, subtracting off the mean of the covariate, so that the covariate values (in particular the quadratic values) don’t have such large magnitudes.
t2 = t1 - 2000 # centered
X2 = np.column_stack((np.ones(n), t2, t2 ** 2))
e2 = np.linalg.eig(X2.T @ X2)
with np.printoptions(suppress=True):
    print(np.sort(e2[0])[::-1])

[50677.70427505 770. 9.29572495]

sm.OLS(y, X2).fit().params

array([ 6.05047224e+02, 4.69738650e-01, -2.03662592e-04])
We can work out that the effect of centering is that the true beta values after centering are (605, 0.5, 0.0001). (Consider substituting $t = \tilde{t} + 2000$ into $5 + 0.1t + 0.0001t^2$ and expanding.)
Interestingly, the estimate for the quadratic terms has not changed at all. This is quite surprising. There must be something subtle going on that causes that not to be affected by the bad conditioning, but I’m not sure what is happening there.
We could go even further to improve the conditioning.
t3 = t2/10 # centered and scaled
X3 = np.column_stack((np.ones(n), t3, t3 ** 2))
e3 = np.linalg.eig(X3.T @ X3)
with np.printoptions(suppress=True):
    print(np.sort(e3[0])[::-1])

[24.11293487 7.7 1.95366513]
I haven't shown the OLS results for the third version; with a bit of arithmetic (to convert back to the original parameterization), you should be able to verify that the third approach also gives reasonable answers.
The basic story is that simple strategies often solve the problem, and that you should be aware of the absolute and relative magnitudes involved in your calculations.
One rule of thumb is to try to work with numbers whose magnitude is around 1. We can often scale the values in our problem in order to do this. I.e., change the units of your variables. Instead of personal income in dollars, use personal income in thousands or hundreds of thousands of dollars.
4. Matrix factorizations (decompositions) and solving systems of linear equations
Suppose we want to solve the following linear system for $x$: $Ax = b$, i.e., mathematically, $x = A^{-1}b$.
Numerically, this is never done by finding the inverse and multiplying. Rather we solve the system using a matrix decomposition (or equivalent set of steps). One approach uses Gaussian elimination (equivalent to the LU decomposition), while another uses the Cholesky decomposition. There are also iterative methods that generate a sequence of approximations to the solution but reduce computation (provided they are stopped before the exact solution is found).
Gentle-CS has a nice table overviewing the various factorizations (Table 5.1, page 219). I’ve reproduced a variation on it here.
Name | Representation | Restrictions | Properties | Uses |
---|---|---|---|---|
LU | $A = LU$ | $A$ square (nonsingular; pivoting used in general) | $L$ lower triangular; $U$ upper triangular | solving equations; inversion |
QR | $A = QR$ | none | $Q$ orthogonal; $R$ upper triangular | regression |
Cholesky | $A = U^\top U$ | $A$ positive (semi-)definite | $U$ upper triangular | multivariate normal; covariance; solving equations; inversion |
Eigen decomposition | $A = \Gamma\Lambda\Gamma^\top$ | $A$ square, symmetric* | $\Gamma$ orthogonal; $\Lambda$ diagonal** | principal components analysis and related |
SVD | $A = UDV^\top$ | none | $U$, $V$ orthogonal; $D$ (non-negative) diagonal | machine learning, topic models |
*For the eigen decomposition, I assume $A$ is symmetric, although there is a decomposition for non-symmetric matrices as well.
**For positive definite or positive semi-definite $A$, the diagonal elements of $\Lambda$ (the eigenvalues) are positive or non-negative, respectively.
Triangular systems
As a preface, let's figure out how to solve $Ux = b$ when $U$ is upper triangular, using back-substitution:
- The last row of the system is $U_{nn}x_n = b_n$, so $x_n = b_n/U_{nn}$.
- Now for row $n-1$, use the already computed $x_n$ to calculate $x_{n-1} = (b_{n-1} - U_{n-1,n}x_n)/U_{n-1,n-1}$.
- Repeat for all rows, working upward.
How many multiplies and adds are done? Solving lower triangular systems is very similar and involves the same number of calculations.
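Here's a bare-bones sketch of the back-substitution just described (purely illustrative; in practice use `scipy.linalg.solve_triangular`). Counting the work in the loop gives the answer to the question above.

```python
import numpy as np

def backsolve(U, b):
    """Solve Ux = b for upper triangular U by back-substitution."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # one divide plus (n-1-i) multiplies and adds for row i
        x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

U = np.triu(np.random.default_rng(1).normal(size=(5, 5))) + 5 * np.eye(5)
b = np.ones(5)
print(np.allclose(backsolve(U, b), np.linalg.solve(U, b)))  # True
```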
In Scipy, we can use `linalg.solve_triangular` to solve triangular systems. The `trans` argument indicates whether to solve $Ax = b$ or $A^\top x = b$, and the `lower` argument indicates whether the matrix is lower or upper triangular.
import scipy as sp
rng = np.random.default_rng(seed=1)
n = 20
X = rng.normal(size = (n,n))
## R has the `crossprod` function, which would be more efficient
## than having to transpose, but numpy doesn't seem to have an equivalent.
A = X.T @ X # produce a positive def. matrix

b = rng.normal(size = n)
L = np.linalg.cholesky(A) # L is lower-triangular
U = L.T

out1 = sp.linalg.solve_triangular(L, b, lower=True)
out2 = np.linalg.inv(L) @ b
np.allclose(out1, out2)

True

out3 = sp.linalg.solve_triangular(U, b, lower=False)
out4 = np.linalg.inv(U) @ b
np.allclose(out3, out4)

True
To reiterate the distinction between matrix inversion and solving a system of equations, when we write $A^{-1}b$ we should think “solve the system $Ax = b$ (e.g., via a factorization)” rather than “compute $A^{-1}$ and then multiply it by $b$”.
Here’s a good reason why.
import time
rng = np.random.default_rng(seed=1)
n = 5000
X = rng.normal(size = (n,n))

## R has the `crossprod` function, which would be more efficient
## than having to transpose, but numpy doesn't seem to have an equivalent.
A = X.T @ X
b = rng.normal(size = n)
L = np.linalg.cholesky(A) # L is lower-triangular

t0 = time.time()
out1 = sp.linalg.solve_triangular(L, b, lower=True)
time.time() - t0

0.03161954879760742

t0 = time.time()
out2 = np.linalg.inv(L) @ b
time.time() - t0

4.863181114196777
That assumes you have the triangular factor, $L$, already in hand; the large gap arises because computing the inverse is an $O(n^3)$ operation, while the triangular solve is only $O(n^2)$.
Gaussian elimination (LU decomposition)
Gaussian elimination is a standard way of directly computing a solution for $Ax = b$.
The idea of Gaussian elimination is to convert the problem to a triangular system. In class, we'll walk through Gaussian elimination in detail and see how it relates to the LU decomposition. I'll describe it more briefly here. Following what we learned in algebra when we have multiple equations, we preserve the solution, $x$, when we add multiples of one equation (row) to another equation (row).
If we're just looking for the solution of the system, we don't need the lower-triangular factor, $L$, of the LU decomposition ($A = LU$) explicitly, but the multipliers used during the elimination turn out to give it to us for free.
In class, we'll work out the computational complexity of the LU and see that it is $O(n^3)$ (roughly $n^3/3$ multiplies).
If we look at help(np.linalg.solve)
in Python, we see that it uses `_gesv`. A Google search indicates that this is a LAPACK routine that does the LU decomposition with partial pivoting and row interchanges (see below on what these are), so numpy is using the algorithm we've just discussed.
We can also explicitly get the LU decomposition in Python with scipy.linalg.lu()
, though for most use cases, what we want to do is solve a system of equations. (In R, one can’t easily get the explicit LU decomposition, though solve()
in R does use the LU.)
One additional complexity is that we want to avoid dividing by very small values to avoid introducing numerical inaccuracy (we would get large values that might overwhelm whatever they are being added to, and small errors in the divisor will have large effects on the result). This can be done on the fly by interchanging equations to use the equation (row) that produces the largest value to divide by. For example in the first step, we would switch the first equation (first row) for whichever of the remaining equations has the largest value in the first column. This is called partial pivoting. The divisors are called pivots. Complete pivoting also considers interchanging columns, and while theoretically better, partial pivoting is generally sufficient and requires fewer computations. Partial pivoting can be expressed as multiplying along the way by elementary permutation matrices, each of which swaps two rows.
Finally, the result is a factorization of a row-permuted version of $A$: $PA = LU$, where $P$ is the product of those elementary permutation matrices.
When would we explicitly invert a matrix?
In some cases (as we discussed in class) you actually need the inverse as the output (e.g., for estimating standard errors).
But if you are just needing the inverse to multiply it by a vector or another matrix, you’re better off using a decomposition. Basically, one can work out that the cost of computing the inverse and doing subsequent multiplication with it is rather more than computing the decomposition and doing subsequent multiplications with it, regardless of how many columns there are in the matrix you are multiplying by.
With numpy, that may mean computing the decomposition and then carrying out the steps to do the multiplication using the decomposition or using np.linalg.solve(A, B)
which, as mentioned above, uses the LU behind the scenes. One would not do `np.linalg.inv(A) @ B`.
Cholesky decomposition
When $A$ is positive definite, we can factor it as $A = U^\top U$, where $U$ is upper triangular (the Cholesky factor). The algorithm computes $U$ one row at a time:
- $U_{11} = \sqrt{A_{11}}$ and, for $j = 2, \ldots, n$, $U_{1j} = A_{1j}/U_{11}$.
- For $i = 2, \ldots, n$:
  - $U_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} U_{ki}^2}$;
  - if $i < n$, then for $j = i+1, \ldots, n$: $U_{ij} = \left(A_{ij} - \sum_{k=1}^{i-1} U_{ki}U_{kj}\right)/U_{ii}$.
We can then solve a system of equations $Ax = b$ as: first solve $U^\top z = b$ by forward substitution, then solve $Ux = z$ by back substitution.
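Here is an unoptimized sketch of the algorithm above, just to make the recursion concrete (use `np.linalg.cholesky` or `scipy.linalg.cholesky` in practice):

```python
import numpy as np

def chol_upper(A):
    """Return upper-triangular U with A = U.T @ U, following the algorithm above."""
    n = A.shape[0]
    U = np.zeros_like(A, dtype=float)
    for i in range(n):
        U[i, i] = np.sqrt(A[i, i] - np.sum(U[:i, i]**2))                 # diagonal element
        for j in range(i + 1, n):
            U[i, j] = (A[i, j] - np.sum(U[:i, i] * U[:i, j])) / U[i, i]  # rest of row i
    return U

rng = np.random.default_rng(1)
Z = rng.normal(size=(4, 4))
A = Z.T @ Z                      # a positive definite test matrix
U = chol_upper(A)
print(np.allclose(U.T @ U, A))   # True
```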
Confusingly, while numpy's `cholesky` gives $L$ (lower triangular, with $A = LL^\top$), scipy's `cholesky` can return either $U$ or $L$ (upper triangular by default, controlled by the `lower` argument).
Here are two ways we can use the Cholesky to solve a system of equations:
U = sp.linalg.cholesky(A)
sp.linalg.solve_triangular(U,
    sp.linalg.solve_triangular(U, b, lower=False, trans='T'),
    lower=False)

U, lower = sp.linalg.cho_factor(A)
sp.linalg.cho_solve((U, lower), b)
The Cholesky has some nice advantages over the LU: (1) while both are $O(n^3)$, the Cholesky involves only about half as many computations (roughly $n^3/6$ multiplies rather than $n^3/3$), and (2) the Cholesky factor involves only $(n^2+n)/2$ unique values rather than $n^2+n$ for the LU. Of course the Cholesky requires that $A$ be positive definite.
Uses of the Cholesky
The standard algorithm for generating $y \sim \mathcal{N}(0, A)$ is to generate a vector of independent standard normals, $z$, and compute $y = Lz$, where $L$ is the (lower-triangular) Cholesky factor of $A$, since then $\mathrm{Cov}(y) = LL^\top = A$.
L = sp.linalg.cholesky(A, lower=True)
y = L @ rng.normal(size = n)
Where will most of the time in this two-step calculation be spent?
If a regression design matrix, $X$, is full rank, then $X^\top X$ is positive definite, so we could use the Cholesky on $X^\top X$ to solve the normal equations, $X^\top X\beta = X^\top Y$.
However, for OLS, it turns out that the standard approach is to work with $X$ itself via the QR decomposition (discussed below) rather than forming $X^\top X$.
We could also consider the GLS estimator, which accounts for dependence in the errors: for $Y \sim \mathcal{N}(X\beta, \Sigma)$, $\hat{\beta} = (X^\top\Sigma^{-1}X)^{-1}X^\top\Sigma^{-1}Y$.
Write efficient Python code to carry out the GLS solution using the Cholesky factorization.
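One possible sketch (not necessarily the intended solution to the exercise; the helper name and test data are made up): factor $\Sigma = LL^\top$, “whiten” $X$ and $Y$ with triangular solves, and then solve the resulting ordinary least squares problem, since $\Sigma^{-1} = L^{-\top}L^{-1}$.

```python
import numpy as np
import scipy as sp

def gls_via_cholesky(X, y, Sigma):
    L = np.linalg.cholesky(Sigma)                        # Sigma = L @ L.T
    Xw = sp.linalg.solve_triangular(L, X, lower=True)    # L^{-1} X
    yw = sp.linalg.solve_triangular(L, y, lower=True)    # L^{-1} y
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)       # OLS on the whitened problem
    return beta

# Arbitrary test case with an AR(1)-style covariance.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
Sigma = np.exp(-np.abs(np.subtract.outer(np.arange(n), np.arange(n))) / 5.)
y = X @ np.array([1., 2., 3.]) + np.linalg.cholesky(Sigma) @ rng.normal(size=n)
print(gls_via_cholesky(X, y, Sigma))
```

Each triangular solve costs $O(n^2)$ per column, so the whitening step avoids ever forming $\Sigma^{-1}$ explicitly.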
Numerical issues with eigendecompositions and Cholesky decompositions for positive definite matrices
Monahan comments that in general Gaussian elimination and the Cholesky decomposition are very stable. However, in the Cholesky case, if the matrix is very ill-conditioned we can get $A_{ii} - \sum_{k=1}^{i-1} U_{ki}^2$ being negative, and then the algorithm stops when we try to take the square root.
rng = np.random.default_rng(seed=1)
locs = rng.uniform(size = 100)
rho = .1
dists = np.abs(locs[:, np.newaxis] - locs)
C = np.exp(-dists**2/rho**2)
e = np.linalg.eig(C)
np.sort(e[0])[::-1][96:100]

array([-3.73181266e-16-5.84190120e-17j, -4.67035510e-16+1.89571791e-16j,
       -4.67035510e-16-1.89571791e-16j, -9.16675579e-16+0.00000000e+00j])
try:
    L = np.linalg.cholesky(C)
except Exception as error:
    print(error)

Matrix is not positive definite
vals = np.abs(e[0])
np.max(vals)/np.min(vals)

4.4428831436221965e+17
I don't see a way to use pivoting with the Cholesky in Python, but in R, one can do `chol(C, pivot = TRUE)`.
We can think about the accuracy here as follows. Suppose we have a matrix whose diagonal elements (i.e., the variances) are order of magnitude 1 and that the true value of a given eigenvalue is around $10^{-16}$ or smaller. Since the absolute errors in the computed eigenvalues are themselves on the order of machine epsilon ($\approx 10^{-16}$), we cannot distinguish such an eigenvalue from zero or from a small negative value.
Note that when the Cholesky fails, we can still compute an eigendecomposition, but we have negative numeric eigenvalues. Even if all the eigenvalues are numerically positive (or equivalently, we're able to get the Cholesky), errors in small eigenvalues near machine precision could have large effects when we work with the inverse of the matrix. This is what happens when we have columns of the design matrix, $X$, that are nearly collinear.
A strategy when working with mathematically but not numerically positive definite matrices is to add a small amount to the diagonal (e.g., work with $C + \epsilon I$ for small $\epsilon$), which shifts all the eigenvalues up by $\epsilon$; in the covariance context this can be interpreted as adding a small amount of independent noise.
QR decomposition
Introduction
The QR decomposition is available for any matrix: $X = QR$, with $Q$ orthogonal and $R$ upper triangular.
There are three standard approaches for computing the QR, using (1) reflections (Householder transformations), (2) rotations (Givens transformations), or (3) Gram-Schmidt orthogonalization (see below for details).
For square, full-rank $X$, the QR gives us $X^{-1} = R^{-1}Q^\top$, which is easy to work with since $R$ is triangular and $Q^{-1} = Q^\top$.
We can also obtain the pseudo-inverse of $X$ from the QR in the non-square or reduced-rank case.
Regression and the QR
Often the QR is used to fit linear models, including in R. Consider the linear model in the form $Y = X\beta + \epsilon$. Using the skinny QR, $X = QR$, the normal equations $X^\top X\beta = X^\top Y$ become $R^\top Q^\top QR\beta = R^\top Q^\top Y$, i.e., $R\beta = Q^\top Y$ (since $Q^\top Q = I$ and $R^\top$ is invertible),
and solving for $\hat{\beta}$ is just a backsolve since $R$ is upper triangular.
Why use the QR instead of the Cholesky on $X^\top X$? The condition number of $X$ is the square root of that of $X^\top X$, so working with $X$ directly via the QR avoids squaring the condition number and is more numerically stable.
What about computational order of the different approaches to least squares? The Cholesky approach is about $np^2 + p^3/3$ operations (forming $X^\top X$ plus the factorization), while the Householder QR is roughly $2np^2 - 2p^3/3$, so for $n \gg p$ the QR is about twice as expensive; the tradeoff is speed versus numerical stability.
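As a hedged sketch of the QR route for OLS (this is not the exact implementation used by R or `statsmodels`; the data here are simulated just for the comparison):

```python
import numpy as np
import scipy as sp

rng = np.random.default_rng(seed=1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1., 2., 3., 4.]) + rng.normal(size=n)

Q, R = np.linalg.qr(X)                                           # skinny QR: Q is n x p, R is p x p
beta_hat = sp.linalg.solve_triangular(R, Q.T @ y, lower=False)   # backsolve R beta = Q'y
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```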
Regression and the QR in Python and R
We can get the Q and R matrices easily in Python.
Q, R = np.linalg.qr(X)
One of the methods used by the `statsmodels`
package in Python uses the QR to fit a regression.
Note that by default in Python (and in R), you get the skinny QR, namely only the first $p$ columns of $Q$ (so $Q$ is $n \times p$ with orthonormal columns) and the corresponding $p \times p$ upper triangular $R$.
Regression in R uses the QR decomposition via qr()
, which calls a Fortran function. qr()
(and the Fortran functions that are called) is specifically designed to output quantities useful in fitting linear models.
In R, `qr()` returns the result as a list meant for use by other tools. R stores the $R$ matrix in the upper triangle of `$qr`, while the lower triangle of `$qr` and `$qraux` store the information for constructing $Q$ (this relates to the Householder approach discussed below).
X.qr = qr(X)
Q = qr.Q(X.qr)
R = qr.R(X.qr)
As a side note, there are QR-based functions that provide regression-related quantities, such as qr.resid(), qr.fitted() and qr.coef(). These functions (and their Fortran counterparts) exist because one can work through the various regression quantities of interest and find their expressions in terms of $Q$ and $R$.
Computing the QR decomposition
Here we’ll see some of the details of the different approaches to the QR, in part because they involve some concepts that may be useful in other contexts. I won’t expect you to see all of how this works, but please skim through this to get an idea of how things are done.
One approach involves reflections of vectors and a second rotations of vectors. Reflections and rotations are transformations that are performed by orthogonal matrices. The determinant of a reflection matrix is -1 and the determinant of a rotation matrix is 1. We’ll see some of the details in the demo code.
QR Method 1: Reflections
If $u$ is a unit vector, reflecting a vector $x$ across the hyperplane orthogonal to $u$ flips the component of $x$ in the $u$ direction while leaving the components orthogonal to $u$ unchanged.
Suppose we want to formulate the reflection in terms of a “Householder” matrix, $Q$. The reflection just described is $Qx$ with $Q = I - 2uu^\top$; $Q$ is both symmetric and orthogonal.
One way to create the QR decomposition is by a series of Householder transformations that create an upper triangular $R$ from $X$: $R = Q_p \cdots Q_1 X$ and $Q = (Q_p \cdots Q_1)^\top = Q_1 \cdots Q_p$,
where we make use of the symmetry in defining $Q$.
Basically $Q_1$ reflects the first column of $X$ so that the result is all zeroes except for the first element; $Q_2$ then zeroes out the elements below the diagonal in the second column, and so on.
In the regression context, as we work through the individual transformations, we can apply them to $Y$ as we go, so that at the end we have $Q^\top Y$ available for the backsolve without ever forming $Q$ explicitly.
Final side note: if we need to multiply a vector by one of the Householder matrices, we never need to form the matrix, since $Q_j x = x - 2u(u^\top x)$ requires only the vector $u$ and is $O(n)$ rather than $O(n^2)$.
QR Method 2: Rotations
A Givens rotation matrix rotates a vector in a two-dimensional subspace to be axis oriented with respect to one of the two dimensions by changing the value of the other dimension. E.g., we can zero out a chosen element of a vector by rotating it into another element (which absorbs the combined length of the two), via multiplication by an orthogonal rotation matrix.
We can use a series of Givens rotations to do the QR but unless it is done carefully, more computations are needed than with Householder reflections. The basic story is that we apply a series of Givens rotations to $X$ that zero out the elements below the diagonal, one at a time, giving $R$, with $Q$ accumulated as the product of the rotations.
Note that we create the zeroes in a careful order so that rotations applied later do not destroy the zeroes created earlier.
QR Method 3: Gram-Schmidt Orthogonalization
Gram-Schmidt involves finding a set of orthonormal vectors to span the same space as a set of LIN vectors, $x_1, \ldots, x_p$ (think of these as the columns of $X$). Here is the (modified) algorithm:
- $\tilde{x}_1 = \frac{x_1}{\|x_1\|}$ (normalize the first vector).
- Orthogonalize the remaining vectors with respect to $\tilde{x}_1$:
  - $\tilde{x}_2 = \frac{x_2 - \tilde{x}_1^\top x_2\,\tilde{x}_1}{\|x_2 - \tilde{x}_1^\top x_2\,\tilde{x}_1\|}$, which orthogonalizes with respect to $\tilde{x}_1$ and normalizes. Note that $\tilde{x}_1^\top x_2\,\tilde{x}_1 = \langle \tilde{x}_1, x_2\rangle\tilde{x}_1$. So we are finding a scaling, $c\tilde{x}_1$, where $c$ is based on the inner product, to remove the variation in the $\tilde{x}_1$ direction from $x_2$.
  - For $k > 2$, find interim vectors, $x_k^{(2)}$, by orthogonalizing with respect to $\tilde{x}_1$.
- Proceed for $k = 3, \ldots, p$, in turn orthogonalizing and normalizing the first of the remaining vectors w.r.t. $\tilde{x}_{k-1}$ and orthogonalizing the remaining vectors w.r.t. $\tilde{x}_{k-1}$ to get new interim vectors.
Mathematically, we could instead orthogonalize $x_2$ w.r.t. $\tilde{x}_1$, then orthogonalize $x_3$ w.r.t. $\{\tilde{x}_1, \tilde{x}_2\}$, and so on (classical Gram-Schmidt); the algorithm above is the modified Gram-Schmidt, which is more numerically stable when the columns are close to collinear.
Another way to think about this is that the resulting orthonormal vectors are the columns of $Q$ and the weights used along the way are the elements of $R$ (so $R = Q^\top X$), expressing each original column of $X$ as a linear combination of the columns of $Q$.
The “tall-skinny” QR
Suppose you have a very large regression problem, with $n$ very large and $n \gg p$. There is a variant of the QR, called the tall-skinny QR, that allows us to do the decomposition in parallel. The basic idea is to compute a QR decomposition on each of a set of row blocks of $X$,
$$X = \begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} Q_0 R_0 \\ Q_1 R_1 \\ Q_2 R_2 \\ Q_3 R_3 \end{pmatrix},$$
followed by ‘reduction’ steps (this can be done in a map-reduce context) that do the QR of stacked pairs of the resulting $R$ factors, combining them until a single $p \times p$ $R$ remains.
The full decomposition is then $X = QR$, where $R$ is that final upper triangular factor and $Q$ is built up implicitly from the $Q$ factors computed at each level.
The computation can be done in parallel (in particular it can be done with map-reduce) and the $Q$ matrix for big problems would generally not be formed explicitly but rather kept in terms of its constituent pieces.
Alternatively, there is a variant on the algorithm that processes the row-blocks of $X$ serially, which is useful when the rows of $X$ cannot all be held in memory at once.
Determinants
The absolute value of the determinant of a square matrix can be found from the product of the diagonals of the triangular matrix in any factorization that gives a triangular (including diagonal) matrix times an orthogonal matrix (or matrices) since the determinant of an orthogonal matrix is either one or minus one.
(Note of course that that is for square matrices; the determinant is not defined for non-square matrices.)
In Python, the following will do it (on the log scale).
Q, R = np.linalg.qr(A)
magn = np.sum(np.log(np.abs(np.diag(R))))
An alternative is the product of the diagonal elements of $D$ (the singular values) in the SVD, $A = UDV^\top$.
For non-negative definite matrices, we know the determinant is non-negative, so the uncertainty about the sign is not an issue. For positive definite matrices, a good approach is to use the product of the diagonal elements of the Cholesky decomposition.
One can also use the product of the eigenvalues: $|A| = |\Gamma\Lambda\Gamma^{-1}| = |\Gamma||\Lambda||\Gamma^{-1}| = |\Lambda| = \prod_i \lambda_i$.
Computation
Computing from any of these diagonal or triangular matrices as the product of the diagonals is prone to overflow and underflow, so we always work on the log scale as the sum of the log of the values. When some of these may be negative, we can always keep track of the number of negative values and take the log of the absolute values.
Often we will have the factorization as a result of other parts of the computation, so we get the determinant for free.
We can use np.linalg.slogdet()
or (definitely not recommended) np.linalg.det()
to calculate the determinant in Python. These functions use the LU decomposition.
5. Eigendecomposition and SVD
Eigendecomposition
The eigendecomposition (spectral decomposition) is useful in considering convergence of algorithms and of course for statistical decompositions such as PCA. We think of decomposing the components of variation into orthogonal patterns (the eigenvectors) with variances (eigenvalues) associated with each pattern.
Square symmetric matrices have real eigenvectors and eigenvalues, with the factorization into an orthogonal matrix of eigenvectors and a diagonal matrix of eigenvalues, $A = \Gamma\Lambda\Gamma^\top$.
The spectral radius of $A$ is the maximum of the absolute values of the eigenvalues.
Computation
There are several methods for eigenvalues; a common one for doing the full eigendecomposition is the QR algorithm. The first step is to reduce $A$ to upper Hessenberg form (tridiagonal when $A$ is symmetric); the algorithm then iterates QR decompositions of that reduced matrix, converging to a form whose diagonal contains the eigenvalues.
We won’t go into the algorithm in detail, but note that it involves manipulations and ideas we’ve seen already.
If only the largest (or the first few largest) eigenvalues and their eigenvectors are needed, which can come up in time series and Markov chain contexts, the problem is easier and can be solved by the power method. E.g., in a Markov chain context, steady state is reached through repeated multiplication by the transition matrix, $x_t = A^t x_0$, so the long-run behavior is governed by the dominant eigenvalue and its eigenvector; the power method finds them by repeatedly multiplying a starting vector by $A$ and renormalizing.
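A minimal sketch of the power method (assuming here, for simplicity, a symmetric matrix with a well-separated largest eigenvalue):

```python
import numpy as np

def power_method(A, n_iter=200):
    z = np.random.default_rng(1).normal(size=A.shape[0])
    for _ in range(n_iter):
        z = A @ z
        z = z / np.linalg.norm(z)      # renormalize to avoid overflow/underflow
    lam = z @ A @ z                    # Rayleigh quotient estimate of the top eigenvalue
    return lam, z

Z = np.random.default_rng(2).normal(size=(50, 50))
A = Z.T @ Z                            # symmetric, non-negative definite test matrix
lam, z = power_method(A)
print(np.isclose(lam, np.max(np.linalg.eigvalsh(A))))  # True
```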
Singular value decomposition
Let's consider an $n \times m$ matrix, $A$, with $n \ge m$ (if $m > n$, we can always work with $A^\top$). This is often a matrix representing $m$ features of $n$ observations.
The SVD can be represented in more than one way. One representation is $A_{n\times m} = U_{n\times k} D_{k\times k} V_{k\times m}^\top = \sum_{j=1}^{k} D_{jj} u_j v_j^\top$, where the columns of $U$ and $V$ (the $u_j$ and $v_j$) are orthonormal, $D$ is diagonal with non-negative elements (the singular values), and $k$ is the rank of $A$.
That representation is as the sum of rank-one matrices (since each term is the scaled outer product of two vectors).
If $A$ is positive semi-definite, the eigendecomposition is an SVD.
We can also fill out the matrices to get $A = U_{n\times n} D_{n\times m} V_{m\times m}^\top$, where the extra rows and columns of $D$ are zero and the added columns of $U$ (and $V$) are orthonormal vectors completing an orthonormal basis.
Uses
The SVD is an excellent way to determine a matrix rank and to construct a pseudo-inverse ($A^+ = VD^+U^\top$, where $D^+$ inverts the non-zero singular values and leaves the zeros at zero).
We can use the SVD to approximate $A$ by taking $A \approx \tilde{A} = \sum_{j=1}^{q} D_{jj} u_j v_j^\top$ for $q < k$, i.e., keeping only the largest singular values; this is the best rank-$q$ approximation to $A$ in the Frobenius norm.
Here's another way to think about the SVD in terms of transformations and bases. Applying the SVD to a vector, $UDV^\top x$, carries out the following steps:
- $V^\top x$ expresses $x$ in terms of weights for the columns of $V$.
- Multiplying the result by $D$ scales/stretches the weights.
- Multiplying by $U$ produces the result, which is a weighted combination of columns of $U$, spanning the column-space of $A$.
So applying the SVD transforms $x$ from being expressed in one orthonormal basis (the columns of $V$) to being expressed in another (the columns of $U$), with a scaling in between.
Computation
The basic algorithm (Golub-Reinsch) is similar to the QR method for the eigendecomposition. We use a series of Householder transformations on the left and right to reduce $A$ to an upper bidiagonal matrix, and then an iterative algorithm (analogous to the QR algorithm, using Givens rotations) drives the remaining off-diagonal elements to zero.
This eventually gives the factorization $A = UDV^\top$, with $U$ and $V$ accumulated from the transformations applied along the way.
Computation for large tall-skinny matrices
The SVD can also be generated from a QR decomposition. Take $X = QR$ and then compute the SVD of the small $p \times p$ matrix, $R = UDV^\top$; then $X = (QU)DV^\top$, so the left singular vectors are $QU$ and the singular values and right singular vectors come from the SVD of $R$.
6. Computation
Linear algebra in Python
Note that for many matrix decompositions, you can change whether all of the aspects of the decomposition are returned, or just some, which may speed calculations.
Given the importance of linear algebra to many (most) statistical and machine learning algorithms, for large problems it's important to check whether the linear algebra is being done using a fast linear algebra package, in parallel, and potentially on the GPU. How the linear algebra is done on the back end can make a huge difference in speed.
BLAS and LAPACK
In general, matrix operations in Python and R go to compiled C or Fortran code without much intermediate Python or R code, so they can actually be pretty efficient and are based on the best algorithms developed by numerical experts. The core libraries that are used are LAPACK and BLAS (the Linear Algebra PACKage and the Basic Linear Algebra Subroutines). As we’ve discussed in the parallelization unit, one way to speed up code that relies heavily on linear algebra is to make sure you have a BLAS library tuned to your machine. These include OpenBLAS (open source), Intel’s MKL, AMD’s ACML, and Apple’s vecLib. On newer Macs (Apple Silicon-based, namely using the M1, M2, M3, M4 chips), vecLib uses the Mac’s AMX co-processor.
If you use Conda, numpy will generally be linked against MKL or OpenBLAS (this will depend on the locations online of the packages being installed, i.e., the channel(s) used). With pip, numpy will generally be linked against OpenBLAS. R can be linked to the shared object library file (.so file or .dylib on a Mac) for a fast BLAS. These BLAS libraries are also available in threaded versions that farm out the calculations across multiple cores or processors that share memory. More details are available in this SCF documentation.
BLAS routines do vector operations (level 1), matrix-vector operations (level 2), and dense matrix-matrix operations (level 3). Often the name of the routine has as its first letter “d”, “s”, “c” to indicate the routine is double precision, single precision, or complex. LAPACK builds on BLAS to implement standard linear algebra routines such as eigendecomposition, solutions of linear systems, a variety of factorizations, etc.
JAX and PyTorch (and GPUs)
Python packages such as JAX and PyTorch provide alternative linear algebra implementations that can exploit parallelization on CPUs and GPUs, automatically using the GPU if it is available. As discussed in Unit 5, JAX can also do just-in-time compilation.
The SCF parallelization tutorial has an example of doing matrix multiplication using PyTorch either on the CPU or a GPU.
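The tutorial's exact code isn't reproduced here; the following is a hedged sketch of the general pattern (the matrix size is arbitrary, and the GPU branch runs only if CUDA is available):

```python
import torch

n = 4000
device = "cuda" if torch.cuda.is_available() else "cpu"

A = torch.randn(n, n, device=device)
B = torch.randn(n, n, device=device)

C = A @ B                      # runs on the GPU if device == "cuda", else on the CPU
if device == "cuda":
    torch.cuda.synchronize()   # GPU work is asynchronous; wait for it to finish
print(C.shape, device)
```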
The same tutorial also has an example of doing matrix multiplication using JAX, either on the CPU or on the GPU. Use of the CPU doesn’t give much speed up compared to numpy since as noted above numpy linked to a parallel BLAS will use multiple threads.
Exploiting known structure in matrices
Speedups and storage savings can be obtained by working with matrices stored in special formats when the matrices have special structure. E.g., we might store a symmetric matrix as a full matrix but only use the upper or lower triangle. Banded matrices and block diagonal matrices are other common formats. Banded matrices are all zero except for elements within some distance of the diagonal, i.e., $A_{ij} = 0$ when $|i - j| > c$ for some bandwidth $c$.
Scipy provides functionality for working with matrices in various ways, including the scipy.sparse
module, which provides support for structured sparse matrices such as triangular and diagonal matrices as well as unstructured sparse matrices using various standard representations.
Some useful packages in R for matrices are Matrix, spam, and bdsmatrix. Matrix can represent a variety of rectangular matrices, including triangular, orthogonal, diagonal, etc. and provides methods for various matrix calculations that are specific to the matrix type. spam handles general sparse matrices with fast matrix calculations, in particular a fast Cholesky decomposition. bdsmatrix focuses on block-diagonal matrices, which arise frequently in contexts where there is clustering that induces within-cluster correlation and cross-cluster independence.
Sparse matrix formats
As an example of exploiting sparsity, we can use a standard format (CSR = compressed sparse row) in the Scipy sparse
module:
Consider the matrix to be row-major and store the non-zero elements in order in an array called `data`. Then create an array called `indptr` that stores, for each row, the position in `data` where that row's elements start. Finally, have an array, `indices`, that gives the column index of each element.
import scipy.sparse as sparse
mat = np.array([[0,0,1,0,10],[0,0,0,100,0],[0,0,0,0,0],[1000,0,0,0,0]])
mat = sparse.csr_array(mat)
mat.data

array([ 1, 10, 100, 1000])

mat.indices # column indices

array([2, 4, 3, 0], dtype=int32)

mat.indptr # row pointers

array([0, 2, 3, 3, 4], dtype=int32)

## Ideally don't first construct the dense matrix if it is large.
mat2 = sparse.csr_array((mat.data, mat.indices, mat.indptr))
mat2.toarray()
array([[ 0, 0, 1, 0, 10],
[ 0, 0, 0, 100, 0],
[ 0, 0, 0, 0, 0],
[1000, 0, 0, 0, 0]])
That’s also how things are done in the spam
package in R.
We can do a fast matrix multiply, $x = Ab$, of a sparse matrix $A$ with a dense vector $b$ without ever working with the dense form of $A$, as in this R-style pseudocode:
for(i in 1:nrows(A)) {
  x[i] = 0
  # should also check that row is not empty...
  for(j in rowpointers[i]:(rowpointers[i+1]-1)) {
    x[i] = x[i] + entries[j] * b[colindices[j]]
  }
}
How many computations have we done? Only about $k$ multiplies and $k$ adds, where $k$ is the number of non-zero elements of $A$, rather than the $nm$ multiplies of a dense matrix-vector product.
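Here is the same loop written as a (slow, purely illustrative) Python function operating on the CSR arrays of `mat2` from the example above; it does one multiply and one add per stored non-zero.

```python
import numpy as np

def csr_matvec(data, indices, indptr, b, nrow):
    """Compute A @ b using only the CSR arrays of A."""
    x = np.zeros(nrow)
    for i in range(nrow):
        for j in range(indptr[i], indptr[i + 1]):   # non-zeros in row i (empty rows are skipped)
            x[i] += data[j] * b[indices[j]]
    return x

b = np.arange(1., 6.)
print(csr_matvec(mat2.data, mat2.indices, mat2.indptr, b, 4))
print(mat2 @ b)   # same result using scipy's sparse multiply
```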
Note that for the Cholesky of a sparse matrix, if the sparsity pattern is fixed, but the entries change, one can precompute an optimal re-ordering that retains as much sparsity in the Cholesky factor as possible.
Banded matrices
Suppose we have a banded matrix, $A$, with bandwidth $c$ relative to the diagonal. In that case, factorizations such as the Cholesky can be computed in roughly $O(nc^2)$ rather than $O(n^3)$ operations, and the resulting factors are themselves banded.
Banded matrices come up in time series analysis. E.g., moving average (MA) models produce banded covariance structures because the covariance is zero after a certain number of lags.
Low rank updates (optional)
A transformation of the form $A - uv^\top$ is a rank-one update of $A$. The Sherman-Morrison formula expresses its inverse in terms of $A^{-1}$: $(A - uv^\top)^{-1} = A^{-1} + A^{-1}u(1 - v^\top A^{-1}u)^{-1}v^\top A^{-1}$.
More generally, a low rank update of $A$ uses $n \times k$ matrices $U$ and $V$ with $k \ll n$; the Sherman-Morrison-Woodbury formula gives $(A - UV^\top)^{-1} = A^{-1} + A^{-1}U(I_k - V^\top A^{-1}U)^{-1}V^\top A^{-1}$, so that beyond work with $A$ itself, only a small $k \times k$ system needs to be solved.
This also comes up in working with precision matrices in Bayesian problems, where we may already have $A^{-1}$ (or cheap solves with $A$) available and want to avoid factorizing the updated matrix from scratch.
Basically Sherman-Morrison-Woodbury gives us matrix identities that we can use in combination with our knowledge of smart ways of solving systems of equations.
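A numerical sanity check of the Woodbury identity as stated above (a sketch with arbitrary dimensions; in a real application one would already have $A^{-1}$, or cheap solves with $A$, available):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, k = 200, 3
Z = rng.normal(size=(n, n))
A = Z.T @ Z + 10 * n * np.eye(n)      # a well-conditioned positive definite matrix
U = rng.normal(size=(n, k))
V = rng.normal(size=(n, k))
b = rng.normal(size=n)

x_direct = np.linalg.solve(A - U @ V.T, b)          # solve with the rank-k update directly

Ainv_b = np.linalg.solve(A, b)                      # only solves with A ...
Ainv_U = np.linalg.solve(A, U)
small = np.eye(k) - V.T @ Ainv_U                    # ... and with a small k x k matrix
x_woodbury = Ainv_b + Ainv_U @ np.linalg.solve(small, V.T @ Ainv_b)

print(np.allclose(x_direct, x_woodbury))            # True
```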
7. Iterative solutions of linear systems (optional)
Gauss-Seidel
Suppose we want to iteratively solve $Ax = b$.
- Start with an initial approximation, $x^{(0)}$.
- Hold all but $x_1^{(0)}$ constant and solve to find $x_1^{(1)} = \frac{1}{A_{11}}\left(b_1 - \sum_{j=2}^n A_{1j}x_j^{(0)}\right)$.
- Repeat for the other rows of $A$ (i.e., the other elements of $x$), finding $x^{(1)}$.
- Now iterate to get $x^{(2)}$, $x^{(3)}$, etc. until a convergence criterion is achieved, such as $\|x^{(t)} - x^{(t-1)}\| \le \epsilon$ or $\|r^{(t)} - r^{(t-1)}\| \le \epsilon$ for the residual $r^{(t)} = b - Ax^{(t)}$.
Let's consider how many operations are involved in a single update: $O(n)$ per element, so $O(n^2)$ for one full sweep through all of the elements of $x$.
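A direct (unvectorized) sketch of those updates, assuming a matrix for which the iteration converges (here a strongly diagonally dominant one):

```python
import numpy as np

def gauss_seidel(A, b, x0, n_iter=100, tol=1e-10):
    x = x0.copy()
    n = len(b)
    for t in range(n_iter):
        x_old = x.copy()
        for i in range(n):
            # O(n) work per element, using the newest values of the other elements
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

n = 50
A = np.random.default_rng(1).normal(size=(n, n)) + 2 * n * np.eye(n)
b = np.ones(n)
print(np.allclose(gauss_seidel(A, b, np.zeros(n)), np.linalg.solve(A, b)))  # True
```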
If we decompose $A = L + D + U$, where $L$ is strictly lower triangular, $D$ is diagonal, and $U$ is strictly upper triangular, then Gauss-Seidel is equivalent to repeatedly solving $(L + D)x^{(t+1)} = b - Ux^{(t)}$.
It turns out that the rate of convergence depends on the spectral radius of $(L+D)^{-1}U$.
Gauss-Seidel amounts to optimizing by moving in axis-oriented directions, so it can be slow in some cases.
Conjugate gradient
For positive definite $A$, conjugate gradient (CG) reexpresses the solution to $Ax = b$ as an optimization problem, minimizing $f(x) = \frac{1}{2}x^\top Ax - x^\top b$, whose gradient, $Ax - b$, is zero at the solution.
Instead of finding the minimum by following the gradient at each step (so-called steepest descent, which can give slow convergence - we'll see a demonstration of this in the optimization unit), CG chooses directions that are mutually conjugate w.r.t. $A$: $d_i^\top A d_j = 0$ for $i \ne j$. The method successively chooses a direction, $d_k$, in which to move toward the minimum and a step size, $\alpha_k$, for how far to move.
The basic algorithm is:
- Choose $x_0$ and define the residual, $r_0 = b - Ax_0$ (the error on the scale of $b$), and the direction, $d_0 = r_0$, and set $k = 0$.
- Then iterate:
  - $\alpha_k = \frac{r_k^\top r_k}{d_k^\top A d_k}$ (choose step size so next error will be orthogonal to current direction - which we can express in terms of the residual, which is easily computable),
  - $x_{k+1} = x_k + \alpha_k d_k$ (update current value),
  - $r_{k+1} = r_k - \alpha_k A d_k$ (update current residual),
  - $d_{k+1} = r_{k+1} + \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k} d_k$ (choose next direction by conjugate Gram-Schmidt, starting with $r_{k+1}$ and removing components that are not $A$-orthogonal to previous directions, but it turns out that $r_{k+1}$ is already $A$-orthogonal to all but $d_k$).
- Stop when $\|r_{k+1}\|$ is sufficiently small.
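A direct transcription of the algorithm above into Python (a sketch; in practice one would use something like `scipy.sparse.linalg.cg` and exploit sparsity in the product $Ad$):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    x = x0.copy()
    r = b - A @ x                     # residual
    d = r.copy()                      # first direction
    for _ in range(len(b)):           # at most n steps in exact arithmetic
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)    # step size
        x = x + alpha * d             # update current value
        r_new = r - alpha * Ad        # update current residual
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + (r_new @ r_new) / (r @ r) * d   # next (A-conjugate) direction
        r = r_new
    return x

n = 100
Z = np.random.default_rng(1).normal(size=(n, n))
A = Z.T @ Z + n * np.eye(n)           # positive definite test matrix
b = np.ones(n)
print(np.allclose(conjugate_gradient(A, b, np.zeros(n)), np.linalg.solve(A, b)))  # True
```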
The convergence of the algorithm depends in a complicated way on the eigenvalues, but in general convergence is faster when the condition number is smaller (the eigenvalues are not too spread out). CG will in principle give the exact answer in $n$ steps (in exact arithmetic), but in practice we stop much earlier, once the solution is sufficiently accurate.
In general, CG is used for large sparse systems.
See the extensive description from Shewchuk for more details, as well as the use of CG when $A$ is not positive definite.
Updating a solution
Sometimes we have solved a system, $Ax = b$, and then need to solve $Ax = c$. If we have solved the initial system using a factorization, we can reuse that factorization and solve the new system with just the $O(n^2)$ triangular solves. Similarly, with an iterative method, the previous solution can serve as a good starting value when the new system is a small perturbation of the old one.