Definition
(Taken from Prof. Sun, Peng’s slides)
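In standard form (the exact statement on the slides may differ): a matrix $A \in \mathbb{R}^{m \times n}$ with linearly independent columns can be factored as
$$A = QR,$$
where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns ($Q^T Q = I$) and $R \in \mathbb{R}^{n \times n}$ is upper triangular with positive diagonal entries.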


The geometric (orthonormal) properties of the columns of the Q matrix are useful, as we will see in the following example of least squares problems.
Example: Least Squares Problems
(Reference: S. Boyd and L. Vandenberghe, Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares, Chapters 10 & 12.)
Problem definition
Suppose that we have a matrix $A$ of shape $m \times n$ with $m > n$, so the system of linear equations $Ax = b$ (where $b$ is an $m$-vector) is over-determined. In other words, there are more equations ($m$) than variables ($n$).
Over-determined systems are very common in data-driven methods: the number of data points in a dataset (the rows of $A$, i.e. $m$) is usually much larger than the number of features (the columns, i.e. $n$).
For most over-determined systems, there is no $n$-vector $x$ such that $Ax = b$. However, as a compromise, we can look for an approximate solution $\hat{x}$ that minimizes the residual $r = Ax - b$. In the linear least squares problem, we minimize the squared norm of the residual $\|Ax - b\|^2$, that is:
$$\hat{x} = \arg\min_{x} \|Ax - b\|^2.$$
In other words, any vector $\hat{x}$ that satisfies $\|A\hat{x} - b\|^2 \le \|Ax - b\|^2$ for all $x$ is a solution of the least squares problem.
Algebraic Interpretation
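As a sketch of the standard algebraic view (notation mine): writing $\tilde{a}_i^T$ for the $i$-th row of $A$, the objective is just a sum of $m$ squared scalar residuals,
$$\|Ax - b\|^2 = \sum_{i=1}^{m} \left(\tilde{a}_i^T x - b_i\right)^2,$$
so the least squares solution makes the $m$ equations $\tilde{a}_i^T x \approx b_i$ hold as well as possible in aggregate, even though none of them needs to hold exactly.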

Geometric Interpretation
Consider the case $m = 3$, $n = 2$, and let $A$ (shape $3 \times 2$) be a matrix with columns $a_1$ and $a_2$. Here, we can see that the column space of $A$ is an (at most) two-dimensional subspace of $\mathbb{R}^3$, while the vector $b$ can be anywhere in $\mathbb{R}^3$. Therefore, it is not possible to find an exact solution in most cases. The approximate solution $\hat{x}$ gives the linear combination $A\hat{x}$ of the column vectors of $A$ that minimizes the norm of the residual vector $A\hat{x} - b$.

(Visualization: the residual vector $A\hat{x} - b$)
It is worth noting that the residual vector $A\hat{x} - b$ is orthogonal to the column space of $A$. Below is the algebraic justification for this:
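A sketch of the standard argument: since $\hat{x}$ minimizes the smooth convex function $\|Ax - b\|^2$, the gradient must vanish there,
$$\nabla_x \|Ax - b\|^2 \,\Big|_{x=\hat{x}} = 2A^T\!\left(A\hat{x} - b\right) = 0,$$
which says $a_j^T (A\hat{x} - b) = 0$ for every column $a_j$ of $A$, i.e. the residual is orthogonal to each column and hence to the whole column space.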

Solving Least Squares Problems
Here, we need to make the assumption that the column vectors of matrix A are linearly independent.
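A sketch of the standard derivation (linear independence of the columns guarantees that $A^T A$ is invertible): the optimality condition $A^T(A\hat{x} - b) = 0$ from the previous section gives the normal equations, which we can solve directly,
$$A^T A\,\hat{x} = A^T b \;\Longrightarrow\; \hat{x} = \left(A^T A\right)^{-1} A^T b.$$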


The result can be written using the pseudo-inverse notation $\hat{x} = A^{\dagger} b$, where $A^{\dagger} = (A^T A)^{-1} A^T$.
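As a quick numerical sanity check (my own NumPy sketch with a random over-determined system, not from the original post), the closed-form solution agrees with NumPy's built-in least squares and pseudo-inverse routines:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5                      # over-determined: more equations than unknowns
A = rng.standard_normal((m, n))    # a random A almost surely has independent columns
b = rng.standard_normal(m)

# Closed-form solution x_hat = (A^T A)^{-1} A^T b, computed without forming the inverse
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]   # library least squares
x_pinv = np.linalg.pinv(A) @ b                   # explicit pseudo-inverse

print(np.allclose(x_hat, x_lstsq), np.allclose(x_hat, x_pinv))  # True True
```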
Simplifying computation with QR factorization
Given that the column vectors of matrix $A$ are linearly independent, we can apply QR factorization to $A$, which gives us a simpler way to compute the pseudo-inverse. Moreover, the QR route avoids forming $A^T A$, whose condition number is the square of that of $A$, so it typically involves much smaller condition numbers than directly computing the pseudo-inverse from the normal equations.
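A sketch of the algebra, substituting $A = QR$ and using $Q^T Q = I$ with $R$ invertible:
$$\hat{x} = (A^T A)^{-1} A^T b = \left(R^T Q^T Q R\right)^{-1} R^T Q^T b = \left(R^T R\right)^{-1} R^T Q^T b = R^{-1} R^{-T} R^T Q^T b = R^{-1} Q^T b.$$
In practice we never form $R^{-1}$ explicitly: we compute $Q^T b$ and then solve the triangular system $R\hat{x} = Q^T b$ by back substitution.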


For a non-singular (square) matrix $A$, we have $A^{\dagger} = A^{-1} = R^{-1} Q^T$.
*Condition numbers: the condition number of a matrix is $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$, the ratio of its largest to smallest singular value; note that $\kappa(A^T A) = \kappa(A)^2$. Smaller condition numbers, better numerical stability!
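A small NumPy illustration of this point (my own example, not from the slides), comparing the conditioning of $A$, which the QR route works with, against that of $A^T A$, which the normal-equations route forms explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))

kappa_A = np.linalg.cond(A)          # condition number seen by the QR approach
kappa_AtA = np.linalg.cond(A.T @ A)  # condition number of the normal equations

print(f"cond(A)     = {kappa_A:.3e}")
print(f"cond(A^T A) = {kappa_AtA:.3e}")   # roughly cond(A) squared
```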
Gram-Schmidt Algorithm
Now we look at the Gram-Schmidt algorithm, which is what we use to actually compute the QR factorization.
Finding the $Q$ matrix is essentially finding an orthonormal basis of the column space of $A$, and the algorithm below exploits this orthonormality well. For $q_1$, we just take the unit vector that points in the same direction as the first column $a_1$. For each following column $a_i$, we compute an intermediate vector $\tilde{q}_i$ by subtracting from $a_i$ its projections onto all the orthonormal basis vectors we have already computed. This makes sure that the residue $\tilde{q}_i$ is orthogonal to all the existing basis vectors (i.e. it is the part of $a_i$ that cannot be expressed as a linear combination of the existing basis vectors). After this, we simply normalize $\tilde{q}_i$ to get the next orthonormal basis vector $q_i$.
For the $R$ matrix, we note that for the orthonormal basis vectors $q_1, \dots, q_i$, each column of $A$ decomposes as
$$a_i = (q_1^T a_i)\,q_1 + \dots + (q_{i-1}^T a_i)\,q_{i-1} + \|\tilde{q}_i\|\,q_i,$$
so the upper-triangular $R$ has entries $R_{ji} = q_j^T a_i$ for $j < i$ and $R_{ii} = \|\tilde{q}_i\|$. The result can be easily validated by carrying out the matrix multiplication $QR$!


(A simple visualization)
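For concreteness, here is a minimal NumPy sketch of the classical Gram-Schmidt procedure described above (my own implementation; it assumes the columns of A are linearly independent):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR factorization (assumes linearly independent columns)."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        q_tilde = A[:, i].astype(float)
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]    # R_ji = q_j^T a_i
            q_tilde -= R[j, i] * Q[:, j]   # remove the projection onto q_j
        R[i, i] = np.linalg.norm(q_tilde)  # R_ii = ||q_tilde||
        Q[:, i] = q_tilde / R[i, i]        # normalize to get q_i
    return Q, R

# Quick check: Q has orthonormal columns and Q @ R reproduces A
A = np.random.default_rng(2).standard_normal((6, 3))
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q.T @ Q, np.eye(3)), np.allclose(Q @ R, A))
```

In floating point, the modified Gram-Schmidt variant (projecting against the running residual $\tilde{q}_i$ instead of the original column) or a Householder-based routine such as np.linalg.qr is usually preferred for numerical stability.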
Some thoughts
Borrowing an idea from my professor, the Gram-Schmidt algorithm can be seen as a form of incremental learning: instead of computing the whole Q and R matrices at once, we build them up column by column, each step subtracting the projections onto what has already been computed.
This actually reminds me of two machine learning algorithms that use the same idea.
- The residual connections in ResNet
- Instead of learning the direct mapping H(x), residual blocks learn the residual F(x) = H(x) - x. This reformulation makes it easier to optimize identity mappings: the network can simply drive F(x) to zero rather than learning an identity function from scratch. (Generated by DeepSeek)
- The gradual denoising process in Diffusion Models