Mathematical Tips and Tricks

In mathematics and statistics, there are often useful ‘tips and tricks’ to know. I hope to document at least a few of these. This post will likely be forever a work in progress, as I come across new ‘tips and tricks.’

Author: Alex Yuan

Linear Algebra:

Quadratic Forms

Given a square matrix \(A \in \mathbb{R}^{n \times n}\) and a vector \(x \in \mathbb{R}^n\), the scalar \(x^T A x\) is called a quadratic form:

\[ x^T A x = \sum_{i=1}^n \sum_{j=1}^n A_{i,j} x_i x_j \]
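As a quick numerical sanity check, here is a small NumPy sketch (the matrix and vector are arbitrary) comparing the matrix expression with the explicit double sum:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

# Matrix form of the quadratic form
quad_matrix = x @ A @ x

# Explicit double sum from the definition
quad_sum = sum(A[i, j] * x[i] * x[j] for i in range(n) for j in range(n))

print(np.isclose(quad_matrix, quad_sum))  # True
```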

Cyclic Permutation Property

For matrices \(A, B, C\) such that \(ABC\) is square, we have:

\[ \text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB) \]
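A minimal NumPy sketch, with arbitrary shapes chosen so that the product is square, confirming that the three traces agree:

```python
import numpy as np

rng = np.random.default_rng(1)
# Shapes chosen so that ABC (and hence BCA and CAB) is square
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))

t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)

print(np.isclose(t1, t2) and np.isclose(t2, t3))  # True
```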

Trace Trick

Using the cyclic permutation property, we can derive the trace trick, which allows us to rewrite the scalar \(x^T A x\) as follows:

\[ x^T A x = \text{tr}(x^T A x) = \text{tr}(x x^T A) \]
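Again, a short NumPy check of the identity (the inputs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

lhs = x @ A @ x                      # the scalar x^T A x
rhs = np.trace(np.outer(x, x) @ A)   # tr(x x^T A)

print(np.isclose(lhs, rhs))  # True
```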

Einstein Summation

This is a nice notational convention for simplifying expressions involving tensors. The three rules are:

  1. Repeated indices are implicitly summed over.
  2. Each index can appear at most twice in any term.
  3. Each term must contain identical non-repeated indices.

As a slightly more complex example, suppose we have a tensor \(S_{ntk}\) where \(n\) indexes examples in a batch, \(t\) indexes locations in the sequence, and \(k\) indexes words in a one-hot representation. Let \(W_{kd}\) be an embedding matrix that maps sparse one-hot vectors in \(\mathbb{R}^k\) to dense vectors in \(\mathbb{R}^d\). We can convert the batch of sequences of one-hots to a batch of sequences of embeddings as follows: \[ E_{ntd} = \sum_k S_{ntk} W_{kd} \] We can compute the sum of the embedding vectors of each sequence (to get a bag of words) as follows:

\[ E_{nd} = \sum_k \sum_t S_{ntk} W_{kd} \] Finally, we can pass each sequence’s vector representation through another linear transformation \(V_{dc}\) to map to the logits over a classifier with \(c\) labels:

\[ L_{nc} = \sum_d E_{nd} V_{dc} = \sum_d \sum_k \sum_t S_{ntk} W_{kd} V_{dc} \] In Einstein notation, we write \(L_{nc}=S_{ntk}W_{kd}V_{dc}\). Note that \(k\) and \(d\) are summed over because those indices appear twice on the right-hand side, while \(t\) is also summed over because it does not appear on the left-hand side.
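This convention (sum over every index that does not appear in the output) is exactly how np.einsum behaves, so the whole computation can be written in one call. A sketch with arbitrary illustrative dimension sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, K, D, C = 2, 3, 7, 4, 5            # batch, sequence length, vocab, embed dim, classes

tokens = rng.integers(0, K, size=(N, T))  # random word indices
S = np.eye(K)[tokens]                     # one-hot encode: shape (N, T, K)
W = rng.normal(size=(K, D))               # embedding matrix
V = rng.normal(size=(D, C))               # classifier weights

# Step by step, as in the text
E_ntd = np.einsum('ntk,kd->ntd', S, W)    # embed each position
E_nd = E_ntd.sum(axis=1)                  # bag of words per sequence
L_explicit = E_nd @ V                     # logits

# One einsum implementing L_{nc} = S_{ntk} W_{kd} V_{dc}
L_einsum = np.einsum('ntk,kd,dc->nc', S, W, V)

print(np.allclose(L_explicit, L_einsum))  # True
```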

Schur Complements

Consider a general partitioned matrix \[ M = \begin{pmatrix} E & F \\ G & H \end{pmatrix} \] where we assume \(E\) and \(H\) are invertible. We have \[ M^{-1} = \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ - H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{pmatrix} \] \[ = \begin{pmatrix} E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1} & - E^{-1} F (M/E)^{-1} \\ -(M/E)^{-1} G E^{-1} & (M/E)^{-1} \end{pmatrix} \] where \[ M/H := E - F H^{-1} G \] \[ M/E := H - G E^{-1} F \] We say that \(M/H\) is the Schur complement of \(M\) with respect to \(H\), while \(M/E\) is the Schur complement of \(M\) with respect to \(E\).

The proof can be found in Probabilistic Machine Learning: An Introduction by Kevin Murphy.
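For a concrete check, here is a NumPy sketch (with an arbitrary well-conditioned matrix and illustrative block sizes) that assembles the first block formula and compares it against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 3, 2
M = rng.normal(size=(p + q, p + q)) + (p + q) * np.eye(p + q)  # keep M well-conditioned
E, F = M[:p, :p], M[:p, p:]
G, H = M[p:, :p], M[p:, p:]

Hinv = np.linalg.inv(H)
M_over_H = E - F @ Hinv @ G                 # Schur complement M/H
S_inv = np.linalg.inv(M_over_H)

# Assemble the block inverse from the first formula above
top = np.hstack([S_inv, -S_inv @ F @ Hinv])
bottom = np.hstack([-Hinv @ G @ S_inv, Hinv + Hinv @ G @ S_inv @ F @ Hinv])
M_inv_blocks = np.vstack([top, bottom])

print(np.allclose(M_inv_blocks, np.linalg.inv(M)))  # True
```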

Probability Theory:

Invertible Transformations

Suppose \(g:\mathbb{R}^n \to \mathbb{R}^n\) is a differentiable bijection with differentiable inverse (in one dimension, such a \(g\) is strictly monotonic) and \(X\) is a random variable with density \(f_X\). If we are interested in calculating the probability density function \(f_Y(y)\) of the deterministic transformation \(Y=g(X)\), the change of variables formula tells us that

\[ f_Y(y) = f_X(g^{-1}(y))|\text{det}(J_{g^{-1}}(y))| \] where \(J_{g^{-1}}(y)\) is the Jacobian matrix of \(g^{-1}\).
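As a one-dimensional example, take \(X \sim \mathcal{N}(0,1)\) and \(Y = e^X\), so \(g^{-1}(y) = \log y\) and \(|dg^{-1}/dy| = 1/y\); the formula then recovers the standard log-normal density. A short sketch using SciPy:

```python
import numpy as np
from scipy import stats

# X ~ N(0, 1), Y = g(X) = exp(X), so g^{-1}(y) = log(y) and |d g^{-1}/dy| = 1/y
y = np.linspace(0.1, 5.0, 50)
f_Y = stats.norm.pdf(np.log(y)) * (1.0 / y)   # change-of-variables formula

# Cross-check against the standard log-normal density
print(np.allclose(f_Y, stats.lognorm.pdf(y, s=1)))  # True
```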

\(T\)-step Transition Probabilities

Let \(\{X_0, X_1,\ldots\}\) be a time-homogeneous Markov chain on a state space \(S\), and let \((P)_{i,j}\) be an entry of its transition matrix. If we are interested in the probability of making it from state \(i\) to state \(j\) over \(T\) steps, then the \(T\)-step transition probabilities are given by \[ \mathbb{P}(X_T=j \mid X_0=i) = \mathbb{P}(X_{n+T}=j \mid X_n=i) = (P^T)_{ij} \] for all \(n\), where \(P^T\) denotes the \(T\)-th matrix power of \(P\).
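A small NumPy sketch (the transition matrix is an arbitrary illustrative example) computing a \(T\)-step probability as a matrix power and checking it against simulation:

```python
import numpy as np

# A small 3-state transition matrix (rows sum to 1); values are illustrative
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

T = 4
P_T = np.linalg.matrix_power(P, T)    # T-step transition matrix
print(P_T[0, 2])                      # probability of 0 -> 2 in exactly T steps

# Monte Carlo sanity check of the same quantity
rng = np.random.default_rng(5)
trials = 200_000
state = np.zeros(trials, dtype=int)   # all chains start in state 0
for _ in range(T):
    u = rng.random(trials)
    cdf = np.cumsum(P[state], axis=1)           # row of P for each chain's current state
    state = (u[:, None] > cdf).sum(axis=1)      # sample the next state in parallel
print((state == 2).mean())            # close to P_T[0, 2]
```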

Disjointification

Given any nonempty set \(\Omega\) and any sequence \(\{A_n\}_{n \in \mathbb{N}} \subset \mathcal{P}(\Omega)\), one can construct a pairwise disjoint sequence \(\{B_n\}_{n \in \mathbb{N}} \subset \mathcal{P}(\Omega)\) defined by \(B_1=A_1\) and, for \(n \geq 2\), \[ B_n = A_n \cap \Big(\bigcup_{i=1}^{n-1}A_i\Big)^c = A_n \setminus \bigcup_{i=1}^{n-1}A_i, \] with the same union \[ \bigcup_{n \in \mathbb{N}}A_n = \dot{\bigcup}_{n\in \mathbb{N}}B_n, \] where \(\dot{\bigcup}\) denotes a disjoint union. This lemma comes up frequently in probability and analysis, especially in the context of showing countable sub-additivity or inclusion-exclusion.
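A short Python sketch of the construction on a few illustrative finite sets:

```python
# Disjointify a finite list of sets: B_n = A_n \ (A_1 ∪ ... ∪ A_{n-1})
A = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 6}]

B = []
union_so_far = set()
for An in A:
    B.append(An - union_so_far)   # remove everything already covered
    union_so_far |= An

# The B_n are pairwise disjoint...
assert all(B[i].isdisjoint(B[j]) for i in range(len(B)) for j in range(i + 1, len(B)))
# ...and have the same union as the A_n
assert set().union(*A) == set().union(*B)
print(B)  # [{1, 2, 3}, {4}, {5}, {6}]
```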