Linear Algebra:
Quadratic Forms
Given a square matrix \(A \in \mathbb{R}^{n \times n}\) and a vector \(x \in \mathbb{R}^n\), the scalar \(x^T A x\) is called a quadratic form:
\[ x^T A x = \sum_{i=1}^n \sum_{j=1}^n A_{i,j} x_i x_j \]
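As a quick numerical check of this identity, here is a minimal NumPy sketch (the matrix and vector values are arbitrary):

```python
import numpy as np

# Check x^T A x against the explicit double sum over A_{i,j} x_i x_j.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

quadratic_form = x @ A @ x
double_sum = sum(A[i, j] * x[i] * x[j] for i in range(3) for j in range(3))
assert np.isclose(quadratic_form, double_sum)
```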
Cyclic Permutation Property
For matrices \(A, B, C\) such that \(ABC\) is square, we have:
\[ \text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB) \]
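A short NumPy check of the cyclic property, with block shapes chosen (arbitrarily) so that \(ABC\) is square:

```python
import numpy as np

# tr(ABC) = tr(BCA) = tr(CAB) whenever the products are defined and ABC is square.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
```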
Trace Trick
Using the cyclic permutation property we can derive the trace trick, which allows us to rewrite the scalar inner product \(x^T A x\) as follows:
\[ x^T A x = \text{tr}(x^T A x) = \text{tr}(x x^T A) \]
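This identity is easy to verify numerically as well; a minimal sketch with arbitrary values:

```python
import numpy as np

# x^T A x = tr(x^T A x) = tr(x x^T A), by the cyclic permutation property.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

assert np.isclose(x @ A @ x, np.trace(np.outer(x, x) @ A))
```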
Einstein Summation
This is a convenient notational convention for simplifying expressions involving tensors. The three rules are:
- Repeated indices are implicitly summed over
- Each index can appear at most twice in any term
- Each term must contain the same non-repeated (free) indices.
As a slightly complex example, suppose we have a tensor \(S_{ntk}\) where n indexes examples in a batch, t indexes locations in the sequence, and k indexes words in a one-hot representation. Let \(W_{kd}\) be an embedding matrix that maps sparse one-hot vectors in \(\mathbb{R}^k\) to dense vectors in \(\mathbb{R}^d\). We can convert the batch of sequences of one-hots to a batch of sequences of embeddings as follows: \[ E_{ntd} = \sum_k S_{ntk} W_{kd} \] We can compute the sum of the embedding vectors of each sequence (to get a bag of words) as follows:
\[ E_{nd} = \sum_k \sum_t S_{ntk}W_{kd} \] Finally, we can pass each sequence’s vector representation through another linear transformation \(V_{dc}\) to map to the logits over a classifier with c labels:
\[ L_{nc} = \sum_dE_{nd}V_{dc}=\sum_d \sum_k \sum_tS_{ntk}W_{kd}V_{dc} \] In Einstein notation, we write \(L_{nc}=S_{ntk}W_{kd}V_{dc}\). Note that k and d are summed over because those indices appear twice on the right hand side, while we sum over t because that index does not occur on the left hand side.
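In NumPy, this is exactly what `np.einsum` computes. A minimal sketch, with assumed (arbitrary) sizes for the batch, sequence length, vocabulary, embedding dimension, and number of labels:

```python
import numpy as np

# L_{nc} = S_{ntk} W_{kd} V_{dc}: t, k, d are summed because they do not
# appear in the output indices 'nc'.
N, T, K, D, C = 2, 5, 7, 3, 4              # batch, seq. length, vocab, embed dim, labels
rng = np.random.default_rng(0)
tokens = rng.integers(0, K, size=(N, T))   # word indices
S = np.eye(K)[tokens]                      # one-hot tensor S_{ntk}
W = rng.standard_normal((K, D))            # embedding matrix W_{kd}
V = rng.standard_normal((D, C))            # classifier weights V_{dc}

L = np.einsum('ntk,kd,dc->nc', S, W, V)

# Same result with the sums written out explicitly.
E = np.einsum('ntk,kd->ntd', S, W)         # embeddings E_{ntd}
L_explicit = E.sum(axis=1) @ V             # bag of words E_{nd}, then logits L_{nc}
assert np.allclose(L, L_explicit)
```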
Schur Complements
Consider a general partitioned matrix \[ M = \begin{pmatrix} E & F \\ G & H \end{pmatrix} \] where we assume \(E\) and \(H\) are invertible. We have \[ M^{-1} = \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ - H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{pmatrix} \] \[ = \begin{pmatrix} E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1} & - E^{-1} F (M/E)^{-1} \\ -(M/E)^{-1} G E^{-1} & (M/E)^{-1} \end{pmatrix} \] where \[ M/H := E - F H^{-1} G \] \[ M/E := H - G E^{-1} F \] We say that \(M/H\) is the Schur complement of \(M\) with respect to \(H\), while \(M/E\) is the Schur complement of \(M\) with respect to \(E\).
The proof can be found in Probabilistic Machine Learning: An Introduction by Kevin Murphy.
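A numerical sanity check of the partitioned-inverse formula (the block sizes are arbitrary, and the random blocks are assumed to be invertible, which holds almost surely):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
E = rng.standard_normal((p, p))
F = rng.standard_normal((p, q))
G = rng.standard_normal((q, p))
H = rng.standard_normal((q, q))
M = np.block([[E, F], [G, H]])

inv = np.linalg.inv
M_over_H = E - F @ inv(H) @ G    # Schur complement of M with respect to H

M_inv = np.block([
    [inv(M_over_H),               -inv(M_over_H) @ F @ inv(H)],
    [-inv(H) @ G @ inv(M_over_H),  inv(H) + inv(H) @ G @ inv(M_over_H) @ F @ inv(H)],
])
assert np.allclose(M_inv, inv(M))
```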
Hutchinson Trace Estimator
Sometimes it is difficult to evaluate the matrix \(A\) explicitly, but it may be easy to compute matrix-vector products \(Av\). Suppose \(v\) is a random vector such that \(\mathbb{E}(vv^T)=I\). In this case, we can create a Monte Carlo approximation to \(\text{tr}(A)\) using the following identity: \[ \text{tr}(A)=\text{tr}(A\mathbb{E}[vv^T])=\mathbb{E}[\text{tr}(Avv^T)]=\mathbb{E}[\text{tr}(v^TAv)]=\mathbb{E}[v^TAv] \]
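A minimal Monte Carlo sketch using Rademacher probe vectors (which satisfy \(\mathbb{E}[vv^T]=I\)); only matrix-vector products \(Av\) are needed, never \(A\) itself (the helper name below is just for illustration):

```python
import numpy as np

def hutchinson_trace(matvec, n, num_samples=1000, seed=None):
    """Estimate tr(A) given only a function v -> A v."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe: E[v v^T] = I
        estimates.append(v @ matvec(v))       # one sample of v^T A v
    return np.mean(estimates)

# Example: compare against the exact trace of an explicitly formed matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
print(hutchinson_trace(lambda v: A @ v, n=50, num_samples=5000, seed=1), np.trace(A))
```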
Probability Theory:
Invertible Transformations
Suppose \(g:\mathbb{R}^n \to \mathbb{R}^n\) is a differentiable bijection (for \(n=1\), a continuous bijection is equivalent to a strictly monotonic continuous function) and \(X\) is a random variable with density \(f_X\). If we are interested in calculating the probability density function \(f_Y(y)\) of the deterministic transformation \(Y=g(X)\), the change of variables formula tells us that
\[ f_Y(y) = f_X(g^{-1}(y))|\text{det}(J_{g^{-1}}(y))| \] where \(J_{g^{-1}}(y)\) is the Jacobian matrix of \(g^{-1}\).
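A quick one-dimensional check, assuming \(X \sim \mathcal{N}(0,1)\) and \(g(x)=e^x\), so that \(Y=g(X)\) should have a lognormal density:

```python
import numpy as np
from scipy import stats

# Y = exp(X) with X ~ N(0, 1); g^{-1}(y) = log(y) and |d g^{-1}/dy| = 1/y.
y = np.linspace(0.1, 5.0, 100)
f_Y_change_of_var = stats.norm.pdf(np.log(y)) / y
f_Y_lognormal = stats.lognorm.pdf(y, s=1.0)   # lognormal with sigma = 1

assert np.allclose(f_Y_change_of_var, f_Y_lognormal)
```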
Disjointification
Given any non-empty set \(\Omega\) and any sequence \(\{A_n\}_{n \in \mathbb{N}} \subset \mathcal{P}(\Omega)\), one can construct a pairwise disjoint sequence \(\{B_n\}_{n \in \mathbb{N}} \subset \mathcal{P}(\Omega)\) defined by \(B_1=A_1\) and for \(n \geq 2:\) \[ B_n = A_n \cap \left(\bigcup_{i=1}^{n-1}A_i\right)^c \] with the same union \[ \bigcup_{n \in \mathbb{N}}A_n = \dot{\bigcup}_{n\in \mathbb{N}}B_n \] where \(\dot{\bigcup}\) denotes a disjoint union. This lemma comes up frequently in probability and analysis, especially when proving countable sub-additivity or inclusion-exclusion.
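A small sketch of the construction on finite sets (the particular sets here are arbitrary illustrative values):

```python
# Disjointify: B_1 = A_1 and B_n = A_n minus everything seen in A_1, ..., A_{n-1}.
A = [{1, 2, 3}, {2, 3, 4, 5}, {5, 6}, {1, 6, 7}]

B, seen = [], set()
for A_n in A:
    B.append(A_n - seen)   # B_n = A_n ∩ (A_1 ∪ ... ∪ A_{n-1})^c
    seen |= A_n

assert all(B[i].isdisjoint(B[j]) for i in range(len(B)) for j in range(i))
assert set().union(*A) == set().union(*B)   # the unions agree
```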
Exponential Families
Recall that if \((\Omega,\mathcal{F},\nu)\) is a measure space and \(f\) is a non-negative Borel measurable function, then the set function \[ \lambda(A)=\int_{A}f\,d\nu, \ A\in \mathcal{F} \] is a measure on \((\Omega,\mathcal{F})\). Note that \(\nu(A)= 0\) implies \(\lambda(A)=0\). Whenever this implication holds for two measures \(\lambda\) and \(\nu\) defined on the same measurable space, we say \(\lambda\) is absolutely continuous with respect to \(\nu\), or that \(\nu\) dominates \(\lambda\).
A parametric family \(\{P_\theta:\theta\in\Theta\}\) dominated by a \(\sigma\)-finite measure \(\nu\) on \((\Omega,\mathcal{F})\) is called an exponential family if and only if: \[ \frac{dP_\theta}{d\nu}(\omega)=\exp\{[\eta(\theta)]^TT(\omega)-\xi(\theta)\}h(\omega), \ \omega \in \Omega \] where \(T\) is a random \(p\)-dimensional vector, \(\eta\) is a function from \(\Theta\to\mathbb{R}^p\), \(h\) is a non-negative Borel measurable function on \((\Omega, \mathcal{F})\), and \[ \xi(\theta)=\log\left\{\int_\Omega\exp\{[\eta(\theta)]^TT(\omega)\}h(\omega)d\nu(\omega)\right\} \] A less technical, but potentially more useful, characterization for checking whether a family is an exponential family is as follows:
A family of pdfs or pmfs is an exponential family if it can be expressed as: \[ f(x|\theta)= h(x)c(\theta)\exp\{\sum_{i=1}^kw_i(\theta)t_i(x)\} \] where \(h(x),c(\theta)\geq0\), the \(t_i\)’s are real-valued functions of the observation \(x\), and the \(w_i\)’s are real-valued functions of the parameter \(\theta\). Many common families such as the normal, gamma, beta, binomial, Poisson, and negative binomial are exponential families. The first representation given above turns out not to be unique. In fact, any transformation \(\tilde{\eta}(\theta)=D\eta(\theta)\) with a \(p\times p\) non-singular matrix \(D\) gives another representation (with \(T\) replaced by \(\tilde{T}=(D^T)^{-1}T\)).
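As a worked example, the Poisson(\(\theta\)) pmf fits this form directly: \[ f(x|\theta)=\frac{e^{-\theta}\theta^x}{x!}=\frac{1}{x!}\,e^{-\theta}\exp\{x\log\theta\} \] so we can take \(h(x)=1/x!\), \(c(\theta)=e^{-\theta}\), \(w_1(\theta)=\log\theta\), and \(t_1(x)=x\).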
\(T\)-step Transition Probabilities
Let \(\{X_0, X_1,\ldots\}\) be a Markov chain with state space \(S\), and let \((P)_{i,j}\) be an entry of its transition matrix. If we are interested in the probability of moving from state \(i\) to state \(j\) over \(T\) steps, then the \(T\)-step transition probabilities are given by \[ \mathbb{P}(X_T=j|X_0=i) = \mathbb{P}(X_{n+T}=j|X_n=i)= (P^T)_{ij} \] for all \(n \geq 0\), where \(P^T\) denotes the \(T\)-th power of the transition matrix.
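A small NumPy sketch with an assumed two-state transition matrix; note that \(P^T\) here is the \(T\)-th matrix power, not the transpose:

```python
import numpy as np

# Two-state chain: row i is the distribution of the next state given state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

T = 3
P_T = np.linalg.matrix_power(P, T)   # (P^T)_{ij} = P(X_T = j | X_0 = i)
print(P_T[0, 1])                     # probability of going from state 0 to 1 in 3 steps
```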