Introduction to Linear Algebra & Multivariate Calculus
1.1 Why Multivariate Calculus in Machine Learning?
Multivariate calculus is fundamental to machine learning because:
- Most ML models have multiple parameters (high-dimensional spaces)
- Optimization requires understanding how functions change in multiple directions
- Neural networks rely heavily on gradient-based learning
- Concepts like gradients, Jacobians, and Hessians appear everywhere in ML
1.2 Scalars, Vectors, Matrices, and Tensors
- Scalar: Single number (0-dimensional)
- Vector: 1D array of numbers (magnitude and direction)
- Matrix: 2D array of numbers (linear transformations)
- Tensor: Generalized n-dimensional array
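As a concrete illustration, the short NumPy sketch below builds one object of each kind (the particular shapes and values are arbitrary choices for the example):

```python
import numpy as np

scalar = np.float64(3.5)            # 0-dimensional: a single number
vector = np.array([1.0, 2.0, 3.0])  # 1D array, shape (3,)
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])     # 2D array, shape (2, 2)
tensor = np.zeros((2, 3, 4))        # 3D array, shape (2, 3, 4)

for name, obj in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, "ndim =", np.ndim(obj), "shape =", np.shape(obj))
```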
Matrix Operations in Linear Algebra
Matrix Multiplication
For matrices A (m×n) and B (n×p), their product C = AB is an m×p matrix where:
$$ C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} $$
The number of columns in A must equal the number of rows in B.
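The sketch below spells out this definition for two small example matrices (the entries are arbitrary): the triple loop implements \( C_{ij} = \sum_k A_{ik} B_{kj} \) directly, and NumPy's `@` operator confirms the result.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3): m=2, n=3
B = np.array([[7.0, 8.0],
              [9.0, 10.0],
              [11.0, 12.0]])           # shape (3, 2): n=3, p=2

m, n = A.shape
n2, p = B.shape
assert n == n2, "columns of A must equal rows of B"

# Element-wise definition: C_ij = sum_k A_ik * B_kj
C = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]

print(np.allclose(C, A @ B))  # True: matches NumPy's built-in product
```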
Other Matrix Operations
- Transpose: Flip rows and columns
- Determinant: Scalar defined for square matrices; it measures how the matrix scales volume and is nonzero exactly when the matrix is invertible
- Inverse: Matrix that, when multiplied with the original, gives the identity matrix (it exists only when the determinant is nonzero)
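A minimal NumPy sketch of these operations, using an arbitrary invertible 2×2 matrix chosen for the example:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

print(A.T)                # transpose: rows and columns flipped
print(np.linalg.det(A))   # determinant: 4*6 - 7*2 = 10
A_inv = np.linalg.inv(A)  # inverse: only defined when det(A) != 0
print(np.allclose(A @ A_inv, np.eye(2)))  # A @ A^{-1} equals the identity
```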
Partial Derivatives in Multivariate Calculus
The partial derivative of a function \( f(x_1, x_2, \dots, x_n) \) with respect to \( x_i \) is:
$$ \frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h} $$
It measures how the function changes as we vary one variable while holding others constant.
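As an illustration, the sketch below approximates a partial derivative with the difference quotient from the definition, using the assumed example \( f(x, y) = x^2 y + y^3 \); analytically, \( \partial f / \partial x = 2xy \), which equals 12 at the point (2, 3).

```python
def f(x, y):
    return x**2 * y + y**3   # example function: f(x, y) = x^2 y + y^3

def partial_x(f, x, y, h=1e-6):
    # Forward-difference approximation of df/dx, holding y constant
    return (f(x + h, y) - f(x, y)) / h

print(partial_x(f, 2.0, 3.0))  # approx 12, since df/dx = 2xy
```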
The Gradient Vector
Definition of Gradient
The gradient of a scalar-valued function \( f(x_1, x_2, \dots, x_n) \) is a vector containing all its partial derivatives:
$$ \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right) $$
The gradient points in the direction of steepest ascent of the function; its negative points in the direction of steepest descent, which is the direction gradient descent follows.
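Continuing with the same assumed example \( f(x, y) = x^2 y + y^3 \), the sketch below approximates the gradient with central differences; the analytic gradient is \( (2xy,\; x^2 + 3y^2) \).

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 * y + y**3          # example scalar-valued function

def numerical_gradient(f, v, h=1e-6):
    # Central differences in each coordinate direction
    grad = np.zeros_like(v)
    for i in range(len(v)):
        step = np.zeros_like(v)
        step[i] = h
        grad[i] = (f(v + step) - f(v - step)) / (2 * h)
    return grad

v = np.array([2.0, 3.0])
print(numerical_gradient(f, v))     # approx [12., 31.] = (2xy, x^2 + 3y^2)
```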
The Jacobian Matrix
Functions from ℝⁿ to ℝᵐ
For a vector-valued function 𝐟: ℝⁿ → ℝᵐ with m component functions:
$$ \mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, \dots, x_n) \\ \vdots \\ f_m(x_1, \dots, x_n) \end{bmatrix} $$
The Jacobian matrix J is an m×n matrix of all first-order partial derivatives:
$$ J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} $$
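A sketch of the Jacobian for an assumed example map \( \mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^3 \), \( \mathbf{f}(x, y) = (x^2 y,\; 5x + \sin y,\; xy) \), built one column at a time with central differences; the analytic Jacobian has rows \( (2xy, x^2) \), \( (5, \cos y) \), and \( (y, x) \).

```python
import numpy as np

def f(v):
    x, y = v
    return np.array([x**2 * y, 5*x + np.sin(y), x * y])  # f: R^2 -> R^3

def numerical_jacobian(f, v, h=1e-6):
    # One column per input variable: J[:, i] = df/dx_i
    m = len(f(v))
    n = len(v)
    J = np.zeros((m, n))
    for i in range(n):
        step = np.zeros_like(v)
        step[i] = h
        J[:, i] = (f(v + step) - f(v - step)) / (2 * h)
    return J

v = np.array([1.0, 2.0])
print(numerical_jacobian(f, v))
# analytic at (1, 2): [[4, 1], [5, cos 2], [2, 1]]
```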
The Hessian Matrix
Second-Order Derivatives
The Hessian matrix of a function \( f(x_1, x_2, \dots, x_n) \) is a square matrix of second-order partial derivatives:
$$ H(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} $$
The Hessian provides information about the local curvature of the function; when the second partial derivatives are continuous, the order of differentiation does not matter and the Hessian is symmetric.
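A short sketch of the Hessian for the assumed example \( f(x, y) = x^3 + 2xy + y^2 \), computed symbolically with SymPy (assumed available); the exact Hessian is \( \begin{bmatrix} 6x & 2 \\ 2 & 2 \end{bmatrix} \).

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 + 2*x*y + y**2       # example scalar function

H = sp.hessian(f, (x, y))     # matrix of all second-order partial derivatives
print(H)                      # Matrix([[6*x, 2], [2, 2]])
print(H.subs({x: 1, y: 0}))   # local curvature at the point (1, 0)
```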
Optimization in Multivariate Settings
Gradient Descent
Gradient descent is an iterative optimization algorithm for finding local minima:
- Start at initial point \( \mathbf{x}_0 \)
- Update rule: \( \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k) \)
- Repeat until convergence
where \( \alpha \) is the learning rate.
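A minimal sketch of the update rule on the assumed quadratic \( f(x, y) = x^2 + 10y^2 \), whose minimum is at the origin; the learning rate and iteration count are arbitrary choices for the example.

```python
import numpy as np

def grad_f(v):
    x, y = v
    return np.array([2*x, 20*y])   # gradient of f(x, y) = x^2 + 10 y^2

x = np.array([5.0, 2.0])           # initial point x_0
alpha = 0.05                       # learning rate (assumed value)

for k in range(200):
    x = x - alpha * grad_f(x)      # x_{k+1} = x_k - alpha * grad f(x_k)

print(x)                           # close to the minimizer [0, 0]
```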
Applications in Machine Learning
Linear Regression: Gradient-Based Cost Minimization
For linear regression with hypothesis \( h_\theta(x) = \theta^T x \), the cost function is:
$$ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 $$
The gradient of the cost function with respect to \( \theta \) is:
$$ \nabla_\theta J(\theta) = \frac{1}{m} X^T (X\theta - y) $$
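The sketch below implements this gradient on a small synthetic dataset (the data-generating parameters, noise level, and learning rate are all assumptions made for the example) and minimizes \( J(\theta) \) with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])  # bias column + one feature
true_theta = np.array([2.0, -3.0])                        # assumed ground-truth parameters
y = X @ true_theta + 0.1 * rng.standard_normal(m)         # noisy targets

theta = np.zeros(2)
alpha = 0.5                                               # learning rate (assumed value)

for _ in range(500):
    grad = (X.T @ (X @ theta - y)) / m   # gradient of J(theta)
    theta = theta - alpha * grad         # gradient descent update

print(theta)  # close to [2.0, -3.0]
```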