$y = f(\vec{x})$

  • Binary classification problems - $y$ is binary, for example y in [0,1] or y in [-1,1].
    • Examples
      • Spam detection (is this email a spam or not spam).
      • Face detection (given a small image, does it contain a face).
      • Diagnosis (input a list of symptoms, output be do they have COVID-19 or not).
  • Multiclass classification problems
    • Examples
      • Recognition problems (given image, identify which object is it)
      • Voice recognition (given acoustic signal, determine who is speaking)
      • OCR

Our focus is going to be binary classification, then generalize to multiclass classification.

Decision Boundary

For example given the following test points, we want to determine if a fruit is grape or apple.

The red line here is the decision boundary, it is the line machine learning algorithm created, will be used to classify future inputs.

  • Boundary in the feature space that we learn to predict the class.
  • Linearly separable (the example above is linear separable)
    • Data can be separated by hyperplane.
    • For example the graph below is not linearly separable.

Classification by Regression

Data: $\{\vec{x}_i, y_i\}_{i=1}^N, \vec{x}_i \in R^d, y_i \in \{-1, 1\}$

Squared Loss: $\vec{w}^* arg_{\vec{w}} min \sum_{i=1}^N (y_i - \vec{w}^T\vec{x}_i)^2$

Classifier: $sgn(\vec{w}^T\vec{x})$

$sgn(z)$ = -1 if z $\leq$ -1, 1 if z > 0

Signed distance from line $\vec{w}^T\vec{x} = 0$ is simply $\vec{x}^T\frac{\vec{w}}{||\vec{w}||}$ (x projected to the unit vector). So basically we will get positive on one side, and negative on another side.

Problem with linear regression is it spends too much effort trying to get $\vec{w}^T\vec{x}$ to be close to -1 or 1. But all we care about is the sign of $\vec{w}^T\vec{x}$.

It is solving the wrong problem!

For classifiers of form $sgn(f(x))$, a natural loss is in terms of $y \cdot f(x)$.

  • If $y \cdot f(x)$ is greater than 0, then we predicted correctly.
  • Otherwise we predicted wrongly since y and f(x) must have different signs.

The issue is with the marked part of the error function, it will have penalty even for correct predictions.

Why not use 0-1 loss for binary classification problem?

  • Can’t use gradient based optimization since the function is not smooth (gradient for all errors are zero and in origin the gradient is non-existent).

K-NN Classification

  • Find $k$ training points closest to test input, and use their outputs to determine our prediction.
  • Data $\{(\vec{x}_i, y_i)\}_{i=1}^N, y_i \in \{-1, 1\}$
  • Test points $\vec{x}^*$
  • $N_k(\vec{x}^*)$: set of k training points where input are closest to $\vec{x^*}$

The prediction is $sgn(\sum_{i\in N_k(x^*)} y_i)$ - voting of k nearest points.

$sgn(\sum_{i\in N_k(x^*)} w(\vec{x}_i)y_i)$

$w(\vec{x}_i) = e^{-||\vec{x}_i-\vec{x}^*||^{\frac{2}{2\sigma^2}}}$

The question is how do we determine the hyper parameters k and $\sigma^2$.

For this type of classification, we are basically saying if a test point is closer to a test points of one class than another, then we will predit the test point as that class.

Decision boundary is piecewise lienar and very expressive.

When $k>1$ the decision boundary is blurred as we allow more nearest neighbours to vote instead of trusting one.

Decision Tree

Tree structured sequence of decision rules.

For the most part we will use binary decision trees.

  • $m$ internal (split) nodes.
  • $m+1$ leaf (terminal) nodes.
  • At each split node $j$ define a split function $t_j(\vec{x}): R^d \to \{-1, 1\}$
    • If -1 we go left, if 1 we go right.
  • Each leaf node has categorical distribution over class label.
    • $P(y = c | node \ j)$
      • Let $N_j$ be the number of traning points that reach node $j$
      • $N_{j,c}$ number of points at node $j$ with class label c.
      • Then the probability is just $\frac{N_{j,c}}{N_j}$

Learning Decision Trees

Assume at node $j$ with $N_j$ data points that reaches $j$.

Consider split functions $t_j(\vec{x})$ based on a single feature.

  • $t_j(\vec{x}) = \vec{e}_l^T \vec{x} > \tau_j = x_l > \tau_j$

Assume $l$ is fixed (we know which feature to use).

If we have $N_j$ data points, then $N_j-1$ possible thresholds.

Then how do we choose the best threshold?

Goal: Partition data so all points to the left and to the right have minimal class uncertainty.

Entropy: Measure of uncertainty of distribution.

Given categorical distribution over $k$ outcomes with probability $P_c, c \in {1…k}$, then the entropy is $-\sum_{c=1}^K P_c log_2 P_c$

So our goal is to choose the threshold to minimize uncertainty in children. This is sometimes called information gain.

Information gain is the reduction of the uncertainty as a result of a split.

  • Assume for $D$ we have $N_j$ points at node $j$.
  • $D_L$ have $N_L$ points.
  • $D_R$ have $N_R$ points.

$IG(D, split) = H(D) - \frac{N_L}{N_j} H(D_L) - \frac{N_R}{N_j} H(D_R)$

The algorithm

  • At node for each $l$, for each threashold $\tau$ compute the information gain $IG(D_j, l, \tau)$$
  • Selection of split with $l, \tau$ that maximizes the information gain.
  • Stop when $H(D_j) = 0$.

Random Decision Forests

  • Build many trees with random subset of features.
    • Or randomly choose features at internal nodes.
  • Each tree gives a different distribution over class labels for an test input.
    • $P(y=c|\vec{x}^*, treeId)$
  • Use baggnig to combine the trees
    • Take the average of distributions over all trees.