Gradient-Based Supervised Learning Machine (lecture note)

This framework covers neural nets and many other models.

Decision Rule

$y = F(W, X)$

where $F$ is some function and $W$ is a parameter vector.
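
The note keeps $F$ abstract. As one concrete instance, here is a minimal sketch in Python (NumPy) with a linear map standing in for $F$; the shapes and names are illustrative assumptions, not part of the note.

```python
import numpy as np

def F(W, X):
    """Decision rule y = F(W, X); here a linear map, the simplest example.

    Any differentiable parametric function (an MLP, a convnet, ...)
    could stand in for F.
    """
    return W @ X

# Illustrative shapes: 2 outputs, 3 inputs.
W = np.zeros((2, 3))           # parameter "vector" (here arranged as a matrix)
X = np.array([1.0, 2.0, 3.0])  # one input sample
y = F(W, X)                    # prediction, shape (2,)
```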

Loss function

$L(W, y^i, X^i) = D(y^i, F(W, X^i))$

where $D(y, f)$ measures the “discrepancy” between the desired output $y^i$ and the model’s prediction $f = F(W, X^i)$.
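
$D$ is likewise left abstract; a common concrete choice (an assumption here, not prescribed by the note) is the squared error, whose gradient with respect to the prediction is simple:

```python
import numpy as np

def D(y, f):
    """Squared-error discrepancy between target y and prediction f."""
    return 0.5 * np.sum((y - f) ** 2)

def dD_df(y, f):
    """Gradient of the squared error with respect to f: dD/df = f - y."""
    return f - y
```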

Gradient (by the chain rule, writing $f = F(W, X^i)$):

$\frac{\partial L(W, y^i, X^i)}{\partial W} = \frac{\partial D(y^i, f)}{\partial f} \frac{\partial F(W, X^i)}{\partial W}$
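
A minimal sketch of this chain rule, under the same assumptions as above (linear $F$, squared-error $D$); for this pair the product of the two factors reduces to an outer product:

```python
import numpy as np

def loss_grad(W, y_i, X_i):
    """dL/dW for F(W, X) = W @ X with squared-error D.

    Chain rule: dL/dW = (dD/df)(dF/dW) = outer(f - y_i, X_i).
    """
    f = W @ X_i               # forward pass: f = F(W, X_i)
    dD = f - y_i              # dD/df for squared error
    return np.outer(dD, X_i)  # dF/dW contributes X_i via the outer product
```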

Update rule

$W(t+1) = W(t) - \eta(t) \frac{\partial D(y^i, f)}{\partial f} \frac{\partial F(W, X^i)}{\partial W}$
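
Putting the pieces together: a minimal stochastic gradient descent loop implementing this update, again assuming the linear-model / squared-error example above. The $\eta_0 / (1 + t)$ schedule is just one illustrative choice of $\eta(t)$.

```python
import numpy as np

def sgd(W, data, eta0=0.1, epochs=10):
    """One parameter update per training sample (y_i, X_i)."""
    t = 0
    for _ in range(epochs):
        for y_i, X_i in data:
            eta = eta0 / (1.0 + t)          # decaying step size eta(t)
            f = W @ X_i                     # forward pass F(W, X_i)
            grad = np.outer(f - y_i, X_i)   # dL/dW by the chain rule
            W = W - eta * grad              # W(t+1) = W(t) - eta(t) dL/dW
            t += 1
    return W

# Hypothetical usage: recover W* = [1, -1] from noiseless toy data.
rng = np.random.default_rng(0)
data = [(np.array([x @ np.array([1.0, -1.0])]), x)
        for x in rng.normal(size=(100, 2))]
W = sgd(np.zeros((1, 2)), data)
```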

Three questions:

• What architecture $F(W, X)$?
• What loss function $L(W, y^i, X^i)$?
• What optimization method?