$$f_{w,b}(x) = wx + b$$

$w,b$: parameters / coefficients / weights
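As a minimal sketch (assuming NumPy; the function name predict is mine, not from the notes), the model is just:

```python
import numpy as np

def predict(x, w, b):
    """Linear model f_{w,b}(x) = w*x + b; x may be a scalar or a NumPy array."""
    return w * x + b

# Example: w = 2, b = 1 maps x = [1, 2, 3] to [3, 5, 7]
print(predict(np.array([1, 2, 3]), w=2, b=1))
```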

Cost Function

The cost function most commonly used for linear regression is the squared error cost function:

$$J(w,b)=\frac{1}{2m} \sum_{i=1}^m (\hat{y}^{(i)}-y^{(i)})^2$$

where,

$\hat{y}^{(i)}-y^{(i)}$ is called the error;

$m$ is the size of the training set;

the extra factor of 2 in the denominator makes the later derivative calculations cleaner.

Replacing $\hat{y}^{(i)}$ with $f_{w,b}(x^{(i)})$ gives the equivalent form:

$$J(w,b)=\frac{1}{2m} \sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})^2$$
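A minimal sketch of this cost, assuming NumPy arrays x and y of equal length m (the name compute_cost is mine, not from the notes):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) = (1 / 2m) * sum((f_{w,b}(x_i) - y_i)^2)."""
    m = len(x)
    errors = (w * x + b) - y          # error for each training example
    return np.sum(errors ** 2) / (2 * m)

# Example: a perfect fit gives zero cost
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(compute_cost(x, y, w=2.0, b=1.0))  # 0.0
```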

Gradient Descent

An algorithm that can be used to try to minimize any function.

Have some function: $J(w,b)$

Want: $\min\limits_{w,b} J(w,b)$

Outline:

  • Start with some $ w,b $ (e.g., set $w = 0$, $b = 0$)

  • Keep changing $w,b$ to reduce $J(w,b)$

  • Until we settle at or near a minimum (there may be more than one minimum) ==> converge to a local minimum

    Simultaneous update: all parameters are updated at the same time (see the sketch below).
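A minimal sketch of one simultaneously updated step, assuming placeholder gradient functions dJ_dw and dJ_db (names are mine, not from the notes); the point is that both partial derivatives are evaluated at the old $(w,b)$ before either parameter changes:

```python
def gradient_step(w, b, alpha, dJ_dw, dJ_db):
    """One gradient descent step with simultaneous update of w and b."""
    tmp_w = w - alpha * dJ_dw(w, b)   # both gradients use the *old* (w, b)
    tmp_b = b - alpha * dJ_db(w, b)
    return tmp_w, tmp_b               # assign only after both are computed
```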

Gradient Descent Algorithm

$$w = w - \alpha \frac{\partial}{\partial w}J(w,b) \tag{1}$$

$$b = b - \alpha \frac{\partial}{\partial b}J(w,b) \tag{2}$$

where,

$\alpha$: the learning rate, usually between 0 and 1, which controls the step size of each parameter update;

if $\alpha$ is too small: gradient descent still works, but slowly;

if $\alpha$ is too large: it may fail to converge and may even diverge.

Therefore, it is useful to visualize how the cost function changes after each parameter update (a learning curve, sketched below).
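A minimal sketch of such a learning curve, assuming matplotlib and a list cost_history collected by evaluating $J(w,b)$ after each iteration (names are mine, not from the notes); the cost should decrease steadily if $\alpha$ is well chosen:

```python
import matplotlib.pyplot as plt

def plot_learning_curve(cost_history):
    """Plot J(w, b) recorded after each gradient descent iteration."""
    plt.plot(range(len(cost_history)), cost_history)
    plt.xlabel("iteration")
    plt.ylabel("J(w, b)")
    plt.title("Cost vs. iteration")
    plt.show()
```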

Linear Regression Gradient Descent Algorithm

$$\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b \\ J(w,b)=\frac{1}{2m} \sum_{i=1}^m (\hat{y}^{(i)}-y^{(i)})^2$$

Then,

$$\frac{\partial}{\partial w} J(w,b) = \frac{\partial}{\partial w} \frac{1}{2m} \sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})^2 \\ = \frac{\partial}{\partial w} \frac{1}{2m} \sum_{i=1}^m (wx^{(i)}+b-y^{(i)})^2 \\ = \frac{1}{2m}\sum_{i=1}^m 2\,(wx^{(i)}+b-y^{(i)})\,x^{(i)} \\ = \frac{1}{m}\sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})\,x^{(i)}$$

Substituting this back into the update rule for $w$ gives:

$$w = w - \alpha \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})\,x^{(i)}$$

Similarly, for $b$:

$$b = b - \alpha \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)})-y^{(i)})$$
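Putting the pieces together, here is a minimal gradient descent sketch for linear regression using the derivatives above (NumPy assumed; the names compute_gradient and gradient_descent are mine, not from the notes):

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Return dJ/dw and dJ/db for the squared error cost, averaged over all m examples."""
    m = len(x)
    errors = (w * x + b) - y
    dj_dw = np.sum(errors * x) / m
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.01, num_iters=1000):
    """Run gradient descent, updating w and b simultaneously each iteration."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Example: data generated from y = 2x + 1; the learned w, b should approach 2 and 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
print(gradient_descent(x, y, alpha=0.05, num_iters=5000))
```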

Batch Gradient Descent

Batch: each step of gradient descent uses all the training examples.
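The compute_gradient sketch above is "batch" in exactly this sense: its sums run over all $m$ training examples on every iteration, rather than over a single example or a smaller subset.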