$f_{w,b}(x) = wx + b$
$w, b$: parameters (also called coefficients or weights)
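A minimal sketch of this model in Python (the function name `predict` and the use of NumPy are assumptions, not part of the notes):

```python
import numpy as np

def predict(x, w, b):
    """Linear model f_{w,b}(x) = w*x + b for a scalar or an array of inputs."""
    return w * x + b

# Example: with w = 2.0 and b = 1.0, an input of 3.0 predicts 7.0
print(predict(np.array([1.0, 3.0]), w=2.0, b=1.0))  # [3. 7.]
```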
Cost Function
The cost function most commonly used for linear regression is the squared error cost function:
$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
where,
$\hat{y}^{(i)} - y^{(i)}$ is called the error;
$m$ is the number of training examples;
the extra factor of 2 in the denominator makes the later derivative calculations cleaner.
Replacing $\hat{y}^{(i)}$ with $f_{w,b}(x^{(i)})$, this is equivalent to:
$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$
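A small sketch of this cost in Python (the helper name `compute_cost` is an assumption):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = (1/2m) * sum((f_{w,b}(x^(i)) - y^(i))^2)."""
    m = x.shape[0]
    errors = (w * x + b) - y          # f_{w,b}(x^(i)) - y^(i) for every example
    return np.sum(errors ** 2) / (2 * m)

# Example: a perfect fit gives zero cost
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([3.0, 5.0, 7.0])   # generated by y = 2x + 1
print(compute_cost(x_train, y_train, w=2.0, b=1.0))  # 0.0
```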
Gradient Descent
An algorithm that can be used to try to minimize any function.
Have some function: $J(w,b)$
Want: $\min_{w,b} J(w,b)$
Outline:
- Start with some $w, b$ (e.g. set $w = 0$, $b = 0$)
- Keep changing $w, b$ to reduce $J(w,b)$
- Until we settle at or near a minimum (there may be more than one minimum) ==> converge (to a local minimum)
Simultaneous Update: all parameters are updated at the same time
Gradient descent algorithm
$$w = w - \alpha \frac{\partial}{\partial w} J(w,b) \tag{1}$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w,b) \tag{2}$$
where,
$\alpha$: learning rate, usually between 0 and 1, controls the step size of each parameter update
if too small: gradient descent works, but slowly
if too large: it may fail to converge and may even diverge
Therefore, it helps to visualize how the cost function changes after each parameter update (a learning curve), as in the sketch below.
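A sketch of one simultaneous update step (the name `gradient_step` is an assumption; `dj_dw` and `dj_db` stand for the two partial derivatives, computed however the concrete model requires). Recording $J(w,b)$ after every such step gives the curve described above: a too-large $\alpha$ shows up as a cost that oscillates or grows, a too-small one as a cost that decreases very slowly.

```python
def gradient_step(w, b, dj_dw, dj_db, alpha):
    """One simultaneous update: both new values are computed from the old
    (w, b) before either parameter is overwritten."""
    tmp_w = w - alpha * dj_dw   # equation (1)
    tmp_b = b - alpha * dj_db   # equation (2)
    return tmp_w, tmp_b

# Example with made-up gradient values: alpha scales the step size
w, b = 0.0, 0.0
w, b = gradient_step(w, b, dj_dw=-4.0, dj_db=-2.0, alpha=0.1)
print(w, b)  # 0.4 0.2
```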
Linear Regression Gradient Descent Algorithm
$$\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$$
$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
Then,
$$\begin{aligned}
\frac{\partial}{\partial w} J(w,b) &= \frac{\partial}{\partial w} \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 \\
&= \frac{\partial}{\partial w} \frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right)^2 \\
&= \frac{1}{2m}\sum_{i=1}^{m} 2\left(wx^{(i)} + b - y^{(i)}\right)x^{(i)} \\
&= \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}
\end{aligned}$$
Substituting this in,
$$w - \alpha \frac{\partial}{\partial w} J(w,b) \;\Rightarrow\; w - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
Similarly,
$$b - \alpha \frac{\partial}{\partial b} J(w,b) \;\Rightarrow\; b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
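A minimal sketch putting these pieces together for linear regression (function names, the learning rate, and the fixed iteration count are assumptions; each iteration uses all $m$ training examples, which is the batch behavior described below):

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of J(w,b) for linear regression."""
    m = x.shape[0]
    errors = (w * x + b) - y                 # f_{w,b}(x^(i)) - y^(i)
    dj_dw = np.sum(errors * x) / m           # (1/m) * sum(error * x^(i))
    dj_db = np.sum(errors) / m               # (1/m) * sum(error)
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Run batch gradient descent, tracking the cost after every update."""
    cost_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(x, y, w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db        # simultaneous update
        cost_history.append(np.sum(((w * x + b) - y) ** 2) / (2 * x.shape[0]))
    return w, b, cost_history

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])     # generated by y = 2x + 1
w, b, history = gradient_descent(x_train, y_train, w=0.0, b=0.0,
                                 alpha=0.05, num_iters=2000)
print(w, b)  # approaches 2.0 and 1.0
```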
Batch Gradient Descent
Batch: Each step of gradient descent uses all the training examples.