Logistic Regression

Logistic regression is used to solve binary classification problems.

sigmoid function / logistic function:

$$g(z) = \frac{1}{1+e^{-z}}, \quad 0 < g(z) < 1$$

$$z = \vec{w} \cdot \vec{x} + b$$

logistic regression model:

$$f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$$

This output is interpreted as the probability that $y = 1$, namely $P(y=1 \mid \vec{x}; \vec{w}, b)$.
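A minimal NumPy sketch of the sigmoid and the model's prediction (the names `sigmoid` and `predict_proba` are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}); output is strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # f_{w,b}(x) = g(w . x + b): estimated probability that y = 1
    z = np.dot(w, x) + b
    return sigmoid(z)
```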

Cost Function

Logistic loss function:

$$L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = \begin{cases} -\log(f_{\vec{w},b}(\vec{x}^{(i)})), & \text{if } y^{(i)} = 1 \\ -\log(1 - f_{\vec{w},b}(\vec{x}^{(i)})), & \text{if } y^{(i)} = 0 \end{cases}$$

The further the prediction $f_{\vec{w},b}(\vec{x}^{(i)})$ is from the target $y^{(i)}$, the higher the loss.

Cost Function:

$$J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)})$$

This cost function is convex, which guarantees that gradient descent can converge to the parameter values that minimize $J(\vec{w},b)$.
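As a sketch, the piecewise loss above translates directly into code (using the natural log, consistent with the gradient formulas below); `logistic_loss` is an illustrative name:

```python
import numpy as np

def logistic_loss(f_i, y_i):
    # -log(f) if the label is 1, -log(1 - f) if the label is 0
    if y_i == 1:
        return -np.log(f_i)
    return -np.log(1.0 - f_i)
```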

Simplified Cost Function

Simplified loss function: the piecewise loss above can be written as a single expression:

$$L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)}\log(f_{\vec{w},b}(\vec{x}^{(i)})) - (1-y^{(i)})\log(1-f_{\vec{w},b}(\vec{x}^{(i)}))$$

Simplified Cost Function:

$$J(\vec{w},b) = \frac{1}{m} \sum_{i=1}^{m} L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)}\log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1-y^{(i)})\log(1-f_{\vec{w},b}(\vec{x}^{(i)})) \right]$$
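A hedged sketch of the simplified cost $J(\vec{w},b)$, vectorized over the $m$ training examples stored as rows of a matrix `X` (the names `X`, `y`, and `compute_cost` are assumptions for illustration; `sigmoid` is the helper defined above):

```python
def compute_cost(X, y, w, b):
    # J(w,b) = -(1/m) * sum[ y*log(f) + (1-y)*log(1-f) ]
    m = X.shape[0]
    f = sigmoid(X @ w + b)  # predictions for all m examples
    return -(1.0 / m) * np.sum(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))
```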

Gradient Descent

$$\begin{align*} &\text{repeat until convergence:} \; \lbrace \\ & \quad w_j = w_j - \alpha \frac{\partial J(\vec{w},b)}{\partial w_j} \\ & \quad b = b - \alpha \frac{\partial J(\vec{w},b)}{\partial b} \\ &\rbrace \end{align*}$$

$$\begin{align*} \frac{\partial J(\vec{w},b)}{\partial w_j} &= \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \\ \frac{\partial J(\vec{w},b)}{\partial b} &= \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) \end{align*}$$

$$\begin{align*} &\text{repeat until convergence:} \; \lbrace \\ & \quad w_j = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \right] \\ & \quad b = b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) \right] \\ &\rbrace \end{align*}$$
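A minimal sketch of this update loop, reusing the `sigmoid` helper above; the learning rate `alpha` and `num_iters` are illustrative hyperparameters, not prescribed values:

```python
def gradient_descent(X, y, w, b, alpha=0.01, num_iters=1000):
    # repeat: compute the gradients of J(w,b), then update w and b together
    m = X.shape[0]
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y      # (f_{w,b}(x^(i)) - y^(i)) for all i
        dj_dw = (X.T @ err) / m           # partial J / partial w_j, for every j
        dj_db = np.sum(err) / m           # partial J / partial b
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```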

Note that, in form, the resulting gradient descent algorithm is identical to that of linear regression.

However, the two define $f_{\vec{w},b}(\vec{x})$ completely differently, as shown below:

Linear Regression: $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$

Logistic Regression: $f_{\vec{w},b}(\vec{x}) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$