- Regression
- Binary Classification
- Multi-class Classification
- Generation
- Supervised learning
- Unsupervised learning
- Meta learning
- Reinforcement learning
- Lifelong learning
# Linear Regression

*linear model*

$$

y=\sum w_ix_i

$$

*loss function*

$$

L=\sum (\hat{y}-y)^2

$$
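A minimal sketch of the linear model and its loss, in plain Python. It assumes that $$\hat{y}$$ is the target value and $$y$$ the model output, which matches how $$\hat{y}$$ is used as a label later in these notes. The toy data and the function names are made up for illustration.

```python
def predict(w, x):
    """Linear model: y = sum_i w_i * x_i."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def squared_error_loss(w, xs, targets):
    """L = sum over examples of (target - prediction)^2."""
    return sum((t - predict(w, x)) ** 2 for x, t in zip(xs, targets))


# Toy data (made up): the targets follow y = 2*x_1 + 1*x_2 exactly.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [2.0, 1.0, 3.0]

perfect = squared_error_loss([2.0, 1.0], xs, targets)  # weights that fit the data
worse = squared_error_loss([1.0, 1.0], xs, targets)    # weights that do not
```

Weights that reproduce the targets give a loss of zero; any other weights give a larger loss, which is what gradient descent tries to reduce.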

Overfitting:

A more complex model fits the training data more closely, but it does not always perform better on test data.

Regularization:

$$

L=\sum (\hat{y}-y)^2+\lambda\sum (w_i)^2

$$
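A small sketch of the regularized loss, assuming the penalty is added to the squared error so that large weights cost more. The name `lam` stands for $$\lambda$$; the data and weights are made up.

```python
def regularized_loss(w, xs, targets, lam):
    """Squared error plus an L2 penalty: sum (target - y)^2 + lam * sum w_i^2."""
    error = 0.0
    for x, t in zip(xs, targets):
        y = sum(w_i * x_i for w_i, x_i in zip(w, x))
        error += (t - y) ** 2
    penalty = lam * sum(w_i ** 2 for w_i in w)
    return error + penalty


xs = [[1.0, 0.0], [0.0, 1.0]]
targets = [2.0, 1.0]
w = [2.0, 1.0]  # fits the toy data exactly, so the squared error is zero

no_penalty = regularized_loss(w, xs, targets, lam=0.0)
with_penalty = regularized_loss(w, xs, targets, lam=0.1)
```

Even with a perfect fit, the penalized loss is positive, so minimizing it trades some fit for smaller weights.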

Regularization makes the function smoother, which reduces the effect of noise.

## Bias & Variance

Generally speaking, a more complex linear regression model has lower bias and higher variance, while a simpler model has higher bias and lower variance. As a result, a complex model tends to overfit and a simple model tends to underfit.

How to deal with overfitting?

1. More data

2. Regularization

## Gradient Descent Tips

- Reduce the learning rate every few epochs.

$$

\eta^t=\frac{\eta}{\sqrt{t+1}}

$$

- The learning rate cannot be one-size-fits-all.

Give different parameters different learning rates:

$$

w^{t+1}=w^{t}-\frac{\eta^t}{\sigma^t}g^t

$$

$$

g^t=\frac{\partial L(\theta^t)}{\partial w}

$$

$$

\sigma^t=\sqrt{\frac{1}{t+1}\sum_{i=0}^{t} (g^i)^2}

$$

- Stochastic Gradient Descent

Pick one example $$x^n$$, compute the loss on that example only, and update the parameters right away.

- Feature Scaling
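The tips above can be combined in one rough sketch. The per-parameter update with a decaying $$\eta^t$$ divided by $$\sigma^t$$ is the Adagrad rule, and the two $$\sqrt{t+1}$$ factors cancel, leaving a step of $$\eta$$ divided by the root of the summed squared gradients. Several things here are assumptions rather than part of the notes: standardization is used as one common form of feature scaling, a bias term `b` is added to the linear model, and the values of `eta`, `epochs`, and the toy data are arbitrary.

```python
import math
import random


def standardize(xs):
    """Feature scaling: shift each feature to mean 0 and scale it to std 1.

    Assumes no feature is constant (a zero std would divide by zero).
    """
    n, dim = len(xs), len(xs[0])
    means = [sum(x[i] for x in xs) / n for i in range(dim)]
    stds = [math.sqrt(sum((x[i] - means[i]) ** 2 for x in xs) / n) for i in range(dim)]
    return [[(x[i] - means[i]) / stds[i] for i in range(dim)] for x in xs]


def squared_loss(params, xs, targets):
    """Sum of squared errors; params holds the weights followed by the bias."""
    return sum((t - sum(p * v for p, v in zip(params, x + [1.0]))) ** 2
               for x, t in zip(xs, targets))


def adagrad_sgd(xs, targets, eta=0.5, epochs=200, seed=0):
    """Fit a linear model with per-example updates and Adagrad step sizes."""
    rng = random.Random(seed)
    dim = len(xs[0])
    params = [0.0] * (dim + 1)        # weights followed by the bias b
    grad_sq_sum = [0.0] * (dim + 1)   # running sum of squared gradients per parameter
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)            # stochastic: visit the examples in random order
        for n in order:
            x = xs[n] + [1.0]         # constant 1 so the bias updates like a weight
            y = sum(p * v for p, v in zip(params, x))
            # Gradient of (target - y)^2 for this single example.
            grads = [-2.0 * (targets[n] - y) * v for v in x]
            for i, g in enumerate(grads):
                grad_sq_sum[i] += g * g
                if grad_sq_sum[i] > 0.0:
                    # eta^t / sigma^t simplifies to eta / sqrt(sum of squared gradients)
                    params[i] -= eta / math.sqrt(grad_sq_sum[i]) * g
    return params


# Toy data (made up): the two features live on very different scales.
raw_xs = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0], [5.0, 400.0]]
targets = [6.0, 11.0, 13.0, 19.0, 21.0]

scaled = standardize(raw_xs)
initial_loss = squared_loss([0.0, 0.0, 0.0], scaled, targets)
final_loss = squared_loss(adagrad_sgd(scaled, targets), scaled, targets)
```

Scaling first keeps the two features comparable, so no single weight dominates the gradient; the per-parameter step sizes then adapt to whatever difference remains.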
# Classification

### Bayes formula

In binary classification, Bayes' formula is as follows:

$$

P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}

$$

If the posterior $$P(C_1|x)$$ is greater than 0.5, output class one; otherwise, output class two.

### Gaussian Distribution

The density function of the Gaussian distribution:

$$

f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)

$$

Input: a vector x of dimension D. Output: the probability density of sampling x. The shape of the function is determined by the mean μ and the covariance matrix Σ.

Suppose class one has the feature vectors $$x_1,x_2,x_3,\dots,x_n$$. We assume they are generated from the Gaussian whose parameters $$(\hat{\mu},\hat{\Sigma})$$ maximize the likelihood:

$$

L(\mu,\Sigma)= \prod f_{\mu,\Sigma}(x_i)

$$

$$

\hat{\mu},\hat{\Sigma}=\arg\max_{\mu,\Sigma}L(\mu,\Sigma)

$$

Result:

$$

\hat{\mu}=\frac{1}{n}\sum_{i=1}^{n} x_i

$$

$$

\hat{\Sigma}=\frac{1}{n}\sum_{i=1}^{n} (x_i-\hat{\mu})(x_i-\hat{\mu})^T

$$
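A sketch of the whole generative pipeline, restricted to the one-dimensional case so that the mean and variance are scalars and no matrix inverse is needed. Estimating the priors $$P(C_1)$$ and $$P(C_2)$$ from class frequencies is an assumption, and the sample values are made up.

```python
import math


def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of 1-D samples (divides by n, not n - 1)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_class1(x, class1, class2):
    """P(C_1 | x) from Bayes' formula, with priors taken from class frequencies."""
    mu1, var1 = fit_gaussian(class1)
    mu2, var2 = fit_gaussian(class2)
    prior1 = len(class1) / (len(class1) + len(class2))
    prior2 = 1.0 - prior1
    joint1 = gaussian_pdf(x, mu1, var1) * prior1
    joint2 = gaussian_pdf(x, mu2, var2) * prior2
    return joint1 / (joint1 + joint2)


class1 = [1.0, 2.0, 3.0]  # made-up 1-D features for class one
class2 = [7.0, 8.0, 9.0]  # made-up 1-D features for class two

mu1, var1 = fit_gaussian(class1)
p_near_1 = posterior_class1(2.0, class1, class2)  # close to the class-one mean
p_middle = posterior_class1(5.0, class1, class2)  # halfway between the two means
```

A point near the class-one mean gets a posterior close to 1, while a point halfway between two equally spread, equally frequent classes sits on the 0.5 decision boundary.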

Simplifying Bayes' formula:

$$

P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}=\frac{1}{1+\frac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)}}

$$

$$

z=\ln\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}

$$

Assuming both classes are Gaussian with a shared covariance matrix, $$z$$ is linear in $$x$$, so the final formula is as follows:

$$

z=wx+b

$$

$$

P(C_1|x)=\frac{1}{1+exp(-z)}=\sigma(z)

$$
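The step from the log-ratio to the linear form can be sketched as follows. It relies on an assumption that is not spelled out above: both class-conditional densities are Gaussian with their own means $$\mu_1$$, $$\mu_2$$ but one shared covariance matrix $$\Sigma$$. Splitting the logarithm gives:

$$

z=\ln\frac{f_{\mu_1,\Sigma}(x)}{f_{\mu_2,\Sigma}(x)}+\ln\frac{P(C_1)}{P(C_2)}

$$

The normalizing constants cancel because $$\Sigma$$ is shared, and so do the quadratic terms $$x^T\Sigma^{-1}x$$:

$$

\ln\frac{f_{\mu_1,\Sigma}(x)}{f_{\mu_2,\Sigma}(x)}=(\mu_1-\mu_2)^T\Sigma^{-1}x-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2

$$

Matching this against the linear form gives:

$$

w^T=(\mu_1-\mu_2)^T\Sigma^{-1},\qquad b=-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2+\ln\frac{P(C_1)}{P(C_2)}

$$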

The logistic regression model:

$$

f_{w,b}(x)=\sigma(\sum x_iw_i+b)

$$
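A direct transcription of this model into Python, as a sketch. The parameter values are hypothetical and chosen only to show the computation.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_regression(w, b, x):
    """f_{w,b}(x) = sigma(sum_i w_i * x_i + b), read as P(C_1 | x)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical parameters, not learned from any data.
w, b = [1.0, -1.0], 0.0
p_boundary = logistic_regression(w, b, [2.0, 2.0])  # z = 0, on the decision boundary
p_class1 = logistic_regression(w, b, [3.0, 1.0])    # z = 2, leans toward class one
```

An input with z = 0 gets probability 0.5; positive z pushes the output toward class one and negative z toward class two.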

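How do we do gradient descent for logistic regression?

- Training data:

| $$x^1$$ | $$x^2$$ | $$x^3$$ | … | $$x^N$$ |
|---|---|---|---|---|
| $$C_1$$ | $$C_1$$ | $$C_2$$ | … | $$C_1$$ |

Assume the data is generated based on $$f_{w,b}(x)=P(C_1|x)$$. The likelihood of $$w$$ and $$b$$ is then:

$$

L(w,b)=f_{w,b}(x^1)f_{w,b}(x^2)(1-f_{w,b}(x^3))\cdots f_{w,b}(x^N)

$$

$$

\hat{w},\hat{b}=\arg\max_{w,b}L(w,b)=\arg\min_{w,b}-\ln L(w,b)

$$

Let $$\hat{y}^n=1$$ if $$x^n$$ belongs to $$C_1$$ and $$\hat{y}^n=0$$ if it belongs to $$C_2$$. Then:

$$

-\ln L(w,b)=\sum_n -[\hat{y}^n\ln f_{w,b}(x^n)+(1-\hat{y}^n)\ln(1-f_{w,b}(x^n))]

$$

Each term of this loss is the cross entropy between two Bernoulli distributions: the label and the model's prediction. Its gradient with respect to the weight $$w_i$$ is:

$$

\frac{\partial (-\ln L(w,b))}{\partial w_i}=\sum_n -(\hat{y}^n-f_{w,b}(x^n))x_i^n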

$$
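The loss and its gradient can be sketched in a few lines of Python. Labels are encoded as 1 for $$C_1$$ and 0 for $$C_2$$. The update for `b` uses the same gradient with the input component replaced by 1; the learning rate `eta`, the number of steps, and the toy data are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(w, b, xs, labels):
    """-ln L(w, b): summed cross entropy, with label 1 for C_1 and 0 for C_2."""
    total = 0.0
    for x, y_hat in zip(xs, labels):
        f = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
        total -= y_hat * math.log(f) + (1 - y_hat) * math.log(1.0 - f)
    return total


def gradient_step(w, b, xs, labels, eta):
    """One full-batch update using d(-ln L)/dw_i = sum_n -(y_hat^n - f(x^n)) * x_i^n."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y_hat in zip(xs, labels):
        f = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
        diff = y_hat - f
        for i, x_i in enumerate(x):
            grad_w[i] -= diff * x_i
        grad_b -= diff  # same formula with x_i replaced by 1
    new_w = [w_i - eta * g for w_i, g in zip(w, grad_w)]
    return new_w, b - eta * grad_b


# Toy data (made up): small inputs belong to C_2, large inputs to C_1.
xs = [[0.0], [1.0], [3.0], [4.0]]
labels = [0, 0, 1, 1]

w, b = [0.0], 0.0
loss_before = cross_entropy(w, b, xs, labels)
for _ in range(200):
    w, b = gradient_step(w, b, xs, labels, eta=0.1)
loss_after = cross_entropy(w, b, xs, labels)
```

Note how the update is proportional to the gap between label and prediction: the further the output is from the label, the larger the step.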

Multi-class classification uses the softmax function:

$$

y_i=\frac{e^{z_i}}{\sum_j e^{z_j}}

$$
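A short sketch of softmax in Python. Subtracting the largest score before exponentiating is a common numerical safeguard added here as an assumption; it does not change the result. The scores are made up.

```python
import math


def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), with a max-shift for numerical stability."""
    shift = max(z)  # subtracting a constant from every score leaves y unchanged
    exps = [math.exp(z_i - shift) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([3.0, 1.0, -3.0])  # made-up scores for three classes
```

The outputs are positive and sum to 1, so they can be read as class probabilities, and the exponent widens the gap between large and small scores.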

Cascading logistic regression models gives a neural network.

Each layer of logistic regression units transforms its input into features for the next layer.
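A sketch of this idea on an XOR-style labeling, which a single logistic regression unit cannot separate because the classes are not linearly separable. Two units transform the input into new features, and a third unit classifies those features. The weights are hand-picked and hypothetical, not learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unit(w, b, x):
    """One logistic regression unit: sigma(sum_i w_i * x_i + b)."""
    return sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)


def two_layer(x):
    """Two units turn x into features h; a third unit classifies h."""
    h = [
        unit([20.0, 20.0], -10.0, x),   # close to 1 when x_1 or x_2 is 1
        unit([-20.0, -20.0], 30.0, x),  # close to 1 unless both are 1
    ]
    return unit([20.0, 20.0], -30.0, h)  # close to 1 only when both features are


outputs = {x: two_layer(list(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

In the feature space `h`, the two classes become linearly separable, which is why the final unit can do the job that a single unit on the raw input could not.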