[TOC]
# Algorithms to Construct Models
| | Deci.Tree. | kNN | Poly.reg. | [Log.reg.][lr] | [SVM][svm] | [Perceptron][][^4] |
| ---------------------------------------- | --------------------------- | ------------------------------ | ------------------------- | ------------------------------------ | ---------------------------------------- | ------------------------- |
| Meant For… [^6] | Classifica. | Class. | Regress. | Class. | Class. | Class. |
| Param. To Prevent Overfit | `d` depth | `k` — # of NNs to see | `d` degree | "regulation" $\lambda$ on **weight** | "Tradeoff" $C$ on **training data** [^2] | - |
| Type | [Rule-Based Learning][rule] | [Instance-Based Learning][ibl] | [Regression Analysis][ra] | [Regression Analysis][ra] | [IBL][ibl] > [Kernel Method][km] | 0-1 Binary Classifier |
| Simplest Form | `d=1`: Deci.Stump | `k=1`: 1NN | `d=1`: lin.reg. | - | Linear SVM | - |
| Optimization Algorithm | Build Tree | Find k-th NN | Gradient Descent | Gradient Descent | Gradient Descent | Gradient Descent |
| Standardization Needed? / Model Is Scale Variant? | × | × | √ [^Exception] | √ | √ | × |
| Feature Mapping? | Perhaps Helpful | Not Useful | Perhaps Helpful | Perhaps Helpful | Required, Built-In and Core: **Kernel** | Perhaps Helpful |
| Supervised? | × | × | √ | √ | √ | √ |
| Decision Boundary Looks Like... | Stairs. Axis-parallel. | Voronoi cells. | Curve of degree `d`[^3] | Linear | Depends on Kernel | Linear (hyperplane) |
| Online/Batch Learning | Batch | Batch | Batch / Online | Batch / Online | Batch / Online | Batch / Online |
| Loss Function | 0-1 Loss[^5] | (No Training!) | sum of squared error | Log Loss ("Sigmoid") | Hinge Loss | Hinge Loss |
| Sub-Loss Func. | InfoGain / SplitInfo | Euclidean Distance | $y_i - h_\theta(x_i)$ | $y_i - h_\theta(x_i)$ | $y_i - h_\theta(x_i)$ | $y_i \cdot h_\theta(x_i)$ |
| Hypothesis Function (if any) $h_\theta(x_i)=$ [^7] | - | - | $\theta^\text{T}x_i$ | $g(\theta^\text{T}x_i)$ | $\theta^\text{T}x_i$ | $\theta^\text{T}x_i$ |
[ibl]: https://en.wikipedia.org/wiki/Instance-based_learning
[km]: https://en.wikipedia.org/wiki/Kernel_method
[svm]: https://en.wikipedia.org/wiki/Support_vector_machine
[Ra]: https://en.wikipedia.org/wiki/Instance-based_learning
[lr]: https://en.wikipedia.org/wiki/Logistic_regression
[rule]: https://en.wikipedia.org/wiki/Rule-based_machine_learning
[Perceptron]: https://en.wikipedia.org/wiki/Perceptron
[^Exception]: except un-regularized linear regression with closed form solution
[^2]: and those in the kernel, if any. e.g.: $\lambda$
[^3]: Not really "decision boundary"!
[^4]: Perceptron can be considered as a Linear SVM w/o margin (result-wise). Compared to Linreg: [here](https://stats.stackexchange.com/a/144003/78069).
[^5]: For each training example, let the tree perdict. If the perdiction is wrong, branch this leaf (1); if right, we do nothing (0).
[^6]: Of course they may be used for the other purpose too, just not so smoothly.
[^7]: In Regression, that's the $h_\theta$ — "hypothesis function with the current values of theta".
## Decision Trees
```python
def create_subtree:
if algorithm == "ID3" : calculate_score = calculate_infoGain
elif algorithm == "C4.5" : calculate_score = calculate_infoGain / calculate_splitInfo
# Main:
scores = {attribute: calculate_score(attribute,
attribute.all_possible_values)
for attribute in all attributes}
best_attribute = score.the_attribute_with_highest_score
return (best_attribute, {value: create_subtree(where best_attribute == value)
for value in best_attribute.all_possible_values})
```
## kNN
```python
return the label of the k-th nearest neighbor
```
# Formulae
## Entropy/Information
$$H(X)=-\sum _{i=1}^n P(X_i)\log_2P(X_i)$$
## Evaluation
### Accuracy
$$\text{accuracy} = \frac{\text{# of correct predictions}}{\text{# of test examples}}$$
### Error
$$\text{error} = 1-\text{accuracy} = \frac{\text{# of incorrect predictions}}{\text{# of test examples}}$$
### Precision
$$\text{precision} = \frac{\text{# of test examples predicted to be & labeled as +}}{\text{# of test examples predicted to be +}}$$
### Recall
$$\text{recall} = \frac{\text{# of test examples predicted to be & labeled as +}}{\text{# of test examples labeled to be +}}$$
## Bias-Variance Decomposition
###Expected Error
$$E\left[ \left( y-f(x) \right)^2 \right]=\text{Bias}_f^2+\text{Variance}_f+\text{Noise}$$
### Bias
$\mathrm {Bias} {\big [}{\hat {f}}(x){\big ]}=\mathrm {E} {\big [}{\hat {f}}(x)-f(x){\big ]}$
The error caused by the simplifying assumptions built into the method. / The error caused by using a simpler model to approximate data w/ a more complex trend.
- **Low Bias**: Suggests less assumptions about the form of the target function.
- **High-Bias**: Suggests more assumptions about the form of the target function.
### Variance
How much the model will move around its mean if we provided different set of training data.
- **Low Variance**: Suggests small changes to the estimate of the target function with changes to the training dataset.
- **High Variance**: Suggests large changes to the estimate of the target function with changes to the training dataset.
## VC Dimensions
If you can find a set of $n$ points, so that it can be shattered by the classifier (i.e. classify *all* possible $2n$ labelings correctly) and you **cannot** find **any** set of $n+1$ points that can be shattered (i.e. for any set of $n+1$ points there is at least one labeling order so that the classifier can not seperate all points correctly), then the VC dimension is $n$.
Example: A line can shatter 3 points.
![](https://i.stack.imgur.com/nN0qv.png)
#Characteristics of Decision Boundaries of Each ML Algorithm & Each Kernels
![](https://discourse-cdn-sjc1.com/business/uploads/analyticsvidhya/original/2X/b/bed3ff6eed958ecae0f319de80c0b31de3617d36.png)
- **Random Forest & AdaBoost w/ weak hypothesis == decision boundary:** Much alike, but Adaboost leaves certain blocks in the hypothesis space unable to be determined.
- **Logisitic Regression & Linear Regression & Linear SVM:** Gives linear decision boundaries.
- **Decision Tree:** Stairs. Axis-parallel.
- **Nearest Neighbor:** Voronoi cells.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Euclidean_Voronoi_diagram.svg/1200px-Euclidean_Voronoi_diagram.svg.png)
# Problem Setting of Regression Models
1. Load raw data file.
2. (Optional) Make more features using `mapFeatures()`.
3. Split data into training set and test set.
4. Separately **Standardize** two datasets.
5. Input — $d$ features of $n$ **training examples**: $X$.
6. Prepend a **column of $1$'s** to $X$.
7. Apply our **model** $h_\theta(x)$ — A model is what maps an example $x$ to a label $y$ (this process is called _perdiction_). _(This function itself is called the **activation function** of this model.)_
- Linear Regression uses **Linear Model**: $h_\theta (x)=\theta^\text{T}x$.
- Logistic Regression uses **Logistic Model**: $h_\theta (x)=\frac1{1+e^{-\theta^\text{T}x}}$.
- The *Logistic / Sigmoid Function* wraps over, and “replaces”, the "error function" $h_\theta(x)$.
- Perceptron: $h_\theta (x)=\text{sign}(\theta^\text{T}x)$
8. Use **gradient descent** — how much our $\theta$ have to change, in order to achieve lower cost.
1. Calculate the **gradient** of the *cost function* w.r.t. features $j=1,…,d+1$:
- For **Linear Reg**.: ==$\frac1{n}$==$\sum_{i=1}^n [h_\theta (x_i)-y_i]\cdot x_j$
- For **Logreg**: $\sum_{i=1}^n [h_\theta (x_i)-y_i]\cdot x_j$
BTW, it's the derivative of the **objective function** $J(\theta)$, sum of **cost functions** (errors) in training:
- For **Linear Reg**., squared errors: $J(\theta)=\sum_{i=1}^n \frac1{2n} (h_\theta (x_i)-y_i)^2$.
- For **Logreg**, individual error weighted with $x_i$: $J(\theta)=-\sum_{i=1}^n[y_i\log h_\theta (x_i)+(1-y_i)\cdot \log(1-h_\theta (x_i))]$.
- For **Perceptron** (under Batch Learning): $J(\theta)=\frac1n \sum_{i=1}^n\max(0,-y_i \theta^\text{T}x_i)$.
2. Add step control $\alpha$, and optionally add regularization $\lambda$:
- For **Linear Reg**.: $\nabla\equiv$ ==$\alpha$==$\frac1{n}\sum_{i=1}^n [h_\theta (x_i)-y_i]\cdot x_j$
- For **Logreg**: $\nabla\equiv$ ==$\alpha$== $\{\sum_{i=1}^n [h_\theta (x_i)-y_i]\cdot x_j$==$+\lambda\theta_j$==$\}$ (but no $\lambda$ if $j=1$)
3. Update model parameters $\theta$ with the grad.: $\theta \leftarrow \theta-\nabla$
- **Perceptron Rule:**
- **Online Learning**: $\theta \leftarrow \theta+ y_i\cdot x_i$ only upon misclassification.
- **Batch Learning**: $\theta \leftarrow\theta+\alpha\cdot\Delta$, where $\Delta=\sum y_i\cdot x_i$ that are misclassified.
4. Repeat from Step 1 till convergence (or max step count exceeded).
-----
5. To use the model, we simply calculate $h_\theta(x)$. Again,
- Linear Regression uses **Linear Model**: $h_\theta (x)=\theta^\text{T}x$.
- Logistic Regression uses **Logistic Model**: $h_\theta (x)=\frac1{1+e^{-\theta^\text{T}x}}$.
- Remember that the motivation of inventing Logreg is to get classifications instead of predictions (like linear reg gives). Therefore, a `round()` is needed.
- This is NOT to say that we cannot use linreg for prediction; it's just not meant for that.
# Parameters
## `C` in SVM
> The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For **large** values of **C**, the optimization will choose a **smaller-margin** hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.[^1]
[^1]: Source:
The SVM has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data which increases the bias but decreases the variance.
## $\lambda$
Regularization factor. Found in $\nabla\equiv\alpha\{\sum_{i=1}^n [h_\theta (x_i)-y_i]\cdot x_j+$==$\lambda$==$\theta_j\}$ (but no $\lambda$ if $j=1$).
Increasing $\lambda$, we can reduce variance but increase bias.
The regularization parameter $\lambda$ is a control on your fitting parameters. As the magnitues of the fitting parameters increase, there will be an increasing penalty on the cost function. **This penalty is dependent on the squares of the parameters as well as the magnitude of $\lambda$.** Also, notice that the summation after $λ$ does not include $\theta_{0}^{2}$.
Visually, increasing $λ$, see [this](http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex5/ex5.html).
## $k$
Number of neighbors to consider.
Used in kNN classifiers.
# Appendix
- Too little or too much training data could both cause overfitting.