# Sklearn notes

## Supervised Learning

Training data includes desired labels.

- KNN
- Linear Regression
- Logistic Regression
- SVM
- Decision Trees and Random Forests
- Neural networks

## Unsupervised Learning

- Clustering
  - k-Means
  - Hierarchical Cluster Analysis (HCA)
  - Expectation Maximization
- Visualization and dimensionality reduction
  - Principal Component Analysis (PCA)
  - Kernel PCA
  - Locally-Linear Embedding (LLE)
  - t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
  - Apriori
  - Eclat

## Semisupervised Learning

Combination of unsupervised and supervised algorithms.

For example, Google Photos clusters similar faces (unsupervised) and, once a few of them are labeled, names the same person in the rest of the photos (supervised).

## Reinforcement Learning

An agent observes the environment, performs actions, and is rewarded or penalized. It then learns the best strategy (policy) by itself. (AlphaGo)

## Batch and Online Learning

- Batch learning: train, launch into production and run without learning anymore.
- Online learning: the system is trained incrementally by feeding it data instances sequentially, one at a time or in small groups (mini-batches). E.g. stock prices
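
A minimal sketch of online learning, assuming scikit-learn's `SGDRegressor` and a synthetic data stream (the data, the batch size of 100, and the true line `y = 3x + 2` are made up for illustration). `partial_fit` updates the model one mini-batch at a time instead of retraining from scratch.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical stream: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(20000, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=20000)

model = SGDRegressor(random_state=42)

# Feed the data in mini-batches, as if they arrived over time.
batch_size = 100
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    model.partial_fit(X_batch, y_batch)  # incremental update, old batches are not revisited

print(model.coef_, model.intercept_)  # should end up close to 3 and 2
```

Each batch can be discarded after `partial_fit`, which is what makes this approach usable when the data does not fit in memory.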

## Instance-Based and Model-Based Learning

- Instance-based: compare new problem with instances seen in training and measure similarity. (KNN, SVM, RBF)
- Model-based: build a model and predict.
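
To make the difference concrete, here is a small sketch on toy data (the four training points are made up). A 1-nearest-neighbor regressor answers by looking up the closest stored instance, while a linear model generalizes from the fitted line.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy training set following y = 2x.
X_train = [[1], [2], [3], [4]]
y_train = [2, 4, 6, 8]

knn = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)  # instance-based
lin = LinearRegression().fit(X_train, y_train)                  # model-based

x_new = [[10]]
knn_pred = knn.predict(x_new)[0]  # label of the closest stored instance (x=4): 8.0
lin_pred = lin.predict(x_new)[0]  # extrapolates the fitted line: about 20.0

print(knn_pred, lin_pred)
```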

## Overfitting

- Simplify the model: linear instead of high-degree polynomial
- Regularization: L1, L2, penalize big weights
- Clean training data

## Underfitting

- Model with more parameters
- Feature engineering
- Reduce constraints (regularization hyperparameters)

## Testing and Validating

```
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
```

```
from sklearn.model_selection import StratifiedShuffleSplit
# Stratify on income_cat so train and test keep the same category proportions.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # split() yields positional indices, so index with iloc.
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]
```

```
import numpy as np
from sklearn.model_selection import cross_val_score
# Scorers follow "greater is better", so the MSE comes back negated.
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
```

## Performance Measure

- Regression:
  - Root Mean Square Error (RMSE)
  - Mean Absolute Error (MAE): good if many outliers
- Classification:
  - Accuracy: correct / samples
  - Precision: true positives / predicted positives (eg: spam flagging)
  - Recall (true positive rate): true positives / real positives
  - F1: harmonic mean of recall and precision
  - Receiver Operating Characteristic (ROC): recall vs false positive rate (1 - true negative rate)
  - Confusion Matrix
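
The measures above can be checked on a tiny hand-made example (all labels and predictions here are invented so the numbers are easy to verify by hand). The regression part has one large error to show why RMSE reacts more strongly to outliers than MAE.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, recall_score)

# Classification: 4 real positives, 6 real negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # TP=3, FN=1, FP=2, TN=4

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
accuracy = accuracy_score(y_true, y_pred)    # (3 + 4) / 10 = 0.7
precision = precision_score(y_true, y_pred)  # 3 / (3 + 2) = 0.6
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # 2 * 0.6 * 0.75 / (0.6 + 0.75)

# Regression: three errors of size 1 and one outlier error of size 10.
y_true_reg = [10, 12, 14, 16]
y_pred_reg = [11, 11, 15, 26]
mae = mean_absolute_error(y_true_reg, y_pred_reg)           # 13 / 4 = 3.25
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # sqrt(103 / 4), about 5.07

print(cm, accuracy, precision, recall, f1, mae, rmse)
```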

## Quick Look at the Data

- Jupyter Notebooks are great for this: `df.head()`, `df.info()`, `df.describe()`, `df[col].value_counts()`
- Correlation: `df.corr()`

## Data Cleaning

Handle NULLs using:

- Remove lines: `df.dropna(subset=[col])`
- Remove column: `df.drop(col, axis=1)`
- Set a value (zero, mean, median): `df[col].fillna(df[col].median())`, or use an `Imputer` (`SimpleImputer` from `sklearn.impute` in scikit-learn 0.20 and later).
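
A small sketch of the imputer option, assuming scikit-learn 0.20 or later, where the class is `sklearn.impute.SimpleImputer`. The DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "age": [10.0, 20.0, np.nan, 40.0],
})

imputer = SimpleImputer(strategy="median")
# fit() learns one median per column, transform() fills the gaps with it.
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputer.statistics_)  # learned medians: 4.0 for rooms, 20.0 for age
print(filled)
```

The advantage over `fillna` is that the fitted imputer remembers the training medians, so the same values can be reused on the test set and on new data.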

## Text and Categorical Attributes

- LabelEncoder
- OneHotEncoder
- LabelBinarizer (text to integer and then to one-hot)
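
A sketch comparing the three encoders on a toy text column (the category values are made up; string input to `OneHotEncoder` assumes scikit-learn 0.20 or later).

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

cats = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]

# Text to integer codes; classes are sorted alphabetically.
codes = LabelEncoder().fit_transform(cats)  # [0, 2, 0, 1]

# One binary column per category. OneHotEncoder expects a 2D input
# and returns a sparse matrix by default, hence toarray().
one_hot = OneHotEncoder().fit_transform(np.array(cats).reshape(-1, 1)).toarray()

# Text straight to one-hot in a single step.
binarized = LabelBinarizer().fit_transform(cats)

print(codes)
print(one_hot)
print(binarized)
```

Integer codes imply an order between categories that usually does not exist, which is why the one-hot form tends to be the safer input for most models.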

## Feature scaling

ML algorithms usually don’t perform well when the input numerical attributes have very different scales.

- Min-max scaling: `MinMaxScaler`. Outliers are a problem.
- Standardization: `StandardScaler`.
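
A sketch of both scalers on a made-up column with a single outlier. Min-max scaling maps everything into [0, 1], so the outlier squeezes the regular values near 0. Standardization gives zero mean and unit variance with no fixed range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature: four regular values and one outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X)    # (x - min) / (max - min)
standard = StandardScaler().fit_transform(X)  # (x - mean) / std

print(minmax.ravel())    # the first four values all land below 0.05
print(standard.ravel())  # mean 0, standard deviation 1
```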

## Pipeline

```
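from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import Imputer, LabelBinarizer, StandardScaler
# Written against an old scikit-learn: Imputer was replaced by
# sklearn.impute.SimpleImputer in 0.20.
# DataFrameSelector and CombinedAttributesAdder are custom transformers
# defined elsewhere; housing_num holds the numeric columns of housing.

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# Numeric columns: select, fill missing values, add derived attributes, scale.
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

# Categorical columns: select, then one-hot encode.
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

# Run both pipelines and concatenate their outputs column-wise.
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])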
```

## Fine tune

- Grid Search
- Randomized Search
- Ensemble Methods
- Never tune looking at the test set
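
A sketch of grid and randomized search, assuming the bundled iris dataset and a k-NN classifier as stand-ins (the parameter grid is an arbitrary example). The test set is split off first and only touched once, after tuning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# Grid search: tries all 4 * 2 = 8 combinations with 5-fold cross-validation.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Randomized search: samples only n_iter combinations from the same space.
rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid, n_iter=4,
                          cv=5, random_state=42)
rand.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)

# Final check on the held-out test set, done once after tuning.
test_accuracy = grid.score(X_test, y_test)
print(test_accuracy)
```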

## Multiclass

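Some algorithms handle multiple classes natively, but you can also train **n** binary classifiers (one-versus-all) or one classifier per pair of classes (one-versus-one). Note: prefer one-versus-one for algorithms that scale poorly with the size of the training set (SVM), since each classifier only trains on the two classes it separates. Sklearn does this automatically.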
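
A sketch that forces each strategy explicitly, assuming synthetic 4-class data from `make_blobs` and a linear `SVC` as the binary classifier. With n classes, one-versus-all trains n classifiers and one-versus-one trains n(n-1)/2.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Hypothetical dataset with 4 classes.
X, y = make_blobs(n_samples=200, centers=4, random_state=42)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one classifier per class
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one classifier per pair of classes

n_ovr = len(ovr.estimators_)  # 4
n_ovo = len(ovo.estimators_)  # 4 * 3 / 2 = 6
print(n_ovr, n_ovo)
```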

## Multilabel

Outputs multiple binary labels. Evaluate using the (weighted) average F1 score.
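
A sketch of multilabel classification, assuming the bundled digits dataset and two invented binary labels per image ("digit is 7 or larger" and "digit is odd"). `KNeighborsClassifier` accepts a 2D label array directly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, digits = load_digits(return_X_y=True)

# Two binary labels per instance: [is_large, is_odd].
y_multilabel = np.c_[digits >= 7, digits % 2 == 1].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_multilabel, test_size=0.2, random_state=42)

knn = KNeighborsClassifier().fit(X_train, y_train)
y_pred = knn.predict(X_test)  # one row per instance, one column per label

# "macro" averages the per-label F1 scores equally;
# "weighted" weights each label by its number of positive instances.
macro_f1 = f1_score(y_test, y_pred, average="macro")
weighted_f1 = f1_score(y_test, y_pred, average="weighted")
print(macro_f1, weighted_f1)
```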

## Multioutput

Generalization of multilabel where each label can be multiclass (more than 2 values).

## Bias/Variance Tradeoff

- Bias: wrong assumptions (assuming the data is linear when it is quadratic). High bias == underfit
- Variance: sensitivity to small variations in the training data (many degrees of freedom). High variance == overfit
- Irreducible error: noisiness of the data
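
A sketch of the tradeoff on synthetic quadratic data (the data generator and the degrees 1, 2, and 15 are illustrative assumptions). Comparing training and validation error per degree shows the high-bias case clearly. The degree-15 model usually has the lowest training error but does not improve on degree 2 at validation time.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)

def make_data(n):
    """Hypothetical quadratic data with Gaussian noise (the irreducible error)."""
    X = rng.uniform(-3, 3, size=(n, 1))
    y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.5, size=n)
    return X, y

X_train, y_train = make_data(30)
X_val, y_val = make_data(200)

train_mse, val_mse = {}, {}
for degree in (1, 2, 15):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    val_mse[degree] = mean_squared_error(y_val, model.predict(X_val))

for degree in (1, 2, 15):
    print(degree, train_mse[degree], val_mse[degree])
```

Degree 1 is wrong on both sets (bias). A large gap between training and validation error for degree 15 is the signature of variance.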

## Dimensionality Reduction

The more dimensions the training set has, the greater the risk of overfitting (high-dimensional datasets are very sparse; the distance between random points is larger).
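
The sparsity claim can be checked numerically. This sketch draws random pairs of points in a unit hypercube and estimates their mean distance (the dimensions 2 and 1000 and the sample size are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_distance(n_dims, n_pairs=2000):
    """Estimate the mean Euclidean distance between two random points in the unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

dist_2d = mean_distance(2)        # roughly 0.52
dist_1000d = mean_distance(1000)  # roughly 13

print(dist_2d, dist_1000d)
```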

Manifold assumption: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. (If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.)

- PCA
- LLE
- MDS
- Isomap
- t-SNE
- LDA
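
A minimal PCA sketch, assuming the bundled digits dataset (64 pixel features) as a stand-in. Passing a float to `n_components` asks for the smallest number of components that preserves that fraction of the variance.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # shape (1797, 64)

# Keep enough principal components to preserve 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Project back to the original space (lossy reconstruction).
X_recovered = pca.inverse_transform(X_reduced)

print(X.shape, X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```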
