Training data includes desired labels.
- Linear Regression
- Logistic Regression
- Decision Trees and Random Forests
- Neural Networks
Unsupervised learning: training data is unlabeled.
- Hierarchical Cluster Analysis (HCA)
- Expectation Maximization
- Visualization and dimensionality reduction
- Principal Component Analysis (PCA)
- Kernel PCA
- Locally-Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
Semisupervised learning: a combination of unsupervised and supervised algorithms.
For example, Google Photos clusters similar faces (the unsupervised part); once you label one photo per person, it can name that person in all the other photos (the supervised part).
Reinforcement learning: an agent observes the environment, performs actions, and receives rewards or penalties; over time it learns the best strategy (policy) by itself (e.g., AlphaGo).
Batch and Online Learning
- Batch learning: train on the full dataset, launch into production, and run without learning anymore.
- Online learning: the system is trained incrementally by feeding it data instances sequentially, individually or in mini-batches (e.g., a stream of stock prices).
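A minimal online-learning sketch with scikit-learn's SGDRegressor and partial_fit; the random mini-batches just stand in for an incoming stream (batch size and features are arbitrary):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
model = SGDRegressor(random_state=42)

# Simulated stream of mini-batches standing in for, e.g., incoming stock prices.
true_w = np.array([1.0, 2.0, 0.5, -1.0, 0.3])
for _ in range(100):
    X_batch = rng.randn(32, 5)
    y_batch = X_batch @ true_w + 0.1 * rng.randn(32)
    model.partial_fit(X_batch, y_batch)   # incremental update, no retraining from scratch
```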
Instance-Based and Model-Based Learning
- Instance-based: compare new cases with the instances seen during training using a similarity measure (KNN, SVM, RBF).
- Model-based: build a model of the data, then use it to make predictions.
To fix overfitting:
- Simplify the model: linear instead of high-degree polynomial
- Regularization: L1, L2, penalize big weights (see the sketch after this list)
- Clean the training data (fix errors, remove outliers)
To fix underfitting:
- Select a more powerful model, with more parameters
- Feature engineering
- Reduce constraints (e.g., lower the regularization hyperparameters)
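A minimal regularization sketch with scikit-learn's Ridge (L2) and Lasso (L1); the data is synthetic and the alpha values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(42)
X = rng.randn(100, 10)
y = 3.0 * X[:, 0] + rng.randn(100)        # only the first feature actually matters

ridge = Ridge(alpha=1.0).fit(X, y)        # L2 penalty: shrinks all weights toward zero
lasso = Lasso(alpha=0.1).fit(X, y)        # L1 penalty: drives some weights exactly to zero
print(ridge.coef_)
print(lasso.coef_)
```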
Testing and Validating
```python
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
```
```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
```
```python
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
```
- Root Mean Square Error (RMSE)
- Mean Absolute Error (MAE): good when there are many outliers (less sensitive to them than RMSE)
- Accuracy: correct / samples
- Precision: true positives / predicted positives (e.g., spam flagging)
- Recall (true positive rate): true positives / real positives
- F1: harmonic mean of recall and precision
- Receiver Operating Characteristic (ROC): recall vs false positive rate (1-true negative rate)
- Confusion Matrix
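A minimal sketch of these metrics with scikit-learn; the labels are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical predictions from a binary classifier (e.g., spam vs. not spam).
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```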
Quick Look at the Data
- Jupyter Notebooks are great for this.
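A typical first-look sketch in a notebook, assuming the data is already loaded into a pandas DataFrame named df:

```python
import matplotlib.pyplot as plt

df.head()                            # first few rows
df.info()                            # column types and non-null counts
df.describe()                        # summary statistics of the numerical attributes
df.hist(bins=50, figsize=(20, 15))   # histogram of every numerical attribute
plt.show()
```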
Handle NULLs using:
- Remove the rows with NULLs
- Remove the whole column
- Set a value (zero, mean, median): df[col].fillna(df[col].median()), or use an Imputer
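A sketch of the three options, assuming a DataFrame df with a column col that has missing values (SimpleImputer is the newer scikit-learn name; older versions use sklearn.preprocessing.Imputer as in the pipeline below):

```python
df.dropna(subset=["col"])              # option 1: drop the rows with NULLs
df.drop("col", axis=1)                 # option 2: drop the whole column
df["col"].fillna(df["col"].median())   # option 3: fill with a value (here the median)

# Or let scikit-learn handle it:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
```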
Text and Categorical Attributes
- LabelBinarizer (text to integer and then to one-hot)
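A minimal LabelBinarizer sketch; the category strings are just sample values:

```python
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
cats = ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]   # sample category values
one_hot = encoder.fit_transform(cats)                  # text to integers to one-hot in one step
print(encoder.classes_)
print(one_hot)
```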
ML algorithms usually don’t perform well when the input numerical attributes have very different scales.
- Min-max scaling: MinMaxScaler; outliers are a problem
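A small sketch contrasting MinMaxScaler with StandardScaler on made-up data containing an outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # note the outlier

print(MinMaxScaler().fit_transform(X))    # everything squashed into [0, 1]; the outlier crushes the rest near 0
print(StandardScaler().fit_transform(X))  # zero mean, unit variance; no fixed range, less distorted by the outlier
```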
```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, StandardScaler, LabelBinarizer
# DataFrameSelector and CombinedAttributesAdder are custom transformers defined elsewhere in these notes.

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
```
- Grid Search (see the sketch after this list)
- Randomized Search
- Ensemble Methods
- Never tune looking at the test set
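A grid-search sketch in the style of the housing example; the parameter grid is arbitrary and housing_prepared / housing_labels are assumed to come from the preparation pipeline above:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
]
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)
```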
Some algorithms are natively multiclass, but you can also train N binary classifiers (one-versus-all) or one for each pair of classes (one-versus-one). Note: prefer one-versus-one for algorithms that don't scale well with the size of the training set (e.g., SVM). Sklearn does this automatically.
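A sketch forcing one-versus-one explicitly (Sklearn would otherwise pick a strategy by itself); the digits dataset is just a convenient built-in 10-class example:

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
ovo_clf = OneVsOneClassifier(LinearSVC(random_state=42))   # trains N*(N-1)/2 binary classifiers
ovo_clf.fit(X, y)
print(len(ovo_clf.estimators_))   # 45 classifiers for the 10 digit classes
```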
Multilabel classification: outputs multiple binary labels. Evaluate using the (weighted) average F1 score.
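A sketch of the weighted-average F1 evaluation on made-up multilabel data (each column is one binary label):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

print(f1_score(y_true, y_pred, average="weighted"))   # per-label F1, weighted by each label's support
```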
Multioutput classification: a generalization of multilabel classification where each label can be multiclass (more than two values).
- Bias: wrong assumptions (e.g., assuming the data is linear when it is actually quadratic). High bias == underfit
- Variance: sensitivity to small variations in the training data (many degrees of freedom). High variance == overfit
- Irreducible error: noisiness of the data
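For squared-error loss, these three terms add up to the expected generalization error:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$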
The more dimensions the training set has, the greater the risk of overfitting (high-dimensional datasets are very sparse, and distances between random points are larger).
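A quick Monte-Carlo sketch of the distance claim: the average distance between two random points in a unit hypercube grows with the number of dimensions (the dimension choices are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(42)
for d in (2, 3, 100, 10000):
    a = rng.rand(1000, d)
    b = rng.rand(1000, d)
    dist = np.linalg.norm(a - b, axis=1).mean()
    print(d, round(dist, 2))   # grows roughly like sqrt(d / 6)
```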
Manifold assumption: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. (If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.)