Sklearn notes
Supervised Learning
Training data includes desired labels.
- KNN
- Linear Regression
- Logistic Regression
- SVM
- Decision Trees and Random Forests
- Neural networks
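A minimal supervised-learning sketch (my own example, not from the notes): fit a KNN classifier on the built-in iris dataset, where the training data includes the desired labels.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Labels are supplied at training time -- that is what makes it supervised.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out test set
```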
Unsupervised Learning
- Clustering
- k-Means
- Hierarchical Cluster Analysis (HCA)
- Expectation Maximization
- Visualization and dimensionality reduction
- Principal Component Analysis (PCA)
- Kernel PCA
- Locally-Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
- Apriori
- Eclat
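An unsupervised sketch with made-up data: k-means finds cluster structure without any labels. The two Gaussian blobs are an assumption of this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs, no labels given to the algorithm.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)  # one center near (0, 0), one near (8, 8)
print(kmeans.labels_[:5])       # cluster assignment per sample
```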
Semisupervised Learning
Combination of unsupervised and supervised algorithms.
For example, Google Photos clusters similar faces (unsupervised); once you label one photo per person, it can name that person in the rest.
Reinforcement Learning
An agent observes the environment, performs actions, and is rewarded or penalized; it then learns the best strategy (policy) by itself. (AlphaGo)
Batch and Online Learning
- Batch learning: train, launch into production and run without learning anymore.
- Online learning: system is trained incrementally by feeding it data instances sequentially (mini-batches). Eg: stock prices
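An online-learning sketch (synthetic streaming data is my assumption): `SGDRegressor` supports `partial_fit`, so the model is updated mini-batch by mini-batch instead of being retrained from scratch.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)

true_w = np.array([1.0, 2.0, 3.0])  # hypothetical true coefficients
for _ in range(300):                 # simulate a stream of mini-batches
    X_batch = rng.rand(32, 3)
    y_batch = X_batch @ true_w
    model.partial_fit(X_batch, y_batch)  # incremental update, no full retrain

print(model.coef_)  # drifts toward [1, 2, 3] as more batches arrive
```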
Instance-Based and Model-Based Learning
- Instance-based: compare new problem with instances seen in training and measure similarity. (KNN, SVM, RBF)
- Model-based: build a model and predict.
Overfitting
- Simplify the model: linear instead of high-degree polynomial
- Regularization: L1, L2, penalize big weights
- Clean training data
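A regularization sketch on synthetic data (the dataset and alpha value are my assumptions): Ridge applies an L2 penalty, shrinking the weights compared with plain linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(42)
X = rng.rand(100, 5)
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 0.1, 100)

lin = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # larger alpha -> stronger penalty

# The L2 penalty keeps the ridge weights smaller overall.
print(np.abs(lin.coef_).sum(), np.abs(ridge.coef_).sum())
```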
Underfitting
- Model with more parameters
- Feature engineering
- Reduce constraints (regularization hyperparameters)
Testing and Validating
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
Performance Measure
- Regression:
- Root Mean Square Error (RMSE)
- Mean Absolute Error (MAE): more robust when there are many outliers
- Classification
- Accuracy: correct / samples
- Precision: true positives / predicted positives (eg: spam flagging)
- Recall (true positive rate): true positives / real positives
- F1: harmonic mean of recall and precision
- Receiver Operating Characteristic (ROC): recall vs false positive rate (1-true negative rate)
- Confusion Matrix
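The classification metrics above, computed on a small hypothetical set of true/predicted labels (the labels are made up for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # correct / samples -> 0.75
print(precision_score(y_true, y_pred))   # TP / predicted positives -> 0.75
print(recall_score(y_true, y_pred))      # TP / real positives -> 0.75
print(f1_score(y_true, y_pred))          # harmonic mean -> 0.75
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```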
Quick Look at the Data
- Jupyter Notebooks are great for this.
df.head()
df.info()
df.describe()
df[col].value_counts()
- Correlation:
df.corr()
Data Cleaning
Handle missing values (NULLs/NaN) using:
- Remove lines:
df.dropna(subset=[col])
- Remove column:
df.drop(col, axis=1)
- Set a value (zero, mean, median):
df[col].fillna(df[col].median())
or use an Imputer (called SimpleImputer in current sklearn).
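A sketch of imputing with the column median (the tiny DataFrame is my example; SimpleImputer is the current name of the old Imputer class):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(df)  # NaNs replaced by each column's median
print(filled)
```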
Text and Categorical Attributes
- LabelEncoder
- OneHotEncoder
- LabelBinarizer (text to integer and then to one-hot)
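A one-hot encoding sketch (the category values are made up): OneHotEncoder turns a categorical column into one binary column per category.

```python
from sklearn.preprocessing import OneHotEncoder

X = [["INLAND"], ["NEAR OCEAN"], ["INLAND"]]
encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense.
one_hot = encoder.fit_transform(X).toarray()
print(one_hot)       # categories are sorted, so INLAND -> [1, 0]
print(encoder.categories_)
```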
Feature scaling
ML algorithms usually don’t perform well when the input numerical attributes have very different scales.
- Min-max scaling: MinMaxScaler (outliers are a problem)
- Standardization: StandardScaler
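A quick comparison of both scalers on made-up data with one outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier

mm = MinMaxScaler().fit_transform(X).ravel()    # maps to [0, 1]
std = StandardScaler().fit_transform(X).ravel() # zero mean, unit variance

print(mm)   # the outlier squashes the other values near 0
print(std)
```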
Pipeline
from sklearn.pipeline import FeatureUnion
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', Imputer(strategy="median")),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', LabelBinarizer()),
])
full_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
Fine tune
- Grid Search
- Randomized Search
- Ensemble Methods
- Never tune looking at the test set
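A grid-search sketch (the estimator and parameter grid are my choices for illustration): GridSearchCV tries every combination with cross-validation, so tuning uses validation folds rather than the test set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 4]}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=3)     # 2 x 2 combos, 3 folds each
grid.fit(X, y)
print(grid.best_params_)  # best combination found
print(grid.best_score_)   # its mean cross-validated score
```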
Multiclass
Some algorithms are inherently multiclass, but you can also train n binary classifiers (one-versus-all) or one for each pair of classes (one-versus-one). Note: use one-versus-one for algorithms that don't scale well with the training set size (e.g. SVM). Sklearn does this automatically.
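A sketch of the one-versus-rest strategy made explicit (my example; sklearn would normally pick the strategy for you): wrapping a binary classifier trains one estimator per class.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes

ovr = OneVsRestClassifier(LinearSVC(random_state=42, max_iter=10000))
ovr.fit(X, y)
print(len(ovr.estimators_))  # one binary classifier per class -> 3
```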
Multilabel
Outputs multiple binary labels. Evaluate using the (weighted) average F1 score.
Multioutput
Generalization of multilabel where each label can be multiclass (more than 2 values).
Bias/Variance Tradeoff
- Bias: wrong assumptions (e.g. assuming linear when the data is quadratic). High bias == underfit
- Variance: sensitivity to small variations in the training data (many degrees of freedom).
- Irreducible error: noisiness of the data
Dimensionality Reduction
The more dimensions the training set has, the greater the risk of overfitting (high-dimensional datasets are very sparse; distances between random points are large).
Manifold assumption: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. (If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.)
- PCA
- LLE
- MDS
- Isomap
- t-SNE
- LDA
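A PCA sketch on the iris dataset (my choice of data): project onto the top components and check how much variance survives.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

pca = PCA(n_components=2)
X2d = pca.fit_transform(X)
print(X2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)  # variance kept by each component
```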