Sklearn notes

2 minute read


Supervised Learning

Training data includes desired labels.

  • KNN
  • Linear Regression
  • Logistic Regression
  • SVM
  • Decision Trees and Random Forests
  • Neural networks

Unsupervised Learning

  • Clustering
    • k-Means
    • Hierarchical Cluster Analysis (HCA)
    • Expectation Maximization
  • Visualization and dimensionality reduction
    • Principal Component Analysis (PCA)
    • Kernel PCA
    • Locally-Linear Embedding (LLE)
    • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • Association rule learning
    • Apriori
    • Eclat

Semisurpevised Learning

Combination of unsupervised and supervised algorithms.

For example, Google Photos clusters similar faces and then labels the faces in the rest.

Reinforcement Learning

An agent observes the environment, perfoms actions and is rewarded or penalized. And then learns the best strategy (policy) by itself. (AlphaGo)

Batch and Online Learing

  • Batch learning: train, launch into production and run without learning anymore.
  • Online learing: system is trained incrementally by feeding it data instances sequentially (mini-batches). Eg: stock prices

Instance-Based and Model-Based Learning

  • Instance-based: compare new problem with instances seen in training and measure similarity. (KNN, SVM, RBF)
  • Model-based: build a model and predict.


  • Simplify the model: linear instead of high-degree polynomial
  • Regularization: L1, L2, penalize big weights
  • Clean training data


  • Model with more parameters
  • Feature engineering
  • Reduce constraints (regularization hyperparameters)

Testing and Validating

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

Performance Measure

  • Regression:
    • Root Mean Square Error (RMSE)
    • Mean Absolute Error (MAE): good if many outliers
  • Classification
    • Accuracy: correct / samples
    • Precision: true positives / predicted positives (eg: spam flagging)
    • Recall (true positive rate): true positives / real positives
    • F1: harmonic mean of recall and precision
    • Receiver Operating Characteristic (ROC): recall vs false positive rate (1-true negative rate)
    • Confusion Matrix

Quick Look at the Data

  • Jupyter Notebooks are great for this.
  • df.head(),, df.describe(), df[col].value_counts()
  • Correlation: df.corr()

Data Cleaning

Handle NULLs using:

  • Remove lines: df.dropna(subset=[col])
  • Remove column: df.drop(col, axis=1)
  • Set a value (zero, mean, median): df[col].fillna(df[col].median()) or using an Imputer.

Text and Categorical Attributes

  • LabelEncoder
  • OneHotEncoder
  • LabelBinarizer (text to integer and then to one-hot)

Feature scaling

ML algorithms usually don’t perform well when the input numerical attributes have very different scales.

  • Min-max scaling: MinMaxScaler, outliers are a problem
  • Standardization: StandardScaler.


from sklearn.pipeline import FeatureUnion

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),

Fine tune

  • Grid Search
  • Randomized Search
  • Ensemble Methods
  • Never tune looking at the test set


Some algorithms are multiclass but you can also train n binary classifiers (one-versus-all) or one for each pair (one-versus-one). Note: use one vs one for things that doesn’t scale (SVM). Sklearn does this automatically.


Outputs multiple binary labels. Evaluate using the (weighted) average F1 score.


Generalization of multilabel where each label can be multiclass (more than 2 values).

Bias/Variance Tradeoff

  • Bias: wrong assumptions (linear when is quadratic). High bias == underfit
  • Variance: sensitivity to small variations in the training data (many degrees of freedom).
  • Irreducible error: noisiness of the data

Dimensionality Reduction

The more dimensions the training set has the greater risk of overfitting (high-dimensional datasets are very sparse, distance between random points are larger).

Manifold assumption: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. (If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.)

  • PCA
  • LLE
  • MDS
  • Isomap
  • t-SNE
  • LDA