
Machine learning overview

 ·  β˜• 4 min read · πŸ‘€... views

My first post in the new machine learning series.

Although I’ve chosen https://nianze.ml as my personal website domain name, I haven’t really posted any articles on machine learning at all, which may be somewhat misleading. Considering it’s a new year and my website has just been re-designed, it’s the perfect time for new plans, so I’ve decided to begin a new series related to ML: I’ll write down my learning notes as I self-study machine learning. Recently I’ve been reading Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron, which should be a good starting point for this series.

At first I intended to write this post in Chinese, but considering there are so many technical terms in English whose exact Chinese translations I don’t know, I’ll just start with English.

As the first post in this series, let’s take an overview of machine learning systems.

Types of machine learning

There are broadly three ways to classify machine learning systems, and each of these can be further divided into multiple sub-categories:

Machine learning system
β”œβ”€β”€ trained with supervision or without
β”‚   β”œβ”€β”€ supervised learning
β”‚   β”‚   β”œβ”€β”€ k-Nearest Neighbors
β”‚   β”‚   β”œβ”€β”€ Linear Regression
β”‚   β”‚   β”œβ”€β”€ Logistic Regression
β”‚   β”‚   β”œβ”€β”€ Support Vector Machines (SVMs)
β”‚   β”‚   β”œβ”€β”€ Decision Trees and Random Forests
β”‚   β”‚   └── Neural networks
β”‚   β”œβ”€β”€ unsupervised learning
β”‚   β”‚   β”œβ”€β”€ clustering
β”‚   β”‚   β”‚   β”œβ”€β”€ k-Means
β”‚   β”‚   β”‚   β”œβ”€β”€ Hierarchical Cluster Analysis (HCA)
β”‚   β”‚   β”‚   └── Expectation Maximization
β”‚   β”‚   β”œβ”€β”€ Visualization and dimensionality reduction
β”‚   β”‚   β”‚   β”œβ”€β”€ Principal Component Analysis (PCA)
β”‚   β”‚   β”‚   β”œβ”€β”€ Kernel PCA
β”‚   β”‚   β”‚   β”œβ”€β”€ Locally-Linear Embedding (LLE)
β”‚   β”‚   β”‚   └── t-distributed Stochastic Neighbor Embedding (t-SNE)
β”‚   β”‚   └── Association rule learning
β”‚   β”‚       β”œβ”€β”€ Apriori
β”‚   β”‚       └── Eclat
β”‚   β”œβ”€β”€ semisupervised learning
β”‚   └── reinforcement learning
β”œβ”€β”€ learn incrementally or in a whole batch
β”‚   β”œβ”€β”€ online learning (incremental learning)
β”‚   β”‚       β”œβ”€β”€ adapting rapidly to changing data and autonomous system
β”‚   β”‚       └── out-of-core learning (training on large quantities of data)
β”‚   └── batch learning
└── predict based on a model or not
    β”œβ”€β”€ instance-based learning (using a similarity measure)
    └── model-based learning
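
To make the last of these splits (instance-based vs. model-based learning) concrete, here’s a minimal sketch using scikit-learn, the library the book covers; the toy data and parameter choices are mine, invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy data: a noisy linear relationship (made up for illustration)
rng = np.random.RandomState(42)
X = rng.rand(50, 1) * 10
y = 3 * X.ravel() + 2 + rng.randn(50)

# Model-based learning: fit model parameters (slope, intercept),
# then predict from the fitted model
model = LinearRegression().fit(X, y)

# Instance-based learning: memorize the training instances and predict
# using a similarity measure (here, the 3 nearest neighbors)
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

X_new = [[5.0]]
print(model.predict(X_new), knn.predict(X_new))
```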

Main challenges of machine learning

  • Bad data
    • Insufficient quantity of training data
    • Nonrepresentative training data
    • Poor-quality data
    • Irrelevant features
  • Bad algorithm
    • Overfitting the training data
    • Underfitting the training data

We reduce overfitting by constraining the degrees of freedom the model has, which is called regularization. The amount of regularization can be controlled by a hyperparameter, which is a parameter of the learning algorithm (not of the model). The larger the hyperparameter, the smaller the model parameters become: we apply more constraints to the model and leave it fewer degrees of freedom.
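
For instance, scikit-learn’s Ridge regression exposes such a hyperparameter, alpha. A rough sketch (the toy data is invented): the larger alpha is, the smaller the learned coefficients get.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data with a known linear relationship plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + rng.randn(20) * 0.1

# alpha is a hyperparameter of the learning algorithm, not of the model:
# larger alpha -> stronger regularization -> smaller model parameters
for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.abs(ridge.coef_).sum())
```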

On the other side, to solve the underfitting problem, we may consider:

  • selecting a more powerful model with more parameters (see the sketch after this list)
  • feeding better features
  • reducing the constraints (e.g., reducing the regularization hyperparameter)
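
As a sketch of the first remedy (again with scikit-learn and made-up quadratic data), adding polynomial features gives a plain linear model more parameters to work with:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data that a straight line underfits (made up for illustration)
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 6 - 3
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.randn(100)

linear = LinearRegression().fit(X, y)

# A more powerful model: expand to polynomial features, then fit linearly
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# The richer model fits the curved data noticeably better
print(linear.score(X, y), poly.score(X, y))
```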

Testing and validating

Usually we split data into three groups:

  • training set
  • validation set
  • test set

And take the following common workflow:

  1. train multiple models with various hyperparameters using the training set
  2. select the model and hyperparameters that perform best on the validation set
  3. run a single final test against the test set to get an estimate of the generalization error (out-of-sample error); a sketch of the whole workflow follows
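
A minimal sketch of this workflow, assuming scikit-learn and its bundled iris dataset (the split ratios and candidate hyperparameters are arbitrary choices of mine):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two successive splits give training, validation, and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# 1. train candidate models with different hyperparameters on the training set
candidates = [LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
              for c in (0.01, 1.0, 100.0)]

# 2. select the one that performs best on the validation set
best = max(candidates, key=lambda m: m.score(X_val, y_val))

# 3. a single final run against the test set estimates the generalization error
print(best.score(X_test, y_test))
```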

Further, we can use the cross-validation technique to reuse data (sketched after the steps):

  1. split the training set into complementary subsets
  2. train each model against a different combination of these subsets and validate against the remaining parts
  3. select the model type and hyperparameters with the best performance
  4. train the final model by feeding the full training set to the chosen model and hyperparameters
  5. measure the generalization error on the test set
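
Scikit-learn’s GridSearchCV bundles steps 1–4; a sketch under the same assumptions as above (the hyperparameter grid is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GridSearchCV splits the training set into complementary folds, trains on
# each combination of folds and validates on the held-out one, then refits
# the best model and hyperparameters on the full training set
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 1.0, 100.0]}, cv=5)
search.fit(X_train, y_train)

# Finally, measure the generalization error on the untouched test set
print(search.best_params_, search.score(X_test, y_test))
```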

Concept checkout:

  1. How would you define Machine Learning?

    • ML is about building systems that can learn from data. Specifically, given a performance measure, learning means getting better at some task.
  2. Can you name four types of problems where it shines?

    1. Complex problems for which no algorithmic solution is known
    2. Problems that currently require long lists of hand-tuned rules
    3. Systems that need to adapt to fluctuating environments
    4. Data mining (helping humans learn)
  3. What is a labeled training set?

    • It’s a training set that contains the desired solution (a label) for each instance.
  4. What are the two most common supervised tasks?

    • Regression and classification
  5. What is the purpose of a test set and a validation set?

    • A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
    • A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
  6. Why would you prefer cross-validation?

    • Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.