Cross Validation Explained: Evaluating estimator performance. (2024)


The ultimate goal of a Machine Learning Engineer or a Data Scientist is to develop a model that makes predictions on new data or forecasts future events from unseen data. A good model is not one that gives accurate predictions only on the known (training) data, but one that gives good predictions on new data, avoiding both overfitting and underfitting.

After completing this tutorial, you will know:

  • Why cross validation is used: it is a procedure for estimating the skill of the model on new data.
  • There are common tactics that you can use to select the value of k for your dataset.
  • There are commonly used variations on cross-validation such as stratified and LOOCV that are available in scikit-learn.
  • Practical Implementation of k-Fold Cross Validation in Python

To derive a solution we should first understand the problem. Before we proceed to cross validation, let us first understand overfitting and underfitting.

Understanding Underfitting and Overfitting:

Overfit Model: Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well.

Overfitting a model results in good accuracy on the training data set but poor results on new data sets. Such a model is of little use in the real world, as it cannot predict outcomes for new cases.

Underfit Model: Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Underfitting is often a result of an excessively simple model. By simple we mean, for example, that missing data is not handled properly, outliers are not treated, or irrelevant features (those that contribute little to the prediction) are not removed.
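The contrast between the two failure modes can be made visible by comparing training and test accuracy. The sketch below is only an illustration under assumptions that are not in the original article: it uses synthetic data with deliberate label noise, an unconstrained decision tree as the model prone to overfitting, and a depth-1 tree as the overly simple model. A large gap between training and test accuracy suggests overfitting, while low accuracy on both suggests underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y); every setting here is illustrative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for name, depth in [("unconstrained tree", None), ("depth-1 tree", 1)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each model.
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name}: train={results[name][0]:.2f} test={results[name][1]:.2f}")
```

The unconstrained tree memorizes the noisy training labels, so its training accuracy says little about how it behaves on the held-out rows.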


How to Tackle the Problem of Overfitting:

The answer is Cross Validation, which shows how well a model generalizes so that overfitting can be detected and corrected.

A key challenge with overfitting, and with machine learning in general, is that we can’t know how well our model will perform on new data until we actually test it.

To address this, we can split our initial dataset into separate training and test subsets.

There are different types of cross validation techniques, but the overall concept remains the same:

  • Partition the data into a number of subsets
  • Hold out one subset at a time and train the model on the remaining subsets
  • Test the model on the held-out subset
  • Repeat the process for each subset of the dataset
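The partitioning idea described above can be sketched without any model at all. The following minimal example uses only NumPy; the sample count, the number of subsets and the random seed are arbitrary choices for illustration. It checks the key property of the scheme: every data point is held out exactly once.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, chosen arbitrarily
n_samples, n_subsets = 12, 4          # illustrative sizes
indices = rng.permutation(n_samples)  # shuffle the row indices
subsets = np.array_split(indices, n_subsets)

held_out_counts = np.zeros(n_samples, dtype=int)
for i, test_idx in enumerate(subsets):
    # Everything except subset i forms the training set for this round.
    train_idx = np.concatenate([s for j, s in enumerate(subsets) if j != i])
    # A model would be trained on train_idx and tested on test_idx here.
    held_out_counts[test_idx] += 1
    print(f"round {i}: train={len(train_idx)} test={len(test_idx)}")
```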


Types of Cross Validation:

  • K-Fold Cross Validation
  • Stratified K-Fold Cross Validation
  • Leave One Out Cross Validation

Let’s understand each type one by one.

k-Fold Cross Validation:


The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the name of the procedure, such as k=10 becoming 10-fold cross-validation.

If k=5 the dataset will be divided into 5 equal parts and the below process will run 5 times, each time with a different holdout set.

1. Take one group as the holdout or test data set

2. Take the remaining groups as a training data set

3. Fit a model on the training set and evaluate it on the test set

4. Retain the evaluation score and discard the model

At the end of the above process, summarize the skill of the model using the sample of model evaluation scores, for example their mean and standard deviation.
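The four steps can be written out by hand with scikit-learn's KFold splitter. This is a sketch rather than the article's own code: it assumes the Iris data bundled with scikit-learn, an SVC model, and shuffling with a fixed seed (the bundled Iris rows are ordered by class, so unshuffled folds would be unrepresentative).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(X):
    model = SVC()                          # a fresh model for every round
    model.fit(X[train_idx], y[train_idx])  # steps 2 and 3: fit on the training groups
    scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the holdout
    # Step 4: only the score is kept; the model is discarded.

scores = np.array(scores)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```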

How to decide the value of k?

The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.

A value of k=10 is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.

If a value for k is chosen that does not evenly split the data sample, then the groups cannot all be the same size: the remainder of the examples ends up in one or more of the groups. It is preferable to split the data sample into k groups with the same number of samples, so that the model skill scores are all computed on equally sized groups and are comparable.
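A quick way to see what happens with an uneven split is to inspect the fold sizes directly. The example below assumes scikit-learn's KFold with 10 samples and k=3 (both numbers are arbitrary); in this implementation the leftover sample is given to the first fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 dummy samples, not divisible by 3
fold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=3).split(X)]
print(fold_sizes)  # [4, 3, 3]
```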

Stratified k-Fold Cross Validation:

This works the same way as k-fold cross validation, with one slight difference.

The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.

For example, if the data is stratified on Gender (M or F), each fold contains the same proportion of M and F observations as the full dataset.
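The effect of stratification can be checked by counting the categories in each fold. The sketch below uses scikit-learn's StratifiedKFold on a made-up Gender column with 80 M and 20 F rows; the data and the choice of 5 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical, imbalanced Gender column: 80 "M" and 20 "F".
gender = np.array(["M"] * 80 + ["F"] * 20)
X = np.zeros((len(gender), 1))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for _, test_idx in skf.split(X, gender):
    n_m = int((gender[test_idx] == "M").sum())
    n_f = int((gender[test_idx] == "F").sum())
    fold_counts.append((n_m, n_f))

print(fold_counts)  # every fold keeps the 80/20 ratio: (16, 4)
```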


Leave One Out Cross Validation (LOOCV):

This approach leaves 1 data point out of the training data, i.e. if there are n data points in the original sample, then n-1 points are used to train the model and the remaining 1 point is used as the validation set. This is repeated for all combinations in which the original sample can be separated this way, and the error is then averaged over all trials to give the overall effectiveness.

The number of possible combinations is equal to the number of data points in the original sample, n.
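In scikit-learn this scheme is available as LeaveOneOut. The sketch below assumes the bundled Iris data (n = 150) and an SVC model, neither of which is prescribed by the text. Because each validation set holds a single point, every per-round accuracy is either 0 or 1, and their mean is the LOOCV estimate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

n_rounds = loo.get_n_splits(X)  # one round per data point
scores = cross_val_score(SVC(), X, y, cv=loo)
print("rounds:", n_rounds, "mean accuracy:", scores.mean())
```

Since the model is refit n times, this approach becomes expensive on large datasets.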


Cross Validation is a very useful technique for assessing the effectiveness of your model, particularly in cases where you need to mitigate over-fitting.

Implementation of Cross Validation In Python:

We do not need to call the fit method separately when using cross validation: the cross_val_score function fits the model itself on each training split while performing the cross-validation. Below is an example of k-fold cross validation.

import pandas as pd
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Read the CSV file
data = pd.read_csv("D://RAhil//Kaggle//Data//Iris.csv")

# Create the independent (X) and dependent (y) datasets
X = data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm']]
y = data['Species']

# cross_val_score fits a copy of the model on each of the 10 folds
model = svm.SVC()
accuracy = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print(accuracy)

# Get the mean accuracy across the folds
print("Accuracy of Model with Cross Validation is:", accuracy.mean() * 100)


The Accuracy of the model is the average of the accuracy of each fold.
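For readers who do not have the CSV file at hand, roughly the same experiment can be run on the copy of Iris that ships with scikit-learn. This variant is an assumption-laden sketch, not the article's code: it takes the first three columns of the bundled data as stand-ins for the sepal length, sepal width and petal length features. With an integer cv and a classifier, cross_val_score uses stratified folds by default.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X[:, :3]  # sepal length, sepal width, petal length

accuracy = cross_val_score(SVC(), X, y, scoring="accuracy", cv=10)
print(accuracy)
print("Mean accuracy (%):", accuracy.mean() * 100)
print("Std of fold accuracies (%):", accuracy.std() * 100)
```

Reporting the standard deviation alongside the mean gives a sense of how much the score varies from fold to fold.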

In this tutorial, you discovered why we need cross validation, got a gentle introduction to the different types of cross validation techniques, and saw a practical example of the k-fold cross validation procedure for estimating the skill of machine learning models.

Specifically, you learned:

  • That cross validation is a procedure used to detect overfitting and estimate the skill of the model on new data.
  • There are common tactics that you can use to select the value of k for your dataset.
  • There are commonly used variations on cross-validation, such as stratified and LOOCV, that are available in scikit-learn.

