A Data Leakage mistake often made while using GridSearchCV / RandomizedSearchCV

Using a scikit-learn Pipeline lets us fit transformations on the training split only during cross-validation, avoiding data leakage during hyper-parameter tuning.

We all know the importance of keeping separate train and test sets to avoid data leakage. We use the statistics of the training data only to transform our data, i.e. in scikit-learn terms, for any data transformation we 'fit' only on the training data, never on the test/validation data.

e.g. if we want to standardize a given feature, we calculate the mean and standard deviation from the training data only, and use those statistics to standardize both the training and the test/validation data.
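As a minimal sketch of this idea (with made-up numbers), StandardScaler is fitted on the training data only, and the test set is transformed with the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std computed from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

# train mean is 2.5 and std is ~1.118; the test values (even the outlier 10.0)
# never influence these statistics
```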

I have noticed that people, after doing the above transformation, pass the already-transformed dataset directly to GridSearchCV or RandomizedSearchCV. These internally perform their own train/validation cross-validation splits of the transformed data, and they have no mechanism to compute statistics on the training split only and leave the validation split out.

In other words, ideally we should 'fit_transform' each training split and only 'transform' the corresponding validation split in every iteration of the cross-validation run.
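Done by hand, that per-split discipline looks like the sketch below (synthetic data; the pipeline approach later in this post automates exactly this loop):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    scaler = StandardScaler()
    # fit_transform on the training split only...
    X_tr = scaler.fit_transform(X[train_idx])
    # ...and only transform the validation split with the train statistics
    X_val = scaler.transform(X[val_idx])
    model = LogisticRegression().fit(X_tr, y[train_idx])
    scores.append(model.score(X_val, y[val_idx]))
```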

We should use a scikit-learn Pipeline along with GridSearchCV or RandomizedSearchCV (or any other cross-validation) to avoid data leakage.

Instead of passing the estimator (model) directly, we pass a pipeline that bundles the required data transformations with the estimator/model. This way, data leakage is avoided at every cross-validation step.

Consider an example without a Pipeline:

Here we are using the make_column_transformer function to transform the data. You can also use ColumnTransformer. I am not using any dataset here, as I just want to explain the utility of the pipeline. You can refer to the model training notebook in this GitHub repository for usage with other modelling functions.

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer

# using make_column_transformer, which transforms the input data as specified:
# the first feature (column index 0) is discretized with KBinsDiscretizer,
# the second is one-hot encoded, and so on

transformation_required = make_column_transformer(
    (KBinsDiscretizer(n_bins=4, strategy='quantile'), [0]),
    (OneHotEncoder(handle_unknown='ignore'), [1]),
    (ResponseEncoding(), [2]),  # custom transformer defined elsewhere
    (PowerTransformer(method='yeo-johnson', standardize=True), [3, 4, 5]),
)

As I have noticed in most code, X_train is first transformed and then used inside GridSearchCV, as shown below:

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# usually I have seen X_train being fully transformed before any cross-validation,
# so every CV split will have data leakage

X_train = transformation_required.fit_transform(X_train.values,y_train)

# example gridsearchcv code
 
model = SGDClassifier(loss='log_loss', random_state=92)
# no pipeline here, so the key is just 'alpha', not 'sgdclassifier__alpha'
params = {'alpha': [0.0001, 0.005, 0.001, 0.05, 0.01, 0.1, 0.5, 1]}  # hyper-parameters

clf = GridSearchCV( model, params, cv=5, return_train_score=True, refit=False,
                    scoring=['recall', 'precision', 'f1'], verbose=4)

clf.fit(X_train,y_train)

Here we use a simple example with SGDClassifier. We fit GridSearchCV on the training data only (X_train, y_train), which was transformed beforehand. GridSearchCV has no information about what transformations are required, yet it performs its own multiple train/validation splits internally, each of which would need its own separate transformation to avoid data leakage.

Let’s look at an example with Pipeline:

Here we are using the make_pipeline function. You can also use the Pipeline class.

from sklearn.pipeline import make_pipeline

# creating pipeline that will first transform any input data and then apply the estimator.
pipe_lr = make_pipeline(
                        transformation_required ,
                        SGDClassifier(loss='log_loss', random_state=92)
                       )

params = {'sgdclassifier__alpha': [0.0001, 0.005, 0.001, 0.05, 0.01, 0.1, 0.5, 1]}  # hyper-parameters

# here passing the pipeline as estimator so any cross-validation will pass the
# data to the pipeline which will fit_transform the train split only and avoid
# data leakage

clf = GridSearchCV( pipe_lr, params, cv=5, return_train_score=True, refit=False,
                    scoring=['recall', 'precision', 'f1'], verbose=4)

# below X_train is the raw data (without any transformation) as pipeline will
# take care of that for each CV split

clf.fit(X_train,y_train) 
 
You can refer to the model training notebook in this GitHub repo for usage of pipelines and for other hyper-parameter tuning techniques like BayesSearchCV.

Thank you for reading!!
