A Data Leakage mistake often made while using GridSearchCV / RandomizedSearchCV
Using a scikit-learn pipeline lets us fit the transformations on the training split only during cross-validation, thus avoiding data leakage while hyper-parameter tuning.
For example, if we want to standardize a given feature, we calculate the mean and standard deviation from the training data only and use those statistics to standardize the test/validation data.
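As a minimal sketch of that idea (StandardScaler is just an illustrative choice; X_train and X_test are placeholder names for the raw splits):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# mean and standard deviation are learned from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# the held-out data is standardized with the training statistics (no fitting here)
X_test_scaled = scaler.transform(X_test)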
I have noticed that people, after applying the above transformation, directly pass the already transformed dataset to GridSearchCV or RandomizedSearchCV. Internally these perform their own train/validation cross-validation splits of the transformed data, and they have no mechanism to recompute the statistics from the training split only while leaving the validation split out.
In other words, ideally we should fit_transform each training split and only transform the corresponding validation split for each iteration of the cross-validation run.
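This is exactly what a pipeline inside GridSearchCV does for us. A hand-rolled sketch of the same idea (StandardScaler and SGDClassifier are illustrative; X_train and y_train are assumed to be NumPy arrays with a binary target):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=92)
scores = []
for train_idx, val_idx in skf.split(X_train, y_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]

    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_tr)   # statistics come from this training split only
    X_val = scaler.transform(X_val)     # the validation split is only transformed, never fitted

    model = SGDClassifier(loss='log_loss', random_state=92).fit(X_tr, y_tr)
    scores.append(f1_score(y_val, model.predict(X_val)))

print(np.mean(scores))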
We should use a scikit-learn Pipeline along with GridSearchCV or RandomizedSearchCV (or any other cross-validation) to avoid this data leakage.
Instead of passing the estimator (model) alone, we pass a pipeline that bundles the required data transformations with the estimator, so data leakage is avoided in every cross-validation split.
Consider an example without a Pipeline:
Here we are using the make_column_transformer helper to declare the transformations. You can also use the ColumnTransformer class directly. I am not using any dataset here, as I just want to explain the utility of the pipeline. You can refer to the model training notebook in this GitHub repository for usage with other modelling functions.
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer

# using make_column_transformer here to transform any input data as specified,
# e.g. the first feature (index 0) will be binned with KBinsDiscretizer,
# the second one-hot encoded, and so on
transformation_required = make_column_transformer(
    (KBinsDiscretizer(n_bins=4, strategy='quantile'), [0]),
    (OneHotEncoder(handle_unknown='ignore'), [1]),
    (ResponseEncoding(), [2]),   # custom transformer (defined in the linked notebook)
    (PowerTransformer(method='yeo-johnson', standardize=True), [3, 4, 5]),
)
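Note that ResponseEncoding above is not a scikit-learn class; it is a custom transformer defined in the linked notebook. A minimal, hypothetical version (assuming it simply replaces each category with the mean of the target seen during fit) could look like this:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ResponseEncoding(BaseEstimator, TransformerMixin):
    # hypothetical response/target encoder: maps each category to the mean of y
    # observed for that category during fit; unseen categories fall back to the global mean
    def fit(self, X, y):
        X = np.asarray(X).ravel()
        y = np.asarray(y, dtype=float)
        self.global_mean_ = y.mean()
        self.mapping_ = {cat: y[X == cat].mean() for cat in np.unique(X)}
        return self

    def transform(self, X):
        X = np.asarray(X).ravel()
        encoded = [self.mapping_.get(v, self.global_mean_) for v in X]
        return np.array(encoded).reshape(-1, 1)

Because such an encoder is fitted on the target, it is especially important that it only ever sees the training split, which is exactly what the pipeline approach below guarantees.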
As I have noticed in most code, X_train is first transformed in full and then used inside GridSearchCV, as shown below:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# usually I have seen X_train being fully transformed before any cross-validation,
# so every CV split will have data leakage
X_train = transformation_required.fit_transform(X_train.values, y_train)

# example GridSearchCV code
model = SGDClassifier(loss='log_loss', random_state=92)
# note: with a bare estimator the parameter is just 'alpha'; the
# 'sgdclassifier__' prefix is only needed once the estimator sits inside a pipeline
params = {'alpha': [0.0001, 0.005, 0.001, 0.05, 0.01, 0.1, 0.5, 1]}  # hyper-parameters
clf = GridSearchCV(
    model,
    params,
    cv=5,
    return_train_score=True,
    refit=False,
    scoring={'recall', 'precision', 'f1'},
    verbose=4,
)
clf.fit(X_train, y_train)
Here we use a simple SGDClassifier example. To fit the GridSearchCV we pass only the training data (X_train, y_train), which has already been transformed beforehand. GridSearchCV receives no information about which transformations are required, yet it is going to perform its own multiple train/validation splits, each of which would need its own separate fit of the transformations to avoid data leakage.
Let’s look at an example with Pipeline:
Here we are using the make_pipeline helper. You can also use the Pipeline class.
from sklearn.pipeline import make_pipeline

# creating a pipeline that will first transform any input data and then apply the estimator
pipe_lr = make_pipeline(
    transformation_required,
    SGDClassifier(loss='log_loss', random_state=92),
)

params = {'sgdclassifier__alpha': [0.0001, 0.005, 0.001, 0.05, 0.01, 0.1, 0.5, 1]}  # hyper-parameters

# here we pass the pipeline as the estimator, so every cross-validation split will
# send its data through the pipeline, which fit_transforms the train split only and
# avoids data leakage
clf = GridSearchCV(
    pipe_lr,
    params,
    cv=5,
    return_train_score=True,
    refit=False,
    scoring={'recall', 'precision', 'f1'},
    verbose=4,
)

# below, X_train is the raw data (without any transformation) as the pipeline will
# take care of that for each CV split
clf.fit(X_train, y_train)
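Since refit=False is used above, GridSearchCV does not refit a final model; the scores for each hyper-parameter value can be inspected through cv_results_, for example:

import pandas as pd

# each row corresponds to one value of alpha, with scores averaged over the 5 CV splits
results = pd.DataFrame(clf.cv_results_)
print(results[['param_sgdclassifier__alpha',
               'mean_test_f1', 'mean_test_precision', 'mean_test_recall']])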
You can refer to the model training notebook in this GitHub repo for usage of the pipeline and for other hyper-parameter tuning techniques like BayesSearchCV; a rough sketch of BayesSearchCV with the same pipeline is given below. Thank you for reading!!
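A minimal sketch of BayesSearchCV with the same pipeline, assuming the scikit-optimize package is installed (the search space below is purely illustrative):

from skopt import BayesSearchCV
from skopt.space import Real

# the pipeline goes in as the estimator here too, so each CV split is
# transformed independently and no data leakage occurs
opt = BayesSearchCV(
    pipe_lr,
    {'sgdclassifier__alpha': Real(1e-4, 1.0, prior='log-uniform')},
    n_iter=20,
    cv=5,
    scoring='f1',
    random_state=92,
)
opt.fit(X_train, y_train)
print(opt.best_params_, opt.best_score_)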
