Ordinal Encoder

In an academic study on tree-based models:

Ordinal encoder scores are, on average, around 1% to 2% better than one-hot encoder scores, and during the second training run the ordinal encoder is about 2.5 times faster than the one-hot encoder.

It has also been observed that an ordinal encoder should be used to obtain correct feature importances.

In tree-based models, if our aim is only to get scores, we can use either a one-hot encoder or an ordinal encoder.

However, we will use the ordinal encoder in tree-based models because of the advantages mentioned above.
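As a minimal sketch of this approach, the snippet below fits an ordinal encoder on training data and feeds the result to a tree-based model. The data and column names are purely illustrative (they are not from the study cited above), and scikit-learn is assumed to be available.

```python
# Illustrative sketch: OrdinalEncoder feeding a tree-based model.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier

X_train = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
y_train = [0, 1, 0, 1]

# Unknown categories at predict time are mapped to -1 instead of raising.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train_enc = enc.fit_transform(X_train)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train_enc, y_train)

X_test = pd.DataFrame({"color": ["green", "purple"]})  # "purple" is unseen
X_test_enc = enc.transform(X_test)  # "purple" becomes -1
predictions = model.predict(X_test_enc)
```

Because the encoder is fitted only on the training data, the vocabulary it learns cannot be influenced by the test set.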

We said that if your goal is only to get scores, you can use either an ordinal encoder or a one-hot encoder. However, we do not convert observations into dummies with the get_dummies function, because incorrect use of get_dummies can cause data leakage problems. Remember that we should always prefer a one-hot encoder over get_dummies.

The get_dummies function converts the categorical variables in a data set into a binary representation; that is, it performs one-hot encoding. However, if it is not used correctly during model training, get_dummies can lead to data leakage.
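To make the mechanics concrete, here is what get_dummies produces on a small illustrative frame (the column and values are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green"]})

# get_dummies replaces the "color" column with one binary column per category.
dummies = pd.get_dummies(df, columns=["color"])
print(dummies.columns.tolist())
# ['color_blue', 'color_green', 'color_red']
```

Note that the output columns are derived from whatever categories happen to be present in the frame you pass in, which is exactly where the leakage risks below come from.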

Data leakage is the situation where information from the test set leaks into the model training phase, which can make the model's measured performance misleadingly high. There are several ways data leakage can occur with get_dummies:

Converting the Training and Test Set at the Same Time: If the training and test sets are combined and get_dummies is applied to the combined set, the model "learns" information about the categorical values in the test set during the training phase. This prevents you from accurately estimating the model's performance on new data it will encounter in a real-world scenario.
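The anti-pattern described above can be sketched as follows; the data is invented for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"city": ["ankara", "izmir", "bursa", "ankara"],
                   "y": [0, 1, 1, 0]})

# Anti-pattern: encoding BEFORE the split, so the dummy columns depend on
# categories that may belong only to the (future) test rows.
encoded = pd.get_dummies(df, columns=["city"])
train, test = train_test_split(encoded, test_size=0.5, random_state=0)

# The training frame now carries a "city_bursa" column regardless of whether
# "bursa" ever appears in the training rows themselves.
print(train.columns.tolist())
```

The correct order is the reverse: split first, then fit the encoding on the training portion only.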

Categorical Values Not in the Training Data Set: If the test set contains new categorical values that are absent from the training set and these values are transformed with get_dummies, the model will not have "learned" how to deal with them, and the resulting columns will not match those it was trained on. This hurts the model's ability to generalize.
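The column mismatch is easy to reproduce. In this made-up example, applying get_dummies to the training and test sets separately produces incompatible frames:

```python
import pandas as pd

train = pd.DataFrame({"city": ["ankara", "izmir"]})
test = pd.DataFrame({"city": ["ankara", "bursa"]})  # "bursa" unseen in training

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# The column sets disagree, so a model fitted on train_d cannot be applied
# to test_d directly.
print(train_d.columns.tolist())  # ['city_ankara', 'city_izmir']
print(test_d.columns.tolist())   # ['city_ankara', 'city_bursa']
```
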

Tools such as the One Hot Encoder are designed to avoid these problems: they build a vocabulary of categorical values from the training set and transform both the training and test sets based on that vocabulary. In this way, the model is trained only on information learned from the training data, and when it encounters new categorical values in the test set, it handles them with a well-defined strategy.

When using get_dummies, you must take extra precautions to transform the test set independently of the training set and to handle categorical values that are absent from the training data. This is usually achieved by keeping the data sets separate and applying the correct transformations during the preprocessing stage.
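One such precaution, sketched here on invented data, is to reindex the test dummies onto the training columns so the two frames always line up:

```python
import pandas as pd

train = pd.DataFrame({"city": ["ankara", "izmir"]})
test = pd.DataFrame({"city": ["ankara", "bursa"]})

train_d = pd.get_dummies(train)

# Reindex the test dummies onto the training columns: columns for unseen test
# categories are dropped, and missing training columns are filled with 0.
test_d = pd.get_dummies(test).reindex(columns=train_d.columns, fill_value=0)
print(test_d.columns.tolist())  # now matches train_d.columns
```

This mimics what a fitted one-hot encoder does, but it is manual and easy to get wrong, which is why the encoder is preferred.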
