Default · September 1, 2021




One-Hot, Label, Target, Frequency, and Embedding Encoders for Categorical Features
import pandas as pd
from kaggler.preprocessing import OneHotEncoder, LabelEncoder, TargetEncoder, FrequencyEncoder, EmbeddingEncoder
trn = pd.read_csv('train.csv')
target_col = trn.columns[-1]
cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']

ohe = OneHotEncoder(min_obs=100) # grouping all categories with fewer than 100 occurrences
lbe = LabelEncoder(min_obs=100) # grouping all categories with fewer than 100 occurrences
te = TargetEncoder() # replacing each category with the average target value of the category
fe = FrequencyEncoder() # replacing each category with the frequency value of the category
ee = EmbeddingEncoder() # mapping each category to a vector of real numbers

# Fit each encoder on the raw training columns and keep every output separate,
# so that no encoder is fit on another encoder's output.
X_ohe_trn = ohe.fit_transform(trn[cat_cols]) # scipy sparse matrix
X_lbe_trn = lbe.fit_transform(trn[cat_cols])
X_te_trn = te.fit_transform(trn[cat_cols], trn[target_col]) # target encoding needs the target column
X_fe_trn = fe.fit_transform(trn[cat_cols])
X_ee_trn = ee.fit_transform(trn[cat_cols], trn[target_col]) # numpy matrix

# Apply the fitted encoders to the raw test columns.
tst = pd.read_csv('test.csv')
X_ohe_tst = ohe.transform(tst[cat_cols])
X_lbe_tst = lbe.transform(tst[cat_cols])
X_te_tst = te.transform(tst[cat_cols])
X_fe_tst = fe.transform(tst[cat_cols])
X_ee_tst = ee.transform(tst[cat_cols])
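
To make the encoder comments above concrete, here is a minimal plain-pandas sketch of three of the ideas: grouping rare categories (what `min_obs` is described as doing), frequency encoding, and target encoding. It is not Kaggler's implementation. The helper names `group_rare`, `frequency_encode`, and `target_encode`, the `__other__` label, and the toy columns are assumptions made up for illustration, and the frequency here is a raw count, which may differ from what the library returns.

```python
import pandas as pd

# Toy data; the column names "city" and "target" are illustrative only.
df = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "c"],
    "target": [1, 0, 1, 0, 0, 1],
})


def group_rare(s: pd.Series, min_obs: int, other: str = "__other__") -> pd.Series:
    """Replace categories seen fewer than min_obs times with one shared label."""
    counts = s.map(s.value_counts())  # per-row count of that row's category
    return s.where(counts >= min_obs, other)


def frequency_encode(s: pd.Series) -> pd.Series:
    """Replace each category with the number of rows it appears in."""
    return s.map(s.value_counts())


def target_encode(s: pd.Series, y: pd.Series) -> pd.Series:
    """Replace each category with the mean target of its rows.

    Hypothetical naive version: no smoothing and no out-of-fold
    estimation, so it leaks the target on the rows it was fit on.
    """
    return s.map(y.groupby(s).mean())


grouped = group_rare(df["city"], min_obs=2)
freq = frequency_encode(df["city"])
te_vals = target_encode(df["city"], df["target"])

print(pd.DataFrame({"city": df["city"], "grouped": grouped,
                    "freq": freq, "target_enc": te_vals}))
```

The naive target encoding above uses each row's own target when computing its category mean, which is why the per-category means must be learned on training data only and then mapped onto the test data, as the fit_transform / transform split in the snippet does.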

