A new categorical encoder for handling categorical features in scikit-learn
This work is supported by the Université Paris-Saclay Center for Data Science
Short summary: an improved categorical encoder has landed in scikit-learn (the PR has been merged in master and this will be included in the next release).
Update: the original version of this post mentioned the CategoricalEncoder object, but this was renamed to OneHotEncoder / OrdinalEncoder before the final release of scikit-learn 0.20.
When working with prediction problems, your dataset will in many cases contain categorical variables. These are non-numeric variables -- or numeric variables whose values should not be interpreted as numeric -- that typically take a limited number of unique values (the categories or levels). Most machine learning models, on the other hand, require numeric input data. Therefore, categorical variables are encoded: they are converted to one or multiple numeric features. A well-known example is one-hot or dummy encoding.
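To make this concrete, a minimal sketch of what a one-hot encoding does (with made-up toy data, just for illustration):

colors = ['red', 'green', 'red', 'blue']
categories = sorted(set(colors))  # ['blue', 'green', 'red']
# one 0/1 column per category: a 1 marks the category of that row
one_hot = [[int(value == category) for category in categories]
           for value in colors]
# one_hot == [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]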
Currently there is no good out-of-the-box solution in scikit-learn. There is the OneHotEncoder, which provides one-hot encoding, but because it only works on integer columns and has a bit of an awkward API, it is rather limited in practice.
Chris Moffitt recently wrote a nice guide on how to encode categorical variables in Python (see his blogpost). He shows different ways to solve this: by (mis)using the LabelEncoder (which is actually meant for the target variable, not for encoding features), by using pandas' get_dummies, etc.
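For instance, the pandas route he describes looks roughly like this (a sketch on a made-up toy frame, not code taken from his post):

import pandas as pd

df = pd.DataFrame({'sex': ['female', 'male', 'female']})
# get_dummies creates one 0/1 column per category: 'sex_female', 'sex_male'
pd.get_dummies(df, columns=['sex'])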
But none of these solutions is ideal for the simple cases, nor can they readily be integrated in scikit-learn pipelines.
The newly added categorical encoding options try to solve this: they provide a built-in way to encode your categorical variables with some common options (either a one-hot or dummy encoding with the improved OneHotEncoder, or an ordinal encoding with the OrdinalEncoder).
Example
To illustrate the basic usage of this new transformer, let's load the titanic survival dataset:
import pandas as pd

titanic = pd.read_csv("https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv")
titanic = titanic[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'survived']].dropna()
titanic.head()
This dataset contains two categorical variables ("sex" and "embarked"). Now we can use the OneHotEncoder to transform those two columns into one-hot encoded or dummy columns (the "sex" feature results in 2 dummy columns for female/male, the "embarked" feature in 3 columns, which together gives the resulting transformed array with 5 columns):
from sklearn.preprocessing import OneHotEncoder

# sparse=False returns a dense numpy array instead of a sparse matrix
encoder = OneHotEncoder(sparse=False)
encoder.fit_transform(titanic[['sex', 'embarked']])
# the categories learned from the data during fit, one array per column
encoder.categories_
See the development docs for more information.
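For comparison, the OrdinalEncoder replaces each category by a single integer code instead of a set of dummy columns. A minimal sketch on the same two columns:

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
# each row becomes a pair of integer codes, e.g. [0., 2.]
ordinal_encoder.fit_transform(titanic[['sex', 'embarked']])
# the learned categories, in the order of their integer codes
ordinal_encoder.categories_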
Having this conversion available as a scikit-learn transformer also makes it easier to put in a Pipeline. At the moment, however, this is not yet fully straightforward, because we need to combine the output of this categorical encoder with the other, numeric columns. Currently you can already use the FeatureUnion or the DataFrameMapper from the sklearn-pandas project, but the future ColumnTransformer will provide a built-in way to make this much easier (this is another PR I am working on: https://github.com/scikit-learn/scikit-learn/pull/9012).
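To give an idea of where this is going, a sketch of how that combination looks with the ColumnTransformer API as it eventually shipped in scikit-learn 0.20 (the choice of columns and classifier below is just an illustration):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the two categorical columns, pass the numeric ones through
preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(), ['sex', 'embarked'])],
    remainder='passthrough')
model = Pipeline([('preprocess', preprocess),
                  ('classifier', LogisticRegression())])
model.fit(titanic.drop('survived', axis=1), titanic['survived'])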
This is brand new functionality in scikit-learn, so feedback is very welcome! (the PR)
Want more categorical encoders?
The OneHotEncoder and OrdinalEncoder only provide two ways to encode, but there are many more possible ways to convert your categorical variables into numeric features suited to feed into models. Category Encoders is a scikit-learn-contrib package that provides a whole suite of scikit-learn compatible transformers for different types of categorical encodings.
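As a quick taste (a hedged sketch; the BinaryEncoder class and its cols parameter are as documented by that project, installable with pip install category_encoders):

import category_encoders as ce

# binary encoding represents the category codes in base 2, so it needs only
# a handful of 0/1 columns rather than one column per category
encoder = ce.BinaryEncoder(cols=['sex', 'embarked'])
encoder.fit_transform(titanic[['sex', 'embarked']])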