class: center, middle

# Scikit-learn and tabular data: closing the gap

EuroScipy 2018

Joris Van den Bossche

https://github.com/jorisvandenbossche/talks/

.affiliations[
  ![:scale 65%](img/logoUPSayPlusCDS_990.png)
  ![:scale 25%](img/inria-logo.png)
]

---

## About me

Joris Van den Bossche

- Background: PhD bio-science engineer, air quality research
- Open source enthusiast: pandas core dev, geopandas maintainer, scikit-learn contributor
- Currently working at the Université Paris-Saclay Center for Data Science (Inria)

https://github.com/jorisvandenbossche
Twitter: [@jorisvdbossche](https://twitter.com/jorisvdbossche)
.affiliations[
  ![:scale 65%](img/logoUPSayPlusCDS_990.png)
  ![:scale 25%](img/inria-logo.png)
]

---
class: center, middle
*Mind the Gap! Bridging the pandas – scikit-learn dtype divide*

talk by Tom Augspurger (PyData Chicago, 2016)

???
The title of my talk is based on this talk that Tom Augspurger (pandas, dask-ml developer) gave two years ago on the difficulty of combining scikit-learn and pandas: the gap between the two libraries.

Scikit-learn traditionally centered its data model around numpy arrays. However, in an important subset of scikit-learn's use cases, the original data in the machine learning pipeline is tabular: heterogeneously typed and labeled. In the meantime, pandas has become very popular and is increasingly used to represent such tabular data, but scikit-learn does not always play well with heterogeneous DataFrames.

We have been working on making this gap smaller.

---
class: center
.center[
]

## scikit-learn: Machine learning in Python

--
count: false

- Popular and widely used package
- Great documentation
- Simple and elegant API

???
- 28k stars on GitHub
- 1000+ contributors
- 400k/month unique visits on http://scikit-learn.org/

---

## scikit-learn: API principles

- Consistency
- Inspection
- Non-proliferation of classes
- Composition
- Sensible defaults

(from http://tomdlt.github.io/decks/2018_pydata/#1)

---

## Estimator

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
y_pred = model.`fit`(X_train, y_train).`predict`(X_test)
```

---
count: false

## Estimator

```python
from sklearn.ensemble import `RandomForestClassifier`

model = `RandomForestClassifier`()
y_pred = model.fit(X_train, y_train).predict(X_test)
```

---

## Other estimator

```python
from sklearn.linear_model import `LogisticRegression`

model = `LogisticRegression`(C=10)
y_pred = model.fit(X_train, y_train).predict(X_test)
```

---

## Transformer

```python
from sklearn.preprocessing import `StandardScaler`

# transformer
trans = `StandardScaler`()
X_train_2 = trans.`fit`(X_train).`transform`(X_train)
X_test_2 = trans.`transform`(X_test)
```

---

## Transformer + Classifier

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# transformer
trans = StandardScaler()
X_train_2 = trans.fit(X_train).transform(X_train)
X_test_2 = trans.transform(X_test)

# classifier
model = LogisticRegression(C=10)
y_pred = model.fit(X_train_2, y_train).predict(X_test_2)
```

---
count: false

## Transformer + Classifier

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# transformer
trans = StandardScaler()
X_train_2 = trans.`fit`(X_train).`transform`(X_train)
X_test_2 = trans.`transform`(X_test)

# classifier
model = LogisticRegression(C=10)
y_pred = model.`fit`(X_train_2, y_train).predict(X_test_2)
```

---

## Pipeline!

```python
from sklearn.pipeline import `make_pipeline`
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# transformer + classifier
model = `make_pipeline`(StandardScaler(), LogisticRegression(C=10))
y_pred = model.`fit`(X_train, y_train).predict(X_test)
```

---
count: false

## Pipeline!

```python
from sklearn.pipeline import `make_pipeline`
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# transformer + classifier
model = `make_pipeline`(StandardScaler(), LogisticRegression(C=10))
y_pred = model.`fit`(X_train, y_train).predict(X_test)
```

Pipelines chain preprocessing steps (scaling, feature extraction or selection, ...) with a final estimator:

- Convenience + robust pipelines
- Avoid leaking information (cross-validation)
- Joint parameter selection (see the sketch on the next slide)

???
convenience and robust pipelines:
- sklearn has the concept of "transformers"
- the fit-transform paradigm ensures train and test-set categories correspond
- with a pipeline: easier separation of train and test data - not leaking statistics
- with everything part of a single pipeline, it is easier to put into production (when using transform + predict, and not fit!); with manual code there might be changes in the data that don't work with your model, e.g. unseen categories can be ignored, while get_dummies cannot do this
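---

## Pipeline + joint parameter selection

A minimal sketch of joint parameter selection over a whole pipeline; the step name `logisticregression` is the lowercased class name that `make_pipeline` generates:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = make_pipeline(StandardScaler(), LogisticRegression())

# the scaler is re-fit on each training fold, so no statistics
# leak from the validation folds into the model selection
grid = GridSearchCV(model, param_grid={'logisticregression__C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
```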
---
class: center, middle

# ... and now in practice with tabular data

---

## "Data science hello world" example

Titanic: classification with a simple heterogeneous table

```python
df = pd.read_csv("titanic3.csv")
df.head()
```
|      | pclass | survived | sex    | age     | sibsp | parch | fare     | embarked |
|------|--------|----------|--------|---------|-------|-------|----------|----------|
| 0    | 1      | 1        | female | 29.0000 | 0     | 0     | 211.3375 | S        |
| 1    | 1      | 1        | male   | 0.9167  | 1     | 2     | 151.5500 | S        |
| ...  | ...    | ...      | ...    | ...     | ...   | ...   | ...      | ...      |
| 1307 | 3      | 0        | male   | 27.0000 | 0     | 0     | 7.2250   | C        |
| 1308 | 3      | 0        | male   | 29.0000 | 0     | 0     | 7.8750   | S        |
```python
y = df['survived']
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

???
- "Hello world" example:
  - some numerical columns -> scale
  - some categorical columns -> one-hot encode
- using the titanic example here for the rest of the talk

---

## "Data science hello world" example

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
```

.small[
```
Traceback ...
ValueError: could not convert string to float: 'S'
```
]

???
this is to be expected, because actual models need numerical data -> we need to encode

--
count: false

Need encoding:

```python
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit_transform(X_train[['sex', 'embarked']])
```

.small[
```
Traceback ...
ValueError: could not convert string to float: 'S'
```
]

---

## "Data science hello world" example

Workarounds:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

le = LabelEncoder()
X['embarked'] = le.fit_transform(X['embarked'])

ohe = OneHotEncoder()
ohe.fit_transform(X[['embarked']])
```

```python
X['embarked'].fillna('missing', inplace=True)
X_encoded = pd.get_dummies(X)
X_encoded
```
|     | pclass | age     | sibsp | parch | fare     | sex_female | sex_male | embarked_C | embarked_Q | embarked_S |
|-----|--------|---------|-------|-------|----------|------------|----------|------------|------------|------------|
| 0   | 1      | 29.0000 | 0     | 0     | 211.3375 | 1          | 0        | 0          | 0          | 1          |
| 1   | 1      | 0.9167  | 1     | 2     | 151.5500 | 0          | 1        | 0          | 0          | 1          |
| ... | ...    | ...     | ...   | ...   | ...      | ...        | ...      | ...        | ...        | ...        |
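And `pd.get_dummies` derives its output columns from whatever values it happens to see, so train and test sets can silently end up with different columns. A toy sketch (hypothetical frames, for illustration):

```python
import pandas as pd

train = pd.DataFrame({'embarked': ['S', 'C', 'Q']})
test = pd.DataFrame({'embarked': ['S', 'S']})   # 'C' and 'Q' never observed

pd.get_dummies(train).columns.tolist()  # ['embarked_C', 'embarked_Q', 'embarked_S']
pd.get_dummies(test).columns.tolist()   # ['embarked_S'] -> shape mismatch at predict time
```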
=> Workarounds prevent correct use of pipelines

???
and this is a problem, for all the reasons I have been giving on why you should use pipelines

---

## Problems encountered with tabular data

With scikit-learn 0.19:

* We cannot fully use scikit-learn pipelines
* Scikit-learn cannot handle categorical variables
* No easy way to apply different transformations on numerical / categorical columns
* Loss of index/columns information of the DataFrame
* Just not easy

???
summary of pain points
But we have been working on improving this in the upcoming sklearn 0.20 release

---
class: center, middle

# New developments (0.20) and contrib packages

---

## Encoding categorical features

Converting discrete features to (several) numerical features

For example, one-hot or dummy encoding:

.center[
![:scale 70%](img/onehot.svg)
]

---

## OneHotEncoder and OrdinalEncoder

`OneHotEncoder` was improved to handle categorical string features ([PR 9151](https://github.com/scikit-learn/scikit-learn/pull/9151))

```python
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit_transform(X_train[['pclass', 'sex', 'embarked']])
```

.small[
```
array([[ 0.,  0.,  1., ...,  1.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  1.,  0.],
       [ 1.,  0.,  0., ...,  1.,  0.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  0.,  0.,  1.],
       [ 0.,  1.,  0., ...,  0.,  0.,  1.],
       [ 1.,  0.,  0., ...,  1.,  0.,  0.]])
```
]

http://scikit-learn.org/dev/modules/preprocessing.html#encoding-categorical-features

???
still returns a plain array, not a DataFrame
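---

## OrdinalEncoder

`OrdinalEncoder` was added in the same release; it encodes each category as an integer instead of a dummy column. A minimal sketch (same `X_train` as above; missing values would need to be imputed first):

```python
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
oe.fit_transform(X_train[['pclass', 'sex', 'embarked']])
# one integer-coded column per feature,
# e.g. 'sex' -> 0./1., 'embarked' -> 0./1./2.
```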
---

## Category Encoders contrib package

```python
from `category_encoders` import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform(X_train[['pclass', 'sex', 'embarked']])
```

|     | sex_1 | sex_2 | embarked_1 | embarked_2 | embarked_3 | pclass |
|-----|-------|-------|------------|------------|------------|--------|
| 927 | 1     | 0     | 1          | 0          | 0          | 3      |
| 703 | 1     | 0     | 0          | 1          | 0          | 3      |
| ... | ...   | ...   | ...        | ...        | ...        | ...    |
| 384 | 1     | 0     | 0          | 0          | 1          | 2      |
| 169 | 0     | 1     | 1          | 0          | 0          | 1      |
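The same package exposes other encodings through the same interface, for example target (mean) encoding; a sketch (assuming the binary `y_train` from before):

```python
from category_encoders import TargetEncoder

# replaces each category by a smoothed mean of the target for that category
te = TargetEncoder()
X_target_encoded = te.fit_transform(X_train[['pclass', 'sex', 'embarked']], y_train)
```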
Want more categorical encoders? http://contrib.scikit-learn.org/categorical-encoding/

???
There are many more possible ways to convert your categorical variables into numeric features suited to feed into models. Category Encoders is a scikit-learn-contrib package that provides a whole suite of scikit-learn compatible transformers for different types of categorical encodings (binary encoder, target encoder, ...).

And still: we want to apply this to only a subset of the columns, and we want to put it in a pipeline.

---

## ColumnTransformer

`ColumnTransformer`: applying different transformations to different features in a scikit-learn pipeline ([PR 9012](https://github.com/scikit-learn/scikit-learn/pull/9012))

```python
make_column_transformer(
    (column_selection, transformer),
    (column_selection, transformer),
    ...
)
```

---
count: false

## ColumnTransformer

`ColumnTransformer`: applying different transformations to different features in a scikit-learn pipeline ([PR 9012](https://github.com/scikit-learn/scikit-learn/pull/9012))

```python
make_column_transformer(
    (`column_selection`, transformer),
    (`column_selection`, transformer),
    ...
)
```

Selecting columns:

* List of column names (e.g. `['col1', 'col2']`)
* Boolean mask (e.g. `df.dtypes == object`)
* Function that returns one of those

(a combined sketch follows below)

???
a function is particularly useful for automated pipelines like automl (where you want to do the selection depending on the input data)

---
count: false

## ColumnTransformer

`ColumnTransformer`: applying different transformations to different features in a scikit-learn pipeline ([PR 9012](https://github.com/scikit-learn/scikit-learn/pull/9012))

```python
make_column_transformer(
    (column_selection, `transformer`),
    (column_selection, `transformer`),
    ...
)
```

Any transformer (or pipeline of transformers); it gets passed the selected subset of the data

---
count: false

## ColumnTransformer

`ColumnTransformer`: applying different transformations to different features in a scikit-learn pipeline ([PR 9012](https://github.com/scikit-learn/scikit-learn/pull/9012))

```python
make_column_transformer(
    (column_selection, transformer),
    (column_selection, transformer),
    ...
)
```

Fully integrated in scikit-learn (cross-validation, parameter optimization, ...)
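---

## ColumnTransformer: selecting columns

The selection styles combined in one minimal sketch (assuming the titanic `X` from before):

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

categorical = X.dtypes == object              # boolean mask

preprocessor = make_column_transformer(
    (['age', 'fare'], StandardScaler()),      # list of column names
    (categorical, OneHotEncoder()))           # boolean mask
```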
---

## ColumnTransformer: titanic example

```python
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numerical_features = ['age', 'fare']
numerical_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler())

categorical_features = ['pclass', 'sex', 'embarked']
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder())

preprocessor = make_column_transformer(
    (numerical_features, numerical_transformer),
    (categorical_features, categorical_transformer))
```

SimpleImputer support for categorical features: [PR 11211](https://github.com/scikit-learn/scikit-learn/pull/11211)

???
fully integrated in scikit-learn, e.g. parameter optimization with grid search (for all steps in the pipeline)

---
count: false

## ColumnTransformer: titanic example
```python
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

*numerical_features = ['age', 'fare']
numerical_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler())

*categorical_features = ['pclass', 'sex', 'embarked']
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder())

preprocessor = make_column_transformer(
    (numerical_features, numerical_transformer),
    (categorical_features, categorical_transformer))
```

SimpleImputer support for categorical features: [PR 11211](https://github.com/scikit-learn/scikit-learn/pull/11211)

---
count: false

## ColumnTransformer: titanic example
```python
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numerical_features = ['age', 'fare']
*numerical_transformer = make_pipeline(
*    SimpleImputer(strategy='median'),
*    StandardScaler())

categorical_features = ['pclass', 'sex', 'embarked']
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder())

preprocessor = make_column_transformer(
    (numerical_features, numerical_transformer),
    (categorical_features, categorical_transformer))
```

SimpleImputer support for categorical features: [PR 11211](https://github.com/scikit-learn/scikit-learn/pull/11211)

---
count: false

## ColumnTransformer: titanic example
```python
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numerical_features = ['age', 'fare']
numerical_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler())

categorical_features = ['pclass', 'sex', 'embarked']
*categorical_transformer = make_pipeline(
*    SimpleImputer(strategy='constant'),
*    OneHotEncoder())

preprocessor = make_column_transformer(
    (numerical_features, numerical_transformer),
    (categorical_features, categorical_transformer))
```

SimpleImputer support for categorical features: [PR 11211](https://github.com/scikit-learn/scikit-learn/pull/11211)

---
count: false

## ColumnTransformer: titanic example
```python
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numerical_features = ['age', 'fare']
numerical_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler())

categorical_features = ['pclass', 'sex', 'embarked']
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder())

*preprocessor = make_column_transformer(
*    (numerical_features, numerical_transformer),
*    (categorical_features, categorical_transformer))
```

SimpleImputer support for categorical features: [PR 11211](https://github.com/scikit-learn/scikit-learn/pull/11211)

???
this is not yet a full pipeline (for that, we need to combine it with a final estimator in another pipeline), but we can already do our preprocessing with this

---

## ColumnTransformer: titanic example

```python
preprocessor.fit_transform(X_train)
```

.small[
```
array([[-0.395,  0.091,  0., 1., 0., 1., 0., 1., 0., 0., 0., 1.],
       [-0.086, -0.489,  0., 0., 1., 1., 0., 0., 1., 0., 0., 0.],
       [ 1.226, -0.113,  1., 0., 0., 0., 1., 0., 0., 1., 0., 0.],
       [-0.472, -0.434,  0., 1., 0., 0., 1., 0., 0., 1., 0., 0.],
       [ 0.917, -0.344,  0., 1., 0., 0., 1., 1., 0., 0., 0., 0.]]
      ...)
```
]

--
count: false

Shout-out to https://github.com/scikit-learn-contrib/sklearn-pandas, which already had a similar feature: `DataFrameMapper` (with the ability to return a DataFrame as output)

---

## Full pipeline

```python
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X_train, y_train)
model.predict(X_test)
```

.small[
```
array([0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, ...])
```
]

```python
model.named_steps['logisticregression'].coef_
```

.small[
```
array([[-0.390,  0.061,  0.866, -0.015, -0.771,  1.258, -1.179,  0.331,
        -0.068, -0.183, -0.297]])
```
]

???
This is great. But, as noted before, we have now lost the information about the feature names. And especially in combination with the ColumnTransformer (which can reorder the original columns), it can be hard to see which coefficient corresponds to which feature.

One way to solve this is an improved get_feature_names (a manual sketch follows on the next slide).
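---

## Recovering the feature names by hand

A sketch of what is already possible in 0.20: the fitted `OneHotEncoder` provides `get_feature_names()`, so the columns of the transformed matrix can be reconstructed manually. The transformer name `'pipeline-2'` assumes the auto-generated names of `make_column_transformer` (second unnamed `Pipeline`).

```python
# dig the fitted OneHotEncoder out of the categorical sub-pipeline
ohe = (preprocessor.named_transformers_['pipeline-2']
                   .named_steps['onehotencoder'])

feature_names = (numerical_features
                 + list(ohe.get_feature_names(categorical_features)))
# feature_names now lines up with model.named_steps['logisticregression'].coef_
```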
---

## Ibex: pandas adapters for scikit-learn

Transforms estimators to pandas-aware estimators

https://atavory.github.io/ibex/

```python
from `ibex.`sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_encoded, y_train)
```

.small[
```
Adapter[LogisticRegression](C=1.0, class_weight=None, dual=False,
    fit_intercept=True, intercept_scaling=1, max_iter=100,
    multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
    solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
```
]

---

## Ibex: pandas adapters for scikit-learn

```python
from `ibex.`sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_encoded, y_train)
```

```python
model.coef_
```

|   | pclass    | age      | sibsp     | fare     | sex_female | sex_male  | embarked_C | embarked_Q | embarked_S |
|---|-----------|----------|-----------|----------|------------|-----------|------------|------------|------------|
| 0 | -0.822002 | -0.45996 | -0.284547 | 0.016556 | 0.582653   | -0.582653 | 0.179525   | -0.044348  | -0.127387  |
```python
model.predict(X_test_encoded)
```

.small[
```
1095    1
1023    1
        ..
1254    0
397     0
Length: 328, dtype: int64
```
]

.center[
.small[https://atavory.github.io/ibex/]
]

---
class: center, middle

# Wrap-up

---

## "Data science hello world" example

```python
categorical = X.dtypes == object

numerical_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler())
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder())

preprocessor = make_column_transformer(
    (~categorical, numerical_transformer),
    (categorical, categorical_transformer))

model = make_pipeline(preprocessor, LogisticRegression())
```

--
count: false

Pipelines with heterogeneous data: *possible* to construct them explicitly

---

## Scikit-learn with pandas DataFrames

In scikit-learn 0.20:

* (Basic) support for categorical features (encoders, imputers) ![:scale 5%](img/ok.jpg)
* Apply different transformers to different columns (ColumnTransformer) ![:scale 5%](img/ok.jpg)

To be further discussed .xsmall[(in decreasing order of likelihood)]:

* Feature names based on column names (`preprocessor.get_feature_names`)?
* DataFrame in / DataFrame out for transformers?
* Use data type information?
* Preserve index/columns information in predictors?
* Automatic detection of categorical columns?

???
Improved:
* OneHotEncoder and SimpleImputer support for categorical variables
* ColumnTransformer

Making pandas a first-class citizen?

---
class: middle

## Thanks to the scikit-learn community!

# Questions? Feedback?

These slides: [jorisvandenbossche.github.io/talks/2018_Scipy_sklearn_pandas](http://jorisvandenbossche.github.io/talks/2018_Scipy_sklearn_pandas)

???
Thanks to:
- Paris-Saclay Center for Data Science (Inria)
- sklearn community