bff.concat_with_categories

bff.concat_with_categories(df_left, df_right, **kwargs)

Concatenate pandas DataFrames that have categorical columns.

With the concat function from pandas, concatenating two DataFrames whose categorical columns do not have identical categories loses the categorical dtype: the affected columns are cast to object, which takes more memory.

This function takes the union of the categories from both DataFrames and recategorizes both DataFrames with the complete list of categories before the concatenation. This way, the category dtype is preserved.
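
The approach described above can be sketched as follows. This is a minimal illustration, not the actual bff source: the name concat_preserving_categories is hypothetical, and the sketch assumes unordered categoricals and only handles columns that are categorical in both DataFrames.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concat_preserving_categories(df_left, df_right, **kwargs):
    """Hypothetical sketch of the technique, not bff's implementation."""
    # Work on copies so the caller's DataFrames are left untouched.
    df_left, df_right = df_left.copy(), df_right.copy()
    for col in df_left.columns.intersection(df_right.columns):
        both_categorical = (
            isinstance(df_left[col].dtype, pd.CategoricalDtype)
            and isinstance(df_right[col].dtype, pd.CategoricalDtype)
        )
        if not both_categorical:
            continue
        # Union of the categories found on either side.
        merged = union_categoricals([df_left[col], df_right[col]])
        # Recategorize both columns with the complete list of categories.
        df_left[col] = df_left[col].cat.set_categories(merged.categories)
        df_right[col] = df_right[col].cat.set_categories(merged.categories)
    # Extra keyword arguments are forwarded to pd.concat unchanged.
    return pd.concat([df_left, df_right], **kwargs)
```

Because both columns share the same categorical dtype by the time pd.concat runs, pandas has no reason to fall back to object.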

The original DataFrames are copied and therefore left unmodified.

Parameters
  • df_left (pd.DataFrame) – Left DataFrame to concatenate.

  • df_right (pd.DataFrame) – Right DataFrame to concatenate.

  • **kwargs – Additional keyword arguments to be passed to the pd.concat function.

Returns

Concatenation of both DataFrames.

Return type

pd.DataFrame

Examples

>>> import pandas as pd
>>> column_types = {'name': 'object',
...                 'color': 'category',
...                 'country': 'category'}
>>> columns = list(column_types.keys())
>>> df_left = pd.DataFrame([['John', 'red', 'China'],
...                         ['Jane', 'blue', 'Switzerland']],
...                        columns=columns).astype(column_types)
>>> df_right = pd.DataFrame([['Mary', 'yellow', 'France'],
...                          ['Fred', 'blue', 'Italy']],
...                         columns=columns).astype(column_types)
>>> df_left
   name color      country
0  John   red        China
1  Jane  blue  Switzerland
>>> df_left.dtypes
name         object
color      category
country    category
dtype: object

The following concatenation shows the issue when using the concat function from pandas:

>>> res_fail = pd.concat([df_left, df_right], ignore_index=True)
>>> res_fail
   name   color      country
0  John     red        China
1  Jane    blue  Switzerland
2  Mary  yellow       France
3  Fred    blue       Italy
>>> res_fail.dtypes
name       object
color      object
country    object
dtype: object

The categorical columns are cast back to object because the two DataFrames do not have the same categories.
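
The memory cost of falling back to object can be measured with memory_usage(deep=True). The following standalone sketch uses a hypothetical column of repeated strings rather than the DataFrames above, since the difference only becomes visible with more rows:

```python
import pandas as pd

# Hypothetical data: three distinct strings repeated many times.
values = ['red', 'blue', 'yellow'] * 10_000
as_object = pd.Series(values, dtype='object')
as_category = as_object.astype('category')

# deep=True also counts the Python string objects behind an object column.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The categorical version stores small integer codes plus one copy of each distinct value, so it should report far fewer bytes than the object version.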

With concat_with_categories, the categorical dtype is preserved:

>>> from bff import concat_with_categories
>>> res_ok = concat_with_categories(df_left, df_right, ignore_index=True)
>>> res_ok
   name   color      country
0  John     red        China
1  Jane    blue  Switzerland
2  Mary  yellow       France
3  Fred    blue       Italy
>>> res_ok.dtypes
name         object
color      category
country    category
dtype: object