bff.concat_with_categories
bff.concat_with_categories(df_left, df_right, **kwargs)
Concatenation of pandas DataFrames having categorical columns.
With the concat function from pandas, concatenating two DataFrames whose categorical columns do not have identical categories loses the categorical dtype. The affected columns are cast to object, which takes more memory.
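The memory difference can be checked directly. The snippet below is a minimal sketch on made-up data (the `values` list is hypothetical, not taken from bff); the saving depends on having few distinct values repeated over many rows.

```python
import pandas as pd

# Hypothetical data: many rows, only three distinct values.
values = ['red', 'blue', 'yellow'] * 10_000

as_object = pd.Series(values, dtype='object')
as_category = as_object.astype('category')

# deep=True also counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```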
This function takes the union of the categories from both DataFrames and recategorizes each categorical column with the complete list of categories before concatenating. This way, the category dtype is preserved.
The original DataFrames are copied, so they are left unchanged.
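The approach described above can be sketched with public pandas calls. This is an illustrative re-implementation under assumptions, not the bff source: the name `concat_with_categories_sketch` is hypothetical, and it assumes unordered categoricals and only handles columns that are categorical in both DataFrames.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concat_with_categories_sketch(df_left, df_right, **kwargs):
    """Hypothetical sketch of the technique, not the bff implementation."""
    # Work on copies so the caller's DataFrames stay untouched.
    left, right = df_left.copy(), df_right.copy()
    for col in left.columns:
        both_categorical = (
            col in right.columns
            and isinstance(left[col].dtype, pd.CategoricalDtype)
            and isinstance(right[col].dtype, pd.CategoricalDtype)
        )
        if not both_categorical:
            continue
        # Union of the categories found on either side.
        categories = union_categoricals([left[col], right[col]]).categories
        # Give both columns the same category list so concat keeps the dtype.
        left[col] = left[col].cat.set_categories(categories)
        right[col] = right[col].cat.set_categories(categories)
    # Extra keyword arguments are forwarded to pd.concat unchanged.
    return pd.concat([left, right], **kwargs)
```

Because both columns end up with the same categories in the same order, `pd.concat` sees matching categorical dtypes and does not fall back to object.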
- Parameters
df_left (pd.DataFrame) – Left DataFrame to merge.
df_right (pd.DataFrame) – Right DataFrame to merge.
**kwargs – Additional keyword arguments to be passed to the pd.concat function.
- Returns
Concatenation of both DataFrames.
- Return type
pd.DataFrame
Examples
>>> import pandas as pd
>>> column_types = {'name': 'object',
...                 'color': 'category',
...                 'country': 'category'}
>>> columns = list(column_types.keys())
>>> df_left = pd.DataFrame([['John', 'red', 'China'],
...                         ['Jane', 'blue', 'Switzerland']],
...                        columns=columns).astype(column_types)
>>> df_right = pd.DataFrame([['Mary', 'yellow', 'France'],
...                          ['Fred', 'blue', 'Italy']],
...                         columns=columns).astype(column_types)
>>> df_left
   name  color      country
0  John    red        China
1  Jane   blue  Switzerland
>>> df_left.dtypes
name         object
color      category
country    category
dtype: object
The following concatenation shows the issue when using the concat function from pandas:
>>> res_fail = pd.concat([df_left, df_right], ignore_index=True)
>>> res_fail
   name   color      country
0  John     red        China
1  Jane    blue  Switzerland
2  Mary  yellow       France
3  Fred    blue        Italy
>>> res_fail.dtypes
name       object
color      object
country    object
dtype: object
The categorical columns fall back to object because the two DataFrames do not share the same categories.
With this custom implementation, the categorical type is preserved:
>>> from bff import concat_with_categories
>>> res_ok = concat_with_categories(df_left, df_right, ignore_index=True)
>>> res_ok
   name   color      country
0  John     red        China
1  Jane    blue  Switzerland
2  Mary  yellow       France
3  Fred    blue        Italy
>>> res_ok.dtypes
name         object
color      category
country    category
dtype: object