Reputation: 2024
I am attempting to run a pandas groupby on my DataFrame.
import pandas as pd
# Groupby
cols = ['col1','col2']
df = df.groupby(cols, as_index=False).agg({'r': lambda x: list(x)})
Some columns may have dicts
as values. I attempt to find them and convert to string like this:
# Find columns containing dicts
col_dict = [i for i in df.columns if isinstance(dfd[i][0],dict)]
col_dict
# Dict to string
df[col_dict] = df[col_dict].astype(str)
How do I groupby a dataframe with columns containing dicts
data type?
Upvotes: 0
Views: 2656
Reputation: 2374
Given the title and the actual question you asked, I assume you are doing something like the following.
import pandas as pd
df = pd.DataFrame(
{
"col1": [{"k": 1, "k2": 2}, {"k": 3, "k2": 4}, {"k": 1, "k2": 2}, {"k": 3, "k2": 4}],
"col2": [42, 88, 86, 75],
"r": [3.1415, 2.71, 1.121, 0.333],
"cat": ["a", "a", "b", "b"]
}
)
df = df.groupby(by=["col1"], as_index=False).agg({'r': lambda x: list(x)})
which for me gives the error you mention, TypeError: unhashable type: 'dict'
.
One way you could do this, as you've started to already, is to convert the dictionaries to strings. Here is how this could be done.
df = df.assign(
group=lambda DF: DF.col1.astype(str)
).groupby(
by=["group"],
as_index=False
).agg(
{'r': lambda x: list(x)}
)
print(df)
which produces
group r
0 {'k': 1, 'k2': 2} [3.1415, 1.121]
1 {'k': 3, 'k2': 4} [2.71, 0.333]
As to why grouping by the dictionaries directly raises the error, it seems that under the hood Pandas is using a hash table to create the group keys. Therefore, the values that are used for the group, must be hashable, which dictionaries are not. Converting them to strings would make them hashable. Another way would be to convert to a frozenset of (key, value)
pairs, which matches the fact that dictionaries are equal if they have the same set of (key, value)
pairs, regardless of order (see the docs).
This actually raises an interesting "gotcha". If your dictionaries happen to have the same (key, value)
pairs, but in different orders, then converting them to strings may not work as expected. For example,
df = pd.DataFrame(
{
"col1": [{"k": 1, "k2": 2}, {"k": 3, "k2": 4}, {"k2": 2, "k": 1}, {"k": 3, "k2": 4}],
"col2": [42, 88, 86, 75],
"r": [3.1415, 2.71, 1.121, 0.333],
"cat": ["a", "a", "b", "b"]
}
)
df = df.assign(
group=lambda DF: DF.col1.astype(str)
).groupby(
by=["group"],
as_index=False
).agg(
{'r': lambda x: list(x)}
)
print(df)
produces
group r
0 {'k': 1, 'k2': 2} [3.1415]
1 {'k': 3, 'k2': 4} [2.71, 0.333]
2 {'k2': 2, 'k': 1} [1.121]
Notice here that the dictionaries {'k': 1, 'k2': 2}
and {'k2': 2, 'k1'}
have the same (key, value)
pairs but are considered different groups, even though
d1 = {"k": 1, "k2": 2}
d2 = {"k2": 2, "k": 1}
print(d1 == d2)
prints True
A more robust way to group by dictionaries that would create groups when the dictionaries are equal would be
df = df.assign(
group=lambda DF: DF.col1.apply(lambda x: frozenset(x.items()))
).groupby(
by=["group"],
as_index=False
).agg(
{'r': lambda x: list(x)}
)
Upvotes: 3