Willi Müller
Willi Müller

Reputation: 651

Pandas: Group by key of dict in column which contains dictionaries

My data

I have the following pandas data frame:

df = pd.DataFrame({
    'c1': range(5),
    'c2': [
        {'k1': 'x-1', 'k2': 'z'}, 
        {'k1': 'x-2', 'k2': 'z1'},
        {'k1': 'x-3', 'k2': 'z1'},
        {'k1': 'y-1', 'k2': 'z'},
        {'k1': 'y-2', 'k2': 'z1'}
    ]
})

My aim

Now, I would like to group by 'k1', which is a common key in all rows of column 'c2', which contains dictionaries. The grouping function would be lambda x: x.split('-')[0] to cut off the numbers following the dash.

The desired output is:

'x'     3
'y'     2   

Attempts

>>> df.groupby(df['c2']['k1'].str.split('-')[0]).count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "/Library/Python/2.7/site-packages/pandas/core/indexes/base.py", line 2477, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4404)
  File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4087)
  File "pandas/_libs/index.pyx", line 156, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5210)
KeyError: 'k1'

Apparently, I cannot index the key k1 of the row c2 by df['c2']['k1'].

How could I do this?

Upvotes: 1

Views: 1777

Answers (1)

jezrael
jezrael

Reputation: 862511

You are close, only need convert column with dicts to new DataFrame:

print (pd.DataFrame(df['c2'].values.tolist()))
    k1  k2
0  x-1   z
1  x-2  z1
2  x-3  z1
3  y-1   z
4  y-2  z1

a = pd.DataFrame(df['c2'].values.tolist())['k1'].str.split('-').str[0]
print (a)
0    x
1    x
2    x
3    y
4    y
Name: k1, dtype: object

df = df.groupby(a).size().reset_index(name='len')
print (df)
  k1  len
0  x    3
1  y    2

Another solution is use list comprehension for groupby keys:

L = [x['k1'].split('-')[0] for x in df['c2']]
print (L)
['x', 'x', 'x', 'y', 'y']

df = df.groupby(L).size().rename_axis('k1').reset_index(name='len')
print (df)
  k1  len
0  x    3
1  y    2

What is the difference between size and count in pandas?

Solutions with value_counts:

df = a.value_counts().rename_axis('k1').reset_index(name='len')
print (df)
  k1  len
0  x    3
1  y    2

df = pd.Series(L).value_counts().rename_axis('k1').reset_index(name='len')
print (df)
  k1  len
0  x    3
1  y    2

Upvotes: 2

Related Questions