Reputation: 21574
I've got a simple dataframe with a column populated by a python dictionary, in the form:
User CLang
111 {u'en': 1}
112 {u'en': 1, u'es': 1}
112 {u'en': 1, u'es': 1}
113 {u'zh': 1, u'ja': 1, u'es': 2}
113 {u'zh': 1, u'ja': 1, u'es': 2}
113 {u'zh': 1, u'ja': 1, u'es': 2}
114 {u'es': 1}
113 {u'zh': 1, u'ja': 1, u'es': 2}
The CLang
column contains the frequency of different values for each user. How may I have access to single keys and values of the CLang
column? For instance I would like to groupby the User
and the most frequent value inside the dictionary, in a form like:
g = df.groupby(['User','CLang')
counting then the number of occurrences for each value:
d = g.size().unstack().fillna(0)
The resulting dataframe would appear as:
DLang en es
User
111 1 0
112 1 1
113 0 4
114 0 1
Upvotes: 3
Views: 11909
Reputation: 16174
This, I guess, would be self-explanatory:
> In [413]: x
Out[413]:
CLang User
0 {u'en': 1} 111
1 {u'en': 1, u'es': 1} 112
2 {u'en': 1, u'es': 1} 112
3 {u'zh': 1, u'ja': 1, u'es': 2} 113
4 {u'zh': 1, u'ja': 1, u'es': 2} 113
5 {u'zh': 1, u'ja': 1, u'es': 2} 113
6 {u'es': 1} 114
7 {u'zh': 1, u'ja': 1, u'es': 2} 113
In [414]: paste
temp = pd.DataFrame(index=x['User'])
for i in x.iterrows():
clan = i[1]['CLang']
user = i[1]['User']
for j in clan.iteritems():
if j[0] not in temp.columns:
temp[j[0]] = pd.np.NaN
temp[j[0]][user] = j[1]
else:
temp[j[0]][user] += j[1]
print temp
## -- End pasted text --
en es zh ja
User
111 1 NaN NaN NaN
112 NaN 2 NaN NaN
112 NaN 2 NaN NaN
113 NaN NaN 4 4
113 NaN NaN 4 4
113 NaN NaN 4 4
114 NaN NaN NaN NaN
113 NaN NaN 4 4
In [415]: temp.reset_index().groupby('User').sum()
Out[415]:
en es zh ja
User
111 1 NaN NaN NaN
112 NaN 4 NaN NaN
113 NaN NaN 16 16
114 NaN NaN NaN NaN
In [416]:
Upvotes: 1
Reputation: 3297
I'm not completely sure I understood correctly what you want your output to be and also I don't think using dict
in pandas.DataFrame
is a very good idea in general.
Reshaping your DataFrame
to something more pandas-like would be better, you would then be able to use pandas
methods to solve this problem.
Anyway, if you really want to do it, here's a (not very elegant) way:
In [1]: import pandas as pd
In [2]: l1 = [111, 112, 112, 113, 113, 113, 114, 113]
In [3]: l2 = [{'en': 1},
{'en': 1, 'es': 1},
{'en': 1, 'es': 1},
{'es': 2, 'ja': 1, 'zh': 1},
{'es': 2, 'ja': 1, 'zh': 1},
{'es': 2, 'ja': 1, 'zh': 1},
{'es': 1},
{'es': 2, 'ja': 1, 'zh': 1}]
In [4]: df = pd.DataFrame({'User': l1, 'CLang': l2})
In [5]: df
Out[5]:
User CLang
0 111 {u'en': 1}
1 112 {u'en': 1, u'es': 1}
2 112 {u'en': 1, u'es': 1}
3 113 {u'zh': 1, u'ja': 1, u'es': 2}
4 113 {u'zh': 1, u'ja': 1, u'es': 2}
5 113 {u'zh': 1, u'ja': 1, u'es': 2}
6 114 {u'es': 1}
7 113 {u'zh': 1, u'ja': 1, u'es': 2}
In [6]: def whatever(row):
....: tmp_d = {}
....: for d in row.values:
....: for k in d.keys():
....: if k in tmp_d.keys():
....: tmp_d[k] += 1
....: else:
....: tmp_d[k] = 1
....: return tmp_d
In [7]: new_df = df.groupby('User')['CLang'].apply(whatever).unstack().fillna(0)
In [8]: new_df
Out[8]:
en es ja zh
User
111 1 0 0 0
112 2 2 0 0
113 0 4 4 4
114 0 1 0 0
If you then want to know what was the CLang
with more occurrences you can, also not very elegantly since list
in DataFrame
should be avoided, do:
In [9]: def whatever2(row):
....: tmp_d = {}
....: for i, v in zip(row.index, row.values):
....: if v in tmp_d.keys():
....: tmp_d[v].append(i)
....: else:
....: tmp_d[v] = [i]
....: highest = max(tmp_d.keys())
....: return tmp_d[highest]
In [10]: new_df['Most_Used_CLang'] = new_df.apply(whatever2, axis=1)
In [11]: new_df
Out[11]:
en es ja zh Most_Used_CLang
User
111 1 0 0 0 [en]
112 2 2 0 0 [en, es]
113 0 4 4 4 [es, ja, zh]
114 0 1 0 0 [es]
Upvotes: 0