Fabio Lamanna
Fabio Lamanna

Reputation: 21574

Access dictionary keys and values in pandas dataframe column

I've got a simple dataframe with a column populated by a python dictionary, in the form:

User                          CLang
111                      {u'en': 1}
112            {u'en': 1, u'es': 1}
112            {u'en': 1, u'es': 1}
113  {u'zh': 1, u'ja': 1, u'es': 2}
113  {u'zh': 1, u'ja': 1, u'es': 2}
113  {u'zh': 1, u'ja': 1, u'es': 2}
114                      {u'es': 1}
113  {u'zh': 1, u'ja': 1, u'es': 2}

The CLang column contains the frequency of different values for each user. How may I have access to single keys and values of the CLang column? For instance I would like to groupby the User and the most frequent value inside the dictionary, in a form like:

g = df.groupby(['User','CLang')

counting then the number of occurrences for each value:

d = g.size().unstack().fillna(0)

The resulting dataframe would appear as:

DLang  en  es
User                 
111     1   0
112     1   1
113     0   4
114     0   1

Upvotes: 3

Views: 11909

Answers (2)

fixxxer
fixxxer

Reputation: 16174

This, I guess, would be self-explanatory:

> In [413]: x
Out[413]: 
                            CLang  User
0                      {u'en': 1}   111
1            {u'en': 1, u'es': 1}   112
2            {u'en': 1, u'es': 1}   112
3  {u'zh': 1, u'ja': 1, u'es': 2}   113
4  {u'zh': 1, u'ja': 1, u'es': 2}   113
5  {u'zh': 1, u'ja': 1, u'es': 2}   113
6                      {u'es': 1}   114
7  {u'zh': 1, u'ja': 1, u'es': 2}   113

In [414]: paste
temp = pd.DataFrame(index=x['User'])
for i in x.iterrows():
        clan = i[1]['CLang']
        user = i[1]['User']
        for j in clan.iteritems():
                if j[0] not in temp.columns:
                        temp[j[0]] = pd.np.NaN
                        temp[j[0]][user] = j[1]
                else:
                        temp[j[0]][user] += j[1]

print temp

## -- End pasted text --
      en  es  zh  ja
User                
111    1 NaN NaN NaN
112  NaN   2 NaN NaN
112  NaN   2 NaN NaN
113  NaN NaN   4   4
113  NaN NaN   4   4
113  NaN NaN   4   4
114  NaN NaN NaN NaN
113  NaN NaN   4   4

In [415]: temp.reset_index().groupby('User').sum()
Out[415]: 
      en  es  zh  ja
User                
111    1 NaN NaN NaN
112  NaN   4 NaN NaN
113  NaN NaN  16  16
114  NaN NaN NaN NaN

In [416]: 

Upvotes: 1

paulo.filip3
paulo.filip3

Reputation: 3297

I'm not completely sure I understood correctly what you want your output to be and also I don't think using dict in pandas.DataFrame is a very good idea in general.

Reshaping your DataFrame to something more pandas-like would be better, you would then be able to use pandas methods to solve this problem.

Anyway, if you really want to do it, here's a (not very elegant) way:

In [1]: import pandas as pd
In [2]: l1 = [111, 112, 112, 113, 113, 113, 114, 113]
In [3]: l2 = [{'en': 1},
              {'en': 1, 'es': 1},
              {'en': 1, 'es': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 1},
              {'es': 2, 'ja': 1, 'zh': 1}]

In [4]: df = pd.DataFrame({'User': l1, 'CLang': l2})

In [5]: df
Out[5]: 
   User                           CLang
0   111                      {u'en': 1}
1   112            {u'en': 1, u'es': 1}
2   112            {u'en': 1, u'es': 1}
3   113  {u'zh': 1, u'ja': 1, u'es': 2}
4   113  {u'zh': 1, u'ja': 1, u'es': 2}
5   113  {u'zh': 1, u'ja': 1, u'es': 2}
6   114                      {u'es': 1}
7   113  {u'zh': 1, u'ja': 1, u'es': 2}

In [6]: def whatever(row):
  ....:     tmp_d = {}
  ....:     for d in row.values:
  ....:         for k in d.keys():
  ....:             if k in tmp_d.keys():
  ....:                 tmp_d[k] += 1
  ....:             else:
  ....:                 tmp_d[k] = 1
  ....:     return tmp_d

In [7]: new_df = df.groupby('User')['CLang'].apply(whatever).unstack().fillna(0)

In [8]: new_df
Out[8]: 
      en  es  ja  zh
User                
111    1   0   0   0
112    2   2   0   0
113    0   4   4   4
114    0   1   0   0

If you then want to know what was the CLang with more occurrences you can, also not very elegantly since list in DataFrame should be avoided, do:

In [9]: def whatever2(row):                                                  
  ....:     tmp_d = {}
  ....:     for i, v in zip(row.index, row.values):
  ....:         if v in tmp_d.keys():
  ....:             tmp_d[v].append(i)
  ....:         else:
  ....:             tmp_d[v] = [i]
  ....:     highest = max(tmp_d.keys())
  ....:     return tmp_d[highest]

In [10]: new_df['Most_Used_CLang'] = new_df.apply(whatever2, axis=1)
In [11]: new_df
Out[11]:
      en  es  ja  zh Most_Used_CLang
User                                
111    1   0   0   0            [en]
112    2   2   0   0        [en, es]
113    0   4   4   4    [es, ja, zh]
114    0   1   0   0            [es]

Upvotes: 0

Related Questions