Fabio Lamanna
Fabio Lamanna

Reputation: 21552

pandas - create dataframe with counts and frequency of elements

Starting from the following dataframe df:

df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})

I'm trying to build the structure:

    node     langs   lfreq
0      1      [it]     [1]
1      2      [en]     [1]
2      3  [ar, es]  [2, 1]
3      5  [uz, es]  [1, 1]

so basically grouping the lang elements and frequency per node into a single row through lists. What I've done so far:

# Getting the unique langs / node
a = df.groupby('node')['lang'].unique().reset_index(name='langs')

# Getting the frequency of lang / node
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')
c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')

and then merge on node:

d = pd.merge(a,c,on='node')

After this operations, what I obtained is:

    node     langs   lfreq
0      1      [it]     [1]
1      2      [en]     [1]
2      3  [ar, es]  [2, 1]
3      5  [uz, es]     [1]

As you may notice, the last row has only one [1] occurrence of the frequency of the two [uz, es] instead of a list of [1,1] as expected. Is there a way to perform the analysis in a more concise way obtaining the desired output?

Upvotes: 4

Views: 1172

Answers (3)

JohnE
JohnE

Reputation: 30424

In part because you mention (in a comment) the importance of speed for having 40 million rows, I would suggest looking at something closer to the following.

df.groupby(['node','lang'])['lang'].count()

node  lang
1     it      1
2     en      1
3     ar      2
      es      1
5     es      1
      uz      1

Generally speaking you're going to be better off with a flatter structure (zen of python) and more specifically you want your pandas/numpy columns to be of simple types (ints and floats), not objects.

On account of pandas methods like groupby, the above structure ought to more easily do things than if you store as lists, and is pretty much guaranteed to be faster, probably much faster. I'm assuming you want to use this structure for further processing, but even if not, it will be faster to tabulate the data in this way also.

Upvotes: 2

jezrael
jezrael

Reputation: 862601

You can apply np.unique with parameter return_counts=True:

df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
print df
  lang  node
0   it     1
1   en     2
2   ar     3
3   ar     3
4   es     3
5   uz     5
6   es     5

a = df.groupby('node')['lang'].apply(lambda x: np.unique(x, return_counts=True))
                              .reset_index(name='tup')

#split tuples
a[['langs','lfreq']] = a['tup'].apply(pd.Series)
#filter columns
print a[['node','langs','lfreq']]
   node     langs   lfreq
0     1      [it]     [1]
1     2      [en]     [1]
2     3  [ar, es]  [2, 1]
3     5  [es, uz]  [1, 1]

Upvotes: 2

dmb
dmb

Reputation: 1719

I would use the agg function and tolist()

df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
# Getting the unique langs / node
a = df.groupby('node')['lang'].unique().reset_index(name='langs')

# Getting the frequency of lang / node
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')

replace

c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')

with

c = b.groupby('node').agg({'lfreq': lambda x: x.tolist()}).reset_index()

d = pd.merge(a,c,on='node')

and viola:

   node     langs   lfreq
0     1      [it]     [1]
1     2      [en]     [1]
2     3  [ar, es]  [2, 1]
3     5  [uz, es]  [1, 1]

Upvotes: 3

Related Questions