Reputation: 21552
Starting from the following dataframe df:
df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
I'm trying to build the structure:
node langs lfreq
0 1 [it] [1]
1 2 [en] [1]
2 3 [ar, es] [2, 1]
3 5 [uz, es] [1, 1]
so basically grouping the lang elements and their frequency per node into a single row using lists. What I've done so far:
# Getting the unique langs / node
a = df.groupby('node')['lang'].unique().reset_index(name='langs')
# Getting the frequency of lang / node
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')
c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')
and then merge on node:
d = pd.merge(a,c,on='node')
After these operations, what I obtained is:
node langs lfreq
0 1 [it] [1]
1 2 [en] [1]
2 3 [ar, es] [2, 1]
3 5 [uz, es] [1]
As you may notice, the last row has only one [1] occurrence for the frequency of the two [uz, es], instead of the list [1, 1] as expected. Is there a way to perform the analysis more concisely and obtain the desired output?
Upvotes: 4
Views: 1172
Reputation: 30424
In part because you mention (in a comment) that speed is important since you have 40 million rows, I would suggest looking at something closer to the following.
df.groupby(['node','lang'])['lang'].count()
node lang
1 it 1
2 en 1
3 ar 2
es 1
5 es 1
uz 1
Generally speaking you're going to be better off with a flatter structure (Zen of Python), and more specifically you want your pandas/numpy columns to be of simple types (ints and floats), not objects.
With pandas methods like groupby, the flat structure above will be easier to work with than lists stored in cells, and it is pretty much guaranteed to be faster, probably much faster. I'm assuming you want to use this structure for further processing, but even if not, it will also be faster to tabulate the data this way.
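For instance, the flat result still supports further per-node queries directly (a sketch; the `counts` and `top` names and the `lfreq` label are my own choices, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'node': [1, 2, 3, 3, 3, 5, 5],
                   'lang': ['it', 'en', 'ar', 'ar', 'es', 'uz', 'es']})

# flat per-(node, lang) counts: one row per pair, plain int values
counts = df.groupby(['node', 'lang'])['lang'].count().rename('lfreq')

# simple-typed columns keep follow-up queries cheap,
# e.g. the most frequent lang per node:
top = counts.groupby('node').idxmax()
```

Here `top` maps each node to the (node, lang) index entry with the highest count, something that would require unpacking the lists first in the nested representation.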
Upvotes: 2
Reputation: 862601
You can apply np.unique with the parameter return_counts=True:
df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
print(df)
lang node
0 it 1
1 en 2
2 ar 3
3 ar 3
4 es 3
5 uz 5
6 es 5
import numpy as np

a = (df.groupby('node')['lang']
       .apply(lambda x: np.unique(x, return_counts=True))
       .reset_index(name='tup'))
#split tuples
a[['langs','lfreq']] = a['tup'].apply(pd.Series)
#filter columns
print(a[['node','langs','lfreq']])
node langs lfreq
0 1 [it] [1]
1 2 [en] [1]
2 3 [ar, es] [2, 1]
3 5 [es, uz] [1, 1]
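Note that np.unique returns its values sorted, which is why the last row comes back as [es, uz] rather than the input order [uz, es]. A minimal illustration:

```python
import numpy as np

# np.unique sorts the unique values; the counts follow that sorted order
vals, counts = np.unique(['uz', 'es', 'es'], return_counts=True)
print(vals)    # ['es' 'uz']
print(counts)  # [2 1]
```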
Upvotes: 2
Reputation: 1719
I would use the agg function and tolist()
df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
# Getting the unique langs / node
a = df.groupby('node')['lang'].unique().reset_index(name='langs')
# Getting the frequency of lang / node
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')
replace
c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')
with
c = b.groupby('node').agg({'lfreq': lambda x: x.tolist()}).reset_index()
d = pd.merge(a,c,on='node')
and voilà:
node langs lfreq
0 1 [it] [1]
1 2 [en] [1]
2 3 [ar, es] [2, 1]
3 5 [uz, es] [1, 1]
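If you prefer this as a single pipeline, the two groupbys above can be sketched in one pass (a sketch, not the original answer's code; the `vc` name and `langs` label are my choices, and for ties such as node 5 the in-list order follows value_counts, which sorts by frequency):

```python
import pandas as pd

df = pd.DataFrame({'node': [1, 2, 3, 3, 3, 5, 5],
                   'lang': ['it', 'en', 'ar', 'ar', 'es', 'uz', 'es']})

# per-(node, lang) frequencies as a flat frame
vc = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')

# collect both columns into lists per node
d = (vc.groupby('node')
       .agg({'lang': list, 'lfreq': list})
       .rename(columns={'lang': 'langs'})
       .reset_index())
```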
Upvotes: 3