Reputation: 172
I have a pandas dataframe that looks like this:
from to
a b
b a
c d
c d
d c
I want to find the count of each combination of from and to, regardless of the order, so I'll end up with something like:
places count
[a,b] 2
[c,d] 3
I'm struggling to find an effective way of achieving this. Any help would be much appreciated.
Upvotes: 3
Views: 301
Reputation: 164623
You can use collections.Counter for an O(n) solution:
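from collections import Counter
import pandas as pd
# frozenset makes each (from, to) pair order-insensitive and hashable
c = Counter(map(frozenset, zip(df['from'], df['to'])))
# Turn the counts into a DataFrame with one row per distinct pair
res = pd.DataFrame.from_dict(c, orient='index').reset_index()
print(res)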
# index 0
# 0 (a, b) 2
# 1 (c, d) 3
Note that the conversion to frozenset does two things: it makes (a, b) and (b, a) compare equal, and, unlike a plain set, a frozenset is hashable, which Counter requires of its keys. This should also be more efficient than a groupby solution.
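To illustrate why frozenset is used here, the following minimal sketch compares counting plain tuples with counting frozensets. The pairs list is a hand-written copy of the question's data, introduced only for this example.

```python
from collections import Counter

# Hand-written copy of the (from, to) pairs in the question
pairs = [('a', 'b'), ('b', 'a'), ('c', 'd'), ('c', 'd'), ('d', 'c')]

# Tuples are hashable but order-sensitive: (a, b) and (b, a) stay separate
by_tuple = Counter(pairs)

# frozensets compare equal regardless of element order, so reversed pairs merge
by_set = Counter(map(frozenset, pairs))

print(by_tuple[('a', 'b')])           # 1
print(by_set[frozenset({'a', 'b'})])  # 2
print(by_set[frozenset({'c', 'd'})])  # 3

# A mutable set cannot be used as a Counter key at all
try:
    Counter([{'a', 'b'}])
except TypeError as exc:
    print('set is unhashable:', exc)
```

One caveat of this approach: a row where from and to are equal collapses to a one-element frozenset.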
Upvotes: 2
Reputation: 18906
You can use value_counts() on the zipped columns, with each pair converted to a frozenset. Because a frozenset has no defined order, a pair may come out as ['d','c'] rather than ['c','d']. If you prefer the pairs sorted, use tuple(sorted(i)) for i in zip(...) instead of map(frozenset, ...). There seems to be a roughly 4x speed-up compared to the groupby solution. Update: the speed comparison is not really fair, as the two solutions do different things.
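from io import StringIO
import pandas as pd
data = '''\
from to
a b
b a
c d
c d
d c'''
# pd.compat.StringIO was removed from later pandas releases; io.StringIO is the standard-library equivalent
df = pd.read_csv(StringIO(data), sep=r'\s+')
# Count order-insensitive pairs
out = pd.Series(map(frozenset, zip(df['from'], df['to']))).value_counts().reset_index()
# The count column is named 0 or 'count' depending on the pandas version; rename ignores missing keys
out.rename(columns={'index': 'places', 0: 'count'}, inplace=True)
print(out)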
And you get:
places count
0 (d, c) 3
1 (a, b) 2
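The sorted-tuple variant mentioned above can be sketched as follows. This is only an illustration under the assumption that the data matches the question. The DataFrame is rebuilt inline so the snippet runs on its own, and the result is assembled from the index explicitly so that it does not depend on how a given pandas version names the columns after reset_index().

```python
import pandas as pd

# Same data as in the question, rebuilt inline so the snippet is self-contained
df = pd.DataFrame({'from': ['a', 'b', 'c', 'c', 'd'],
                   'to':   ['b', 'a', 'd', 'd', 'c']})

# Sort each pair so that (b, a) and (a, b) map to the same tuple key
keys = pd.Series([tuple(sorted(pair)) for pair in zip(df['from'], df['to'])])

# Count the keys and order the result by pair rather than by frequency
counts = keys.value_counts().sort_index()

# Build the output explicitly from the index and the counts
out = pd.DataFrame({'places': list(counts.index), 'count': counts.to_numpy()})
print(out)
```

Unlike the frozenset version, every pair is always displayed in sorted order, for example (c, d) and never (d, c).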
Time comparison:
%timeit pd.Series(map(frozenset,zip(df['from'],df['to']))).value_counts()
%timeit df.apply(np.sort, axis=1).groupby(['from','to']).size()
1000 loops, best of 3: 845 µs per loop
100 loops, best of 3: 3.45 ms per loop
Upvotes: 3
Reputation: 51165
You could use numpy.sort() and groupby:
In [41]: df.apply(np.sort, axis=1).groupby(['from','to']).size()
Out[41]:
from to
a b 2
c d 3
dtype: int64
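Depending on the pandas version, df.apply(np.sort, axis=1) may not return a DataFrame with the original column names, in which case the groupby call above fails. A minimal sketch of the same idea that sorts the underlying array once and rebuilds the frame explicitly, assuming the data from the question:

```python
import numpy as np
import pandas as pd

# Same data as in the question, rebuilt inline so the snippet is self-contained
df = pd.DataFrame({'from': ['a', 'b', 'c', 'c', 'd'],
                   'to':   ['b', 'a', 'd', 'd', 'c']})

# Sort the two values within each row in a single NumPy call
sorted_pairs = np.sort(df[['from', 'to']].to_numpy(), axis=1)

# Rebuild a DataFrame with the original column names, then count each pair
res = (pd.DataFrame(sorted_pairs, columns=['from', 'to'], index=df.index)
         .groupby(['from', 'to'])
         .size()
         .reset_index(name='count'))
print(res)
```

This avoids a Python-level function call per row, which is usually the slow part of apply with axis=1.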
Upvotes: 2