Tomsky

Reputation: 172

Count of combination of columns regardless of order

I have a pandas dataframe that looks like this:

from    to  
a       b
b       a
c       d
c       d
d       c

I want to find the count of the combination of from and to regardless of the order so I'll end up with something like:

places  count
[a,b]   2
[c,d]   3

I'm struggling to find an effective way of achieving this. Any help would be much appreciated.

Upvotes: 3

Views: 301

Answers (3)

jpp

Reputation: 164623

You can use collections.Counter for an O(n) solution:

from collections import Counter

import pandas as pd

c = Counter(map(frozenset, (zip(df['from'], df['to']))))

res = pd.DataFrame.from_dict(c, orient='index').reset_index()

print(res)

#     index  0
# 0  (a, b)  2
# 1  (c, d)  3

Note that the conversion to frozenset is required because Counter only works on hashable objects: a plain set would ignore order but is not hashable, while a frozenset does both. This should also be more efficient than a groupby solution.
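As a minimal sketch of the same idea, the snippet below rebuilds the sample data from the question and names the result columns places and count as the question asks. The variable names (counts, res) and the choice to display each pair as a sorted tuple are assumptions for illustration, not part of the original answer.

```python
from collections import Counter

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'from': ['a', 'b', 'c', 'c', 'd'],
                   'to':   ['b', 'a', 'd', 'd', 'c']})

# frozenset ignores order, so ('a', 'b') and ('b', 'a') share one key.
counts = Counter(map(frozenset, zip(df['from'], df['to'])))

# Show each pair as a sorted tuple so the display order is predictable.
res = pd.DataFrame({
    'places': [tuple(sorted(pair)) for pair in counts],
    'count': list(counts.values()),
})
print(res)
```

One thing to keep in mind: a row whose from and to values are identical collapses to a single-element frozenset, so it would show up as a one-item tuple here.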

Upvotes: 2

Anton vBR

Reputation: 18906

You can use value_counts() on the zipped columns, with each pair converted to a frozenset. Because a frozenset has no defined order, a pair may be displayed as (d, c) rather than (c, d). If you prefer the pairs sorted, use tuple(sorted(i)) for i in zip(...) instead of map(frozenset, ...). There seems to be a 4x speed-up compared to the groupby solution. Update: the speed comparison is not really fair, as the two solutions do different things.

import pandas as pd

data = '''\
from    to  
a       b
b       a
c       d
c       d
d       c'''

from io import StringIO  # pd.compat.StringIO is not available in recent pandas versions

df = pd.read_csv(StringIO(data), sep=r'\s+')

out = pd.Series(map(frozenset,zip(df['from'],df['to']))).value_counts().reset_index()
out.rename(columns={'index':'places',0:'count'}, inplace=True)

print(out)

And you get:

   places  count
0  (d, c)      3
1  (a, b)      2
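The sorted-tuple variant mentioned above could look roughly like this. It is a sketch on the question's sample data; the names pairs, counts and out are illustrative, and the result frame is built explicitly rather than through reset_index() so the column names do not depend on the pandas version.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'from': ['a', 'b', 'c', 'c', 'd'],
                   'to':   ['b', 'a', 'd', 'd', 'c']})

# Sort each pair so that (d, c) and (c, d) become the same tuple.
pairs = pd.Series([tuple(sorted(p)) for p in zip(df['from'], df['to'])])

# value_counts() orders the result by descending count.
counts = pairs.value_counts()

out = pd.DataFrame({'places': counts.index.tolist(),
                    'count': counts.to_numpy()})
print(out)
```

With sorted tuples the pair for c and d is always shown as (c, d), at the cost of sorting each row's two values.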

Time comparison:

%timeit pd.Series(map(frozenset,zip(df['from'],df['to']))).value_counts()
%timeit df.apply(np.sort, axis=1).groupby(['from','to']).size()

1000 loops, best of 3: 845 µs per loop
100 loops, best of 3: 3.45 ms per loop

Upvotes: 3

user3483203

Reputation: 51165

You could use numpy.sort() and groupby:

In [41]: df.apply(np.sort, axis=1).groupby(['from','to']).size()
Out[41]:
from  to
a     b     2
c     d     3
dtype: int64
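Depending on the pandas version, df.apply(np.sort, axis=1) may return a Series of arrays rather than a DataFrame, in which case the groupby on ['from', 'to'] will not work as shown. A hedged alternative, not from the original answer, is to sort the underlying NumPy array directly and rebuild the frame; the name sorted_pairs is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'from': ['a', 'b', 'c', 'c', 'd'],
                   'to':   ['b', 'a', 'd', 'd', 'c']})

# Sort the two values within each row, then rebuild a frame
# with the original column names and index.
sorted_pairs = pd.DataFrame(
    np.sort(df[['from', 'to']].to_numpy(), axis=1),
    columns=['from', 'to'],
    index=df.index,
)

# Count how often each (sorted) pair occurs.
res = sorted_pairs.groupby(['from', 'to']).size()
print(res)
```

This avoids the row-wise apply entirely, so the sort happens in one NumPy call over the whole array.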

Upvotes: 2
