Reputation: 96264
Say I have a dataframe with N columns (e.g. N=3). Every row represents a sample:
          A   B   C
sample_1  64  46  69
sample_2  55  33  40
sample_3  67  51  78
sample_4  97  32  62
sample_5  50  36  39
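For reference, this example frame can be built as follows (a minimal sketch; the frame is assumed to be named df throughout):
import pandas as pd

df = pd.DataFrame(
    {"A": [64, 55, 67, 97, 50],
     "B": [46, 33, 51, 32, 36],
     "C": [69, 40, 78, 62, 39]},
    index=["sample_1", "sample_2", "sample_3", "sample_4", "sample_5"],
)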
I would like to find the most common ordering of the columns A, B, C across rows.
In the case above, one could sort every row manually:
sample_1: [B, A, C]
sample_2: [B, C, A]
sample_3: [B, A, C]
sample_4: [B, C, A]
sample_5: [B, C, A]
and then find out that the most common ordering is [B, C, A], while [B, A, C] is the second most common.
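Each per-row ordering is just an argsort of that row; for example, checking sample_1 by hand (a quick sketch against the frame above):
import numpy as np

row = df.loc["sample_1"]                           # A=64, B=46, C=69
print(list(df.columns[np.argsort(row.values)]))    # ['B', 'A', 'C']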
Are there any functions in pandas, SciPy, or statsmodels that facilitate this analysis? For example, what if I want to find out how often each ordering occurs?
Upvotes: 2
Views: 71
Reputation: 77941
Maybe:
>>> import numpy as np
>>> from collections import Counter
>>> f = lambda ts: df.columns[np.argsort(ts).values]
>>> Counter(map(tuple, df.apply(f, axis=1).values))
Counter({('B', 'C', 'A'): 3, ('B', 'A', 'C'): 2})
So the most common ordering is (_ being the Counter just returned):
>>> _.most_common(1)
[(('B', 'C', 'A'), 3)]
Alternatively:
>>> f = lambda ts: tuple(df.columns[np.argsort(ts)])
>>> df.apply(f, axis=1, raw=True).value_counts()
(B, C, A)    3
(B, A, C)    2
dtype: int64
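If you want relative frequencies instead of counts (the "how often does each ordering happen" part of the question), value_counts also accepts normalize=True; with f as just defined, this should give:
>>> df.apply(f, axis=1, raw=True).value_counts(normalize=True)
(B, C, A)    0.6
(B, A, C)    0.4
dtype: float64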
Upvotes: 4
Reputation: 375475
It can be more efficient to use the cythonized rank function:
In [11]: df.rank(axis=1)
Out[11]:
          A  B  C
sample_1  2  1  3
sample_2  3  1  2
sample_3  2  1  3
sample_4  3  1  2
sample_5  3  1  2
You could then do a groupby, for example to get the sizes:
In [12]: df.rank(axis=1).groupby(['A', 'B', 'C']).size()
Out[12]:
A  B  C
2  1  3    2
3  1  2    3
dtype: int64
Note: here we're reading 2 1 3 to mean ["B", "A", "C"], i.e. the column holding rank 1 comes first. You can replace the index of the result (call it res, the Series from In [12]) with the orderings themselves by argsorting each rank tuple; note that naively mapping rank values to letters would give the inverse permutation (ranks 3 1 2 mean the ordering [B, C, A], not [C, A, B]):
In [13]: res.index.map(lambda y: tuple(df.columns[np.argsort(y)]))
Out[13]: array([('B', 'A', 'C'), ('B', 'C', 'A')], dtype=object)
In [14]: res.index.map(lambda y: "".join(df.columns[np.argsort(y)]))
Out[14]: array(['BAC', 'BCA'], dtype=object)
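One caveat not covered above: rank uses method='average' by default, so tied values get fractional ranks (e.g. 1.5) and such rows won't form a clean permutation. If that matters for your data, method='first' breaks ties by column order, e.g. (a small sketch):
In [15]: pd.DataFrame({'A': [5], 'B': [5], 'C': [7]}).rank(axis=1, method='first')
Out[15]:
   A  B  C
0  1  2  3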
Here's the performance for a slightly larger dataframe (f being the lambda from the answer above):
In [21]: df1 = pd.concat([df] * 1000, ignore_index=True)
In [22]: %timeit df1.rank(axis=1).groupby(['A', 'B', 'C']).size()
100 loops, best of 3: 4.82 ms per loop
In [23]: %timeit Counter(map(tuple, df1.apply(f, axis=1).values))
1 loops, best of 3: 1.68 s per loop
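For completeness, a fully vectorised variant (a sketch, not from either answer; np.unique with axis= requires NumPy 1.13+) argsorts all rows at once and counts the distinct permutations:
In [24]: order = np.argsort(df1.values, axis=1)
In [25]: perms, counts = np.unique(order, axis=0, return_counts=True)
In [26]: {tuple(df1.columns[p]): c for p, c in zip(perms, counts)}
Out[26]: {('B', 'A', 'C'): 2000, ('B', 'C', 'A'): 3000}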
Upvotes: 4