Statistics of the ordering of columns

Question

Say I have a dataframe with N columns (e.g. N=3). Every row represents a sample:

                A        B        C                                
sample_1       64       46       69
sample_2       55       33       40
sample_3       67       51       78
sample_4       97       32       62
sample_5       50       36       39

I would like to know what is the most common ordering of the columns A, B, C across rows.

In the case above, one could sort every row manually:

sample_1: [B, A, C]
sample_2: [B, C, A] 
sample_3: [B, A, C]
sample_4: [B, C, A] 
sample_5: [B, C, A]

and then find out that the most common ordering is [B, C, A], while [B, A, C] is the second most common.

Are there any functions in Pandas, scipy or statsmodels that facilitate this analysis? For example, what if I want to find out how often each ordering happens?

behzad.nouri · Accepted Answer

Maybe:

>>> from collections import Counter
>>> f = lambda ts: df.columns[np.argsort(ts).values]
>>> Counter(map(tuple, df.apply(f, axis=1).values))
Counter({('B', 'C', 'A'): 3, ('B', 'A', 'C'): 2})

So the most common ordering is:

>>> _.most_common(1)
[(('B', 'C', 'A'), 3)]

Alternatively:

>>> f = lambda ts: tuple(df.columns[np.argsort(ts)])
>>> df.apply(f, axis=1, raw=True).value_counts()
(B, C, A)    3
(B, A, C)    2
dtype: int64

Statistics of the ordering of columns

Answers (2)

Related Questions