Chrestomanci
Chrestomanci

Reputation: 221

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns are representative of multiple encounters with the customer, with an underscore and the number encounter. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.

Sample dataframe header structure excerpt:

id    dob    gender    pro_1    pro_10   pro_11   pro_2 ... pro_9    pre_1   pre_10   ...

I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:

id    dob    gender    pro_1    pre_1    que_1    fre_1    gen_1    pro2    pre_2    que_2    fre_2    ...

(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)

Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!

EDIT:

Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:

id    dob    gender    pro_1    pro_2    pro_3    ...    pre_1    pre_2    pre_3   ...

EDIT 2.0:

Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.

Upvotes: 2

Views: 585

Answers (4)

Supratim Haldar
Supratim Haldar

Reputation: 2416

Try this.

To re-order the columns based on the number after the column name

cols_fixed = df.columns[:3]  # change index no based on your df
cols_variable = df.columns[3:]  # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x : int(x.split('_')[1]))  # split based on the number after '_'
cols_new = cols_fixed + cols_variable 
new_df = pd.DataFrame(df[cols_new])

To re-arrange column names based on the string part AND number part of the column names

cols_fixed = df.columns[:3]  # change index no based on your df
cols_variable = df.columns[3:]  # change index no based on your df
cols_variable = sorted(cols_variable)
cols_new = cols_fixed + cols_variable 
new_df = pd.DataFrame(df[cols_new])

Upvotes: 0

jxc
jxc

Reputation: 13998

Here is one way you can try:

# column names copied from your example
example_cols = 'id    dob    gender    pro_1    pro_10   pro_11   pro_2  pro_9    pre_1   pre_10'.split()

# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
#   id  dob  gender  pro_1  pro_10  pro_11  pro_2  pro_9  pre_1  pre_10
#0   0    1       2      3       4       5      6      7      8       9

# number of columns excluded from sorting
N = 3

# get a list of columns from the dataframe
cols = df.columns.tolist()

# split, create an tuple of (column_name, prefix, number) and sorted based on the 2nd and 3rd item of the tuple, then retrieved the first item.
# adjust "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]

# get the new dataframe based on the cols_new
df_new = df[cols_new]
#   id  dob  gender  pre_1  pre_10  pro_1  pro_2  pro_9  pro_10  pro_11
#0   0    1       2      8       9      3      6      7       4       5

Upvotes: 2

Scott Boston
Scott Boston

Reputation: 153460

You need to split you column on '_' then convert to int:

c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']

df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)

Output:

   A_1  B_1  A_2  B_2  A_3  B_3  A_10  B_10
0   68   11   59   69   37   68    76    17
1   19   37   52   54   23   93    85     3

Next case, you need human sorting:

import re
def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]



df.reindex(sorted(df.columns, key = lambda x:natural_keys(x)), axis=1)

Output:

   A_1  A_2  A_3  A_10  B_1  B_2  B_3  B_10
0   68   59   37    76   11   69   68    17
1   19   52   23    85   37   54   93     3

Upvotes: 1

Edeki Okoh
Edeki Okoh

Reputation: 1844

Luckily there is a one liner in python that can fix this:

df = df.reindex(sorted(df.columns), axis=1)

For Example lets say you had this dataframe:

import pandas as pd import numpy as np

df = pd.DataFrame({'Name': [2, 4, 8, 0],
                   'ID': [2, 0, 0, 0],
                   'Prod3': [10, 2, 1, 8],
                   'Prod1': [2, 4, 8, 0],
                   'Prod_1': [2, 4, 8, 0],
                   'Pre7': [2, 0, 0, 0],
                   'Pre2': [10, 2, 1, 8],
                   'Pre_2': [10, 2, 1, 8],
                   'Pre_9': [10, 2, 1, 8]}
                   )

print(df)

Output:

   Name  ID  Prod3  Prod1  Prod_1  Pre7  Pre2  Pre_2  Pre_9
0     2   2     10      2       2     2    10     10     10
1     4   0      2      4       4     0     2      2      2
2     8   0      1      8       8     0     1      1      1
3     0   0      8      0       0     0     8      8      8

Then used

df = df.reindex(sorted(df.columns), axis=1)

Then the dataframe will then look like:

   ID  Name  Pre2  Pre7  Pre_2  Pre_9  Prod1  Prod3  Prod_1
0   2     2    10     2     10     10      2     10       2
1   0     4     2     0      2      2      4      2       4
2   0     8     1     0      1      1      8      1       8
3   0     0     8     0      8      8      0      8       0

As you can see, the columns without underscore will come first, followed by an ordering based on the number after the underscore. However this also sorts of the column names, so the column names that come first in the alphabet will be first.

Upvotes: 1

Related Questions