Ecko
Ecko

Reputation: 115

Sorting complete pandas data frame by a column while preserving sequence order

The idea follows; the whole data frame should be ordered by column 4 while keeping the exact sequence of information in every row when sorting by df_train[4] (e.g 'Table1', 'Table1', 'Table1', 'Table2','Table2', 'Table3','Table3', 'Table4)

Attempts and outputs:

import pandas as pd
from pandas import np

df_train = pd.DataFrame({1: ['1234', '4321','4321', '3241', '5432','5821', '4442', '4567', '0952','0952','7426','4422','4121','8885','4231','5932','5932','5555','0532','1111','5555','6321','6321','9955','3333','3333','3333','4132']})
df_train[2] = ['Mayo', 'Mango','Mango','Apple','Kiwi','Watermelon','Orange','Peach','Broccoli','Broccoli','Name','Surname','Location','Carrots','Salad','Water','Water','Wine','Beer','Cup','Rose','Flower','Flower','Paint','Van','Van','Van','Car']
df_train[3] = ['1','2','2','1','1','1','1','1','2','2','1','1','1','1','1','2','2','1','1','1','1','2','2','1','3','3','3','1']
df_train[4] = ['Table1','Table1','Table1','Table2','Table2','Table3','Table3','Table4','Table4','Table4','Table1','Table1','Table1','Table1','Table2','Table2','Table2','Table3','Table4','Table4','Table1','Table1','Table1','Table2','Table3','Table3','Table3','Table4']
output: 
    1        2          3      4
0   1234    Mayo        1   Table1
1   4321    Mango       2   Table1
2   4321    Mango       2   Table1
3   3241    Apple       1   Table2
4   5432    Kiwi        1   Table2
5   5821    Watermelon  1   Table3
6   4442    Orange      1   Table3
7   4567    Peach       1   Table4
8   0952    Broccoli    2   Table4
9   0952    Broccoli    2   Table4
10  7426    Name        1   Table1
11  4422    Surname     1   Table1
12  4121    Location    1   Table1
13  8885    Carrots     1   Table1
14  4231    Salad       1   Table2
15  5932    Water       2   Table2
16  5932    Water       2   Table2
17  5555    Wine        1   Table3
18  0532    Beer        1   Table4
19  1111    Cup         1   Table4
20  5555    Rose        1   Table1
21  6321    Flower      2   Table1
22  6321    Flower      2   Table1
23  9955    Paint       1   Table2
24  3333    Van         3   Table3
25  3333    Van         3   Table3
26  3333    Van         3   Table3
27  4132    Car         1   Table4

Attempt 1 & 2:

df_train2 = df_train.sort_values(4, ignore_index=True) # 
df_train2 = df_train.sort_values(4, ignore_index=False) #same issue

The output is ordered by all the targeted column 4 but the sequence is rather arbitrary no matter of indices:

    1        2          3     4
0   1234    Mayo        1   Table1 
1   6321    Flower      2   Table1 -> should be: 4321, Mango, 2, Table1
2   6321    Flower      2   Table1 -> should be: 4321, Mango, 2, Table1
3   5555    Rose        1   Table1 -> should be: 7426, Name, 1, Table1
4   4121    Location    1   Table1 -> should be: 4422, Surname, 1, Table1
5   4422    Surname     1   Table1 -> should be: 4121, Location, 1, Table1
6   7426    Name        1   Table1 ...etc
7   8885    Carrots     1   Table1
8   4321    Mango       2   Table1
9   4321    Mango       2   Table1
10  5432    Kiwi        1   Table2
11  4231    Salad       1   Table2
12  5932    Water       2   Table2
13  5932    Water       2   Table2
14  9955    Paint       1   Table2
15  3241    Apple       1   Table2
16  5555    Wine        1   Table3
17  3333    Van         3   Table3
18  4442    Orange      1   Table3
19  5821    Watermelon  1   Table3
20  3333    Van         3   Table3
21  3333    Van         3   Table3
22  0952    Broccoli    2   Table4
23  0532    Beer        1   Table4
24  4567    Peach       1   Table4
25  0952    Broccoli    2   Table4

Attempt 3 & 4:

# Attempt 3:
df_train3 = df_train.copy()
df_train3.loc[3:] = df_train3.loc[3:].sort_values(4).set_index(df_train3.loc[3:].index)

Didn't get the desired output even when slicing from the 3rd row

# Attempt 4:
isolate_digit = []
for indx,row in df_train.iterrows():
    isolate_digit.append( int(''.join(filter(str.isdigit,row[4]))))

df_train['order'] = isolate_digit 

Tried even isolating from the 4th column it's integer, thought maybe since it's a string ('Table1'..etc) that the alg doesn't preserve the order because of it, still not solving the issue.

Desired output, by slicing manually:

# Manually slicing desired output:

df_table_seq1 = df_train.iloc[0:3]  # Table1 sequence
df_table_seq2 = df_train.iloc[10:14] # Table1 seuqence
df_table_seq3 = df_train.iloc[20:23] # Table1 sequence

df_table_seq4 = df_train.iloc[3:5]  # Table2 sequence
df_table_seq5 = df_train.iloc[14:17] # Table2 seuqence
df_table_seq6 = df_train.iloc[23:24] # Table2 sequence

df_table_seq7 = df_train.iloc[5:7]  # Table3 sequence
df_table_seq8 = df_train.iloc[17:18] # Table3 seuqence
df_table_seq9 = df_train.iloc[24:27] # Table3 sequence


df_table_seq10 = df_train.iloc[7:10] # Table4 seuqence
df_table_seq11 = df_train.iloc[18:20] # Table4 sequence
df_table_seq12 = df_train.iloc[27:28] # Table4 sequence

df_final = df_table_seq1+df_table_seq2

df_final = pd.concat([df_table_seq1,df_table_seq2,df_table_seq3,df_table_seq4,df_table_seq5,df_table_seq6,df_table_seq7,df_table_seq8,df_table_seq9,df_table_seq10,df_table_seq11,df_table_seq12], sort = False)

Output and as it should be:

    1         2          3      4      order
0   1234    Mayo         1    Table1    1
1   4321    Mango        2    Table1    1
2   4321    Mango        2    Table1    1
10  7426    Name         1    Table1    1
11  4422    Surname      1    Table1    1
12  4121    Location     1    Table1    1
13  8885    Carrots      1    Table1    1
20  5555    Rose         1    Table1    1
21  6321    Flower       2    Table1    1
22  6321    Flower       2    Table1    1
3   3241    Apple        1    Table2    2
4   5432    Kiwi         1    Table2    2
14  4231    Salad        1    Table2    2
15  5932    Water        2    Table2    2
16  5932    Water        2    Table2    2
23  9955    Paint        1    Table2    2
5   5821    Watermelon   1    Table3    3
6   4442    Orange       1    Table3    3
17  5555    Wine         1    Table3    3
24  3333    Van          3    Table3    3
25  3333    Van          3    Table3    3
26  3333    Van          3    Table3    3
7   4567    Peach        1    Table4    4
8   0952    Broccoli     2    Table4    4
9   0952    Broccoli     2    Table4    4
18  0532    Beer         1    Table4    4
19  1111    Cup          1    Table4    4
27  4132    Car          1    Table4    4

Upvotes: 0

Views: 137

Answers (1)

mullinscr
mullinscr

Reputation: 1738

Make a helper column that is a copy of the index then sort by column 4 and the helper column

df['sort_helper'] = df.index
df.sort_values(['4', 'sort_helper'])

Upvotes: 2

Related Questions