Reputation: 115
The idea follows; the whole data frame should be ordered by column 4 while keeping the exact sequence of information in every row when sorting by df_train[4] (e.g 'Table1', 'Table1', 'Table1', 'Table2','Table2', 'Table3','Table3', 'Table4)
Attempts and outputs:
import pandas as pd
from pandas import np
df_train = pd.DataFrame({1: ['1234', '4321','4321', '3241', '5432','5821', '4442', '4567', '0952','0952','7426','4422','4121','8885','4231','5932','5932','5555','0532','1111','5555','6321','6321','9955','3333','3333','3333','4132']})
df_train[2] = ['Mayo', 'Mango','Mango','Apple','Kiwi','Watermelon','Orange','Peach','Broccoli','Broccoli','Name','Surname','Location','Carrots','Salad','Water','Water','Wine','Beer','Cup','Rose','Flower','Flower','Paint','Van','Van','Van','Car']
df_train[3] = ['1','2','2','1','1','1','1','1','2','2','1','1','1','1','1','2','2','1','1','1','1','2','2','1','3','3','3','1']
df_train[4] = ['Table1','Table1','Table1','Table2','Table2','Table3','Table3','Table4','Table4','Table4','Table1','Table1','Table1','Table1','Table2','Table2','Table2','Table3','Table4','Table4','Table1','Table1','Table1','Table2','Table3','Table3','Table3','Table4']
output:
1 2 3 4
0 1234 Mayo 1 Table1
1 4321 Mango 2 Table1
2 4321 Mango 2 Table1
3 3241 Apple 1 Table2
4 5432 Kiwi 1 Table2
5 5821 Watermelon 1 Table3
6 4442 Orange 1 Table3
7 4567 Peach 1 Table4
8 0952 Broccoli 2 Table4
9 0952 Broccoli 2 Table4
10 7426 Name 1 Table1
11 4422 Surname 1 Table1
12 4121 Location 1 Table1
13 8885 Carrots 1 Table1
14 4231 Salad 1 Table2
15 5932 Water 2 Table2
16 5932 Water 2 Table2
17 5555 Wine 1 Table3
18 0532 Beer 1 Table4
19 1111 Cup 1 Table4
20 5555 Rose 1 Table1
21 6321 Flower 2 Table1
22 6321 Flower 2 Table1
23 9955 Paint 1 Table2
24 3333 Van 3 Table3
25 3333 Van 3 Table3
26 3333 Van 3 Table3
27 4132 Car 1 Table4
Attempt 1 & 2:
df_train2 = df_train.sort_values(4, ignore_index=True) #
df_train2 = df_train.sort_values(4, ignore_index=False) #same issue
The output is ordered by all the targeted column 4 but the sequence is rather arbitrary no matter of indices:
1 2 3 4
0 1234 Mayo 1 Table1
1 6321 Flower 2 Table1 -> should be: 4321, Mango, 2, Table1
2 6321 Flower 2 Table1 -> should be: 4321, Mango, 2, Table1
3 5555 Rose 1 Table1 -> should be: 7426, Name, 1, Table1
4 4121 Location 1 Table1 -> should be: 4422, Surname, 1, Table1
5 4422 Surname 1 Table1 -> should be: 4121, Location, 1, Table1
6 7426 Name 1 Table1 ...etc
7 8885 Carrots 1 Table1
8 4321 Mango 2 Table1
9 4321 Mango 2 Table1
10 5432 Kiwi 1 Table2
11 4231 Salad 1 Table2
12 5932 Water 2 Table2
13 5932 Water 2 Table2
14 9955 Paint 1 Table2
15 3241 Apple 1 Table2
16 5555 Wine 1 Table3
17 3333 Van 3 Table3
18 4442 Orange 1 Table3
19 5821 Watermelon 1 Table3
20 3333 Van 3 Table3
21 3333 Van 3 Table3
22 0952 Broccoli 2 Table4
23 0532 Beer 1 Table4
24 4567 Peach 1 Table4
25 0952 Broccoli 2 Table4
Attempt 3 & 4:
# Attempt 3:
df_train3 = df_train.copy()
df_train3.loc[3:] = df_train3.loc[3:].sort_values(4).set_index(df_train3.loc[3:].index)
Didn't get the desired output even when slicing from the 3rd row
# Attempt 4:
isolate_digit = []
for indx,row in df_train.iterrows():
isolate_digit.append( int(''.join(filter(str.isdigit,row[4]))))
df_train['order'] = isolate_digit
Tried even isolating from the 4th column it's integer, thought maybe since it's a string ('Table1'..etc) that the alg doesn't preserve the order because of it, still not solving the issue.
Desired output, by slicing manually:
# Manually slicing desired output:
df_table_seq1 = df_train.iloc[0:3] # Table1 sequence
df_table_seq2 = df_train.iloc[10:14] # Table1 seuqence
df_table_seq3 = df_train.iloc[20:23] # Table1 sequence
df_table_seq4 = df_train.iloc[3:5] # Table2 sequence
df_table_seq5 = df_train.iloc[14:17] # Table2 seuqence
df_table_seq6 = df_train.iloc[23:24] # Table2 sequence
df_table_seq7 = df_train.iloc[5:7] # Table3 sequence
df_table_seq8 = df_train.iloc[17:18] # Table3 seuqence
df_table_seq9 = df_train.iloc[24:27] # Table3 sequence
df_table_seq10 = df_train.iloc[7:10] # Table4 seuqence
df_table_seq11 = df_train.iloc[18:20] # Table4 sequence
df_table_seq12 = df_train.iloc[27:28] # Table4 sequence
df_final = df_table_seq1+df_table_seq2
df_final = pd.concat([df_table_seq1,df_table_seq2,df_table_seq3,df_table_seq4,df_table_seq5,df_table_seq6,df_table_seq7,df_table_seq8,df_table_seq9,df_table_seq10,df_table_seq11,df_table_seq12], sort = False)
Output and as it should be:
1 2 3 4 order
0 1234 Mayo 1 Table1 1
1 4321 Mango 2 Table1 1
2 4321 Mango 2 Table1 1
10 7426 Name 1 Table1 1
11 4422 Surname 1 Table1 1
12 4121 Location 1 Table1 1
13 8885 Carrots 1 Table1 1
20 5555 Rose 1 Table1 1
21 6321 Flower 2 Table1 1
22 6321 Flower 2 Table1 1
3 3241 Apple 1 Table2 2
4 5432 Kiwi 1 Table2 2
14 4231 Salad 1 Table2 2
15 5932 Water 2 Table2 2
16 5932 Water 2 Table2 2
23 9955 Paint 1 Table2 2
5 5821 Watermelon 1 Table3 3
6 4442 Orange 1 Table3 3
17 5555 Wine 1 Table3 3
24 3333 Van 3 Table3 3
25 3333 Van 3 Table3 3
26 3333 Van 3 Table3 3
7 4567 Peach 1 Table4 4
8 0952 Broccoli 2 Table4 4
9 0952 Broccoli 2 Table4 4
18 0532 Beer 1 Table4 4
19 1111 Cup 1 Table4 4
27 4132 Car 1 Table4 4
Upvotes: 0
Views: 137
Reputation: 1738
Make a helper column that is a copy of the index then sort by column 4 and the helper column
df['sort_helper'] = df.index
df.sort_values(['4', 'sort_helper'])
Upvotes: 2