hyeri
hyeri

Reputation: 693

Sorting rows in python pandas

I have a dataframe (the sample looks like this)

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,M       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots S                              S        330
Variation      2.7      Boots M                              M        330
Variable       3        Helmet           Helmet Sizes      E42,E41
Variation      3.8      Helmet E42                          E42       89
Variation      3.2      Helmet E41                          E41       89

What I want to do is sort the values based on Size so the final data frame should look like this:

  Type          SKU      Description   FullDescription        Size      Price
    Variable       2        Boots          Shoes on sale       S,M,XL        
    Variation      2.6      Boots S                             S       330
    Variation      2.7      Boots M                             M        330
    Variation      2.5      Boots XL                            XL        330
    Variable       3        Boots           Helmet Sizes       E41,E42
    Variation      3.2      Helmet E41                          E41       89
    Variation      3.8      Helmet E42                          E42       89

I am able to successfully get the results using this code

sizes, dig = ['S','M','XL','L',], ['000','111','333','222'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'Variable').cumsum()) 
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
df

The issue is that the given code dosen't work on dataframe

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,3XL       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots 3XL                            3XL        330
Variation      2.7      Boots S                              S        330
Variable       3        Helmet           Helmet Sizes      S19, S9
Variation      3.8      Helmet E42                          S19       89
Variation      3.2      Helmet E41                          S9       89

it gives the results 'S,3XL,XL' and 'S19,S9' whereas I want the results as

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       S,XL,3XL       
Variation      2.7      Boots S                             S          330
Variation      2.5      Boots XL                            XL        330
Variation      2.6      Boots 3XL                           3XL        330
Variable       3        Helmet           Helmet Sizes      S9,S19
Variation      3.2      Helmet E41                          S9        89
Variation      3.8      Helmet E42                          S19       89

also in case of more sizes, the order should be 'XXS,XS,S,M,L,XL,XXL,3XL,4XL,5XL' and in case of second example, 'S9,S19,M9,M19,L9 and so on'

This is what I have done so far but it's not working and showing the wrong order

sizes, dig = ['XS','S','M','L','XL','XXL','3XL','4XL','5XL'], ['000','111','222','333','444','555','666','777','888'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'variable').cumsum())
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)

Upvotes: 1

Views: 142

Answers (1)

Dieter
Dieter

Reputation: 2659

Step 1: recreate the data

import pandas as pd

#----------------------#
# Recreate the dataset #
#----------------------#
# raw input data_1 = """ Variable|2|Boots|Shoes on sale|XL,S,M|  
                         Variation|2.5|Boots XL||XL|330 Variation|2.6|Boots S||S|330 
                         Variation|2.7|Boots M||M|330 Variable|3|Helmet|Helmet Sizes|E42,E41| 
                         Variation|3.8|Helmet E42||E42|89 
                         Variation|3.2|Helmet E41||E41|89"""

data_2 = """ Variable|2|Boots|Shoes on sale|XL,S,3XL| 
             Variation|2.5|Boots XL||XL|330 
             Variation|2.6|Boots 3XL||3XL|330 
             Variation|2.7|Boots S||S|330 
             Variable|3|Helmet|Helmet Sizes|S19, S9| 
             Variation|3.8|Helmet E42||S19|89 
             Variation|3.2|Helmet E41||S9|89"""

# Construct 1 data set
data = 'Type|SKU|Description|FullDescription|Size|Price'
data += data_2 # this can also be data_1  or data_1 + data_2

# pre-process: split the lines and values into a list of lists.
data = [row.split('|') for row in data.split('\n')]

#-------------#
# create a df #
#-------------#
df = pd.DataFrame(data[1:], columns=data[0]) df

Temp Result

Type    SKU     Description     FullDescription          Size   Price
0   Variable    2               Boots   Shoes on sale   XL,S,3XL    
1   Variation   2.5             Boots XL                XL          330
2   Variation   2.6             Boots 3XL               3XL         330
3   Variation   2.7             Boots S                 S           330
4   Variable    3               Helmet  Helmet Sizes    S19, S9     
5   Variation   3.8             Helmet E42              S19         89
6   Variation   3.2             Helmet E41              S9          89

Step 2: Create a priority dict

I'm not really into fashion + I'm also a guy --> (I'm only familiar with S M L XL)
But feel free to re order them or add extra sizes into the list

# Prioritize the sizes
# ps, i don't know the order :) 
priority_dict = {k : e for e, k in enumerate([ 'XXS','XS','S','M','L','XL','XXL','3XL','4XL','5XL', 'E41', 'E42', 'S9', 'S19' ])}
priority_dict

Temp Result

{'XXS': 0,
 'XS': 1,
 'S': 2,
 'M': 3,
 'L': 4,
 'XL': 5,
 'XXL': 6,
 '3XL': 7,
 '4XL': 8,
 '5XL': 9,
 'E41': 10,
 'E42': 11,
 'S9': 12,
 'S19': 13}

Step 3: Create a list of tuples from the string of sizes

# Split the string  "SIZE" into a list    "XL,S,M" --> ["XL", "S", "M"]
# And, add the value from our priority dict to it  --> [(5, "XL"), (2, "S"), (3, "M")]
# Last but not least, sort list (by the first value) --> [(2, "S"), (3, "M"), (5, "XL")]
df["TMP_SIZE"] = [ sorted([(priority_dict.get(size.strip()), size.strip())  for size in sizes.split(',')]) for sizes in df.Size]
df

Temp Result

Type    SKU     Description     FullDescription          Size       Price  TMP_SIZE
0   Variable    2               Boots   Shoes on sale   XL,S,3XL           [(2, S), (5, XL), (7, 3XL)]
1   Variation   2.5             Boots XL                XL          330    [(5, XL)]
2   Variation   2.6             Boots 3XL               3XL         330    [(7, 3XL)]
3   Variation   2.7             Boots S                 S           330    [(2, S)]
4   Variable    3               Helmet  Helmet Sizes    S19, S9            [(12, S9), (13, S19)]
5   Variation   3.8             Helmet E42              S19         89     [(13, S19)]
6   Variation   3.2             Helmet E41              S9          89     [(12, S9)]

Step 4: Clean TEMP_SIZE

# Create a new SIZE
# loop over the TMPS_SIZE and create a string from the second value of the tuplelist --> ', '.join( my_list )

df['NEW_SIZE'] = [', '.join([ size[1]for size in sizes ]) for sizes in df["TMP_SIZE"] ]

Temp Result

Type    SKU     Description     ...     Size        Price  TMP_SIZE                       NEW_SIZE
0   Variable    2               ...     XL,S,3XL           [(2, S), (5, XL), (7, 3XL)]  S, XL, 3XL
1   Variation   2.5             ...     XL          330    [(5, XL)]                    XL
2   Variation   2.6             ...     3XL         330    [(7, 3XL)]                   3XL
3   Variation   2.7             ...     S           330    [(2, S)]                     S
4   Variable    3               ...     S19, S9            [(12, S9), (13, S19)]        S9, S19
5   Variation   3.8             ...     S19         89     [(13, S19)]                  S19
6   Variation   3.2             ...     S9          89     [(12, S9)]                   S9

Step 5: grp

add your grp

#grp
df['grp']= (df['Type'] == 'Variable').cumsum()
df

Step 6: sort

in the last step, you can sort everything (I think that you need to sort the TMP_SIZE separately)

# sort the dataset
df = df.sort_values('TMP_SIZE') # notice that we sort on the list of tuples
df.sort_values(by=['grp', 'Type'])

Upvotes: 2

Related Questions