Reputation: 693
I have a dataframe (the sample looks like this)
Type SKU Description FullDescription Size Price
Variable 2 Boots Shoes on sale XL,S,M
Variation 2.5 Boots XL XL 330
Variation 2.6 Boots S S 330
Variation 2.7 Boots M M 330
Variable 3 Helmet Helmet Sizes E42,E41
Variation 3.8 Helmet E42 E42 89
Variation 3.2 Helmet E41 E41 89
What I want to do is sort the values based on Size so the final data frame should look like this:
Type SKU Description FullDescription Size Price
Variable 2 Boots Shoes on sale S,M,XL
Variation 2.6 Boots S S 330
Variation 2.7 Boots M M 330
Variation 2.5 Boots XL XL 330
Variable 3 Boots Helmet Sizes E41,E42
Variation 3.2 Helmet E41 E41 89
Variation 3.8 Helmet E42 E42 89
I am able to successfully get the results using this code
sizes, dig = ['S','M','XL','L',], ['000','111','333','222'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
.assign(grp=(df['Type'] == 'Variable').cumsum())
.sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
df
The issue is that the given code dosen't work on dataframe
Type SKU Description FullDescription Size Price
Variable 2 Boots Shoes on sale XL,S,3XL
Variation 2.5 Boots XL XL 330
Variation 2.6 Boots 3XL 3XL 330
Variation 2.7 Boots S S 330
Variable 3 Helmet Helmet Sizes S19, S9
Variation 3.8 Helmet E42 S19 89
Variation 3.2 Helmet E41 S9 89
it gives the results 'S,3XL,XL
' and 'S19,S9
' whereas I want the results as
Type SKU Description FullDescription Size Price
Variable 2 Boots Shoes on sale S,XL,3XL
Variation 2.7 Boots S S 330
Variation 2.5 Boots XL XL 330
Variation 2.6 Boots 3XL 3XL 330
Variable 3 Helmet Helmet Sizes S9,S19
Variation 3.2 Helmet E41 S9 89
Variation 3.8 Helmet E42 S19 89
also in case of more sizes, the order should be 'XXS,XS,S,M,L,XL,XXL,3XL,4XL,5XL
' and in case of second example, 'S9,S19,M9,M19,L9 and so on
'
This is what I have done so far but it's not working and showing the wrong order
sizes, dig = ['XS','S','M','L','XL','XXL','3XL','4XL','5XL'], ['000','111','222','333','444','555','666','777','888'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
.assign(grp=(df['Type'] == 'variable').cumsum())
.sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
Upvotes: 1
Views: 142
Reputation: 2659
import pandas as pd
#----------------------#
# Recreate the dataset #
#----------------------#
# raw input data_1 = """ Variable|2|Boots|Shoes on sale|XL,S,M|
Variation|2.5|Boots XL||XL|330 Variation|2.6|Boots S||S|330
Variation|2.7|Boots M||M|330 Variable|3|Helmet|Helmet Sizes|E42,E41|
Variation|3.8|Helmet E42||E42|89
Variation|3.2|Helmet E41||E41|89"""
data_2 = """ Variable|2|Boots|Shoes on sale|XL,S,3XL|
Variation|2.5|Boots XL||XL|330
Variation|2.6|Boots 3XL||3XL|330
Variation|2.7|Boots S||S|330
Variable|3|Helmet|Helmet Sizes|S19, S9|
Variation|3.8|Helmet E42||S19|89
Variation|3.2|Helmet E41||S9|89"""
# Construct 1 data set
data = 'Type|SKU|Description|FullDescription|Size|Price'
data += data_2 # this can also be data_1 or data_1 + data_2
# pre-process: split the lines and values into a list of lists.
data = [row.split('|') for row in data.split('\n')]
#-------------#
# create a df #
#-------------#
df = pd.DataFrame(data[1:], columns=data[0]) df
Type SKU Description FullDescription Size Price
0 Variable 2 Boots Shoes on sale XL,S,3XL
1 Variation 2.5 Boots XL XL 330
2 Variation 2.6 Boots 3XL 3XL 330
3 Variation 2.7 Boots S S 330
4 Variable 3 Helmet Helmet Sizes S19, S9
5 Variation 3.8 Helmet E42 S19 89
6 Variation 3.2 Helmet E41 S9 89
I'm not really into fashion + I'm also a guy --> (I'm only familiar with S M L XL)
But feel free to re order them or add extra sizes into the list
# Prioritize the sizes
# ps, i don't know the order :)
priority_dict = {k : e for e, k in enumerate([ 'XXS','XS','S','M','L','XL','XXL','3XL','4XL','5XL', 'E41', 'E42', 'S9', 'S19' ])}
priority_dict
{'XXS': 0,
'XS': 1,
'S': 2,
'M': 3,
'L': 4,
'XL': 5,
'XXL': 6,
'3XL': 7,
'4XL': 8,
'5XL': 9,
'E41': 10,
'E42': 11,
'S9': 12,
'S19': 13}
# Split the string "SIZE" into a list "XL,S,M" --> ["XL", "S", "M"]
# And, add the value from our priority dict to it --> [(5, "XL"), (2, "S"), (3, "M")]
# Last but not least, sort list (by the first value) --> [(2, "S"), (3, "M"), (5, "XL")]
df["TMP_SIZE"] = [ sorted([(priority_dict.get(size.strip()), size.strip()) for size in sizes.split(',')]) for sizes in df.Size]
df
Type SKU Description FullDescription Size Price TMP_SIZE
0 Variable 2 Boots Shoes on sale XL,S,3XL [(2, S), (5, XL), (7, 3XL)]
1 Variation 2.5 Boots XL XL 330 [(5, XL)]
2 Variation 2.6 Boots 3XL 3XL 330 [(7, 3XL)]
3 Variation 2.7 Boots S S 330 [(2, S)]
4 Variable 3 Helmet Helmet Sizes S19, S9 [(12, S9), (13, S19)]
5 Variation 3.8 Helmet E42 S19 89 [(13, S19)]
6 Variation 3.2 Helmet E41 S9 89 [(12, S9)]
# Create a new SIZE
# loop over the TMPS_SIZE and create a string from the second value of the tuplelist --> ', '.join( my_list )
df['NEW_SIZE'] = [', '.join([ size[1]for size in sizes ]) for sizes in df["TMP_SIZE"] ]
Type SKU Description ... Size Price TMP_SIZE NEW_SIZE
0 Variable 2 ... XL,S,3XL [(2, S), (5, XL), (7, 3XL)] S, XL, 3XL
1 Variation 2.5 ... XL 330 [(5, XL)] XL
2 Variation 2.6 ... 3XL 330 [(7, 3XL)] 3XL
3 Variation 2.7 ... S 330 [(2, S)] S
4 Variable 3 ... S19, S9 [(12, S9), (13, S19)] S9, S19
5 Variation 3.8 ... S19 89 [(13, S19)] S19
6 Variation 3.2 ... S9 89 [(12, S9)] S9
add your grp
#grp
df['grp']= (df['Type'] == 'Variable').cumsum()
df
in the last step, you can sort everything (I think that you need to sort the TMP_SIZE separately)
# sort the dataset
df = df.sort_values('TMP_SIZE') # notice that we sort on the list of tuples
df.sort_values(by=['grp', 'Type'])
Upvotes: 2