How to remove the duplicates excel column values in python while keeping the original order?

Question

I have a data frame with columns values having different duplicates in them and I want to remove these duplicates from the entire data frame which consists of thousands of rows.

Data looks like this in the excel file:


index  ColumnA
0      6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" CAT TAIL
1      5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182 dog
2      1/2" Pipe, W/VALVE, Broken sides - packaging open, PIPE, Like NEW
3      6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC, Wedding Ring
4      5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging, 5' ladder

I have tried:

def removeduplicates(str):
     t=""
     for t in str:
         if(i in t):
            pass
         else:
             t =t+1

also

df['columnA'].apply(lambda cell: set([c.strip() for c in cell.strip(', ')]))

But both methods don't work for this case.

Required output:

ColumnA
6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" TAIL
5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182
1/2" Pipe, W/VALVE, Broken sides - packaging open, Like NEW
6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC 
5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging

Data file: https://1drv.ms/x/s!ArCp0UbnlDoughmn3Io9aOvhNykZ?e=vZQXdC

I have already tried drop duplicates etc. I do not want to drop rows. I do not want to drop columns. I have read this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html but could not find my answer.

David Erickson · Accepted Answer

You can split the column to a list of lists. Per row, the outer list is the list of phrases, split by a comma+space , . The inner list is a list of words per phrase split by a space . Then, loop through each row and each list of lists to remove words that are duplicated (case-insensitive with using lower()).
The result of 1 will create a list of list of lists up to this line of code: lst2.append([' '.join(sl1) for sl1 in lst]). When we run the that line of code, it joins the words together into phrases again, now with the removed duplicate words, so now we just have a list of lists where the outer list is the rows and the inner list is the phrases for each row.
Set the column ot the list of lists with df['ColumnA'] = lst2 and then join the inner list with , to join the the phrases all back into one string again. Finally, use .replace to do some final cleanup that is required from some of the words that were removed.

df = pd.DataFrame({'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
 'ColumnA': {0: '6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" CAT TAIL',
  1: "5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182 dog",
  2: '1/2" Pipe, W/VALVE, Broken sides - packaging open, PIPE, Like NEW',
  3: '6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC, Wedding Ring',
  4: "5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging, 5' ladder"}})


df['ColumnA'] = (df['ColumnA'].str.split(', ').apply(lambda x: [y.split() for y in x]))

lst, lst2 = [], []
for i in df['ColumnA']:
    for j in i: lst.append([k for k in j if k.lower()
                            not in [sl2.lower() for sl1 in lst for sl2 in sl1]])
    lst2.append([' '.join(sl1) for sl1 in lst])
    lst = []
df['ColumnA'] = lst2
df['ColumnA'] = df['ColumnA'].apply(lambda x: ', '.join(x)).str.replace(' , ', ' ').replace(', $','', regex=True)
df
Out[1]: 
   index                                                      ColumnA
0  0      6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" TAIL          
1  1      5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182                 
2  2      1/2" Pipe, W/VALVE, Broken sides - packaging open, Like NEW
3  3      6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC         
4  4      5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging

How to remove the duplicates excel column values in python while keeping the original order?

Answers (1)

Related Questions