Charles
Charles

Reputation: 201

How to remove the duplicates excel column values in python while keeping the original order?

I have a data frame with columns values having different duplicates in them and I want to remove these duplicates from the entire data frame which consists of thousands of rows.

Data looks like this in the excel file:


index  ColumnA
0      6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" CAT TAIL
1      5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182 dog
2      1/2" Pipe, W/VALVE, Broken sides - packaging open, PIPE, Like NEW
3      6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC, Wedding Ring
4      5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging, 5' ladder


I have tried:

def removeduplicates(str):
     t=""
     for t in str:
         if(i in t):
            pass
         else:
             t =t+1

also

df['columnA'].apply(lambda cell: set([c.strip() for c in cell.strip(', ')]))

But both methods don't work for this case.

Required output:

ColumnA
6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" TAIL
5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182
1/2" Pipe, W/VALVE, Broken sides - packaging open, Like NEW
6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC 
5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging

Data file: https://1drv.ms/x/s!ArCp0UbnlDoughmn3Io9aOvhNykZ?e=vZQXdC

I have already tried drop duplicates etc. I do not want to drop rows. I do not want to drop columns. I have read this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html but could not find my answer.

Upvotes: 0

Views: 163

Answers (1)

David Erickson
David Erickson

Reputation: 16683

  1. You can split the column to a list of lists. Per row, the outer list is the list of phrases, split by a comma+space , . The inner list is a list of words per phrase split by a space . Then, loop through each row and each list of lists to remove words that are duplicated (case-insensitive with using lower()).
  2. The result of 1 will create a list of list of lists up to this line of code: lst2.append([' '.join(sl1) for sl1 in lst]). When we run the that line of code, it joins the words together into phrases again, now with the removed duplicate words, so now we just have a list of lists where the outer list is the rows and the inner list is the phrases for each row.
  3. Set the column ot the list of lists with df['ColumnA'] = lst2 and then join the inner list with , to join the the phrases all back into one string again. Finally, use .replace to do some final cleanup that is required from some of the words that were removed.

df = pd.DataFrame({'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
 'ColumnA': {0: '6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" CAT TAIL',
  1: "5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182 dog",
  2: '1/2" Pipe, W/VALVE, Broken sides - packaging open, PIPE, Like NEW',
  3: '6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC, Wedding Ring',
  4: "5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging, 5' ladder"}})


df['ColumnA'] = (df['ColumnA'].str.split(', ').apply(lambda x: [y.split() for y in x]))

lst, lst2 = [], []
for i in df['ColumnA']:
    for j in i: lst.append([k for k in j if k.lower()
                            not in [sl2.lower() for sl1 in lst for sl2 in sl1]])
    lst2.append([' '.join(sl1) for sl1 in lst])
    lst = []
df['ColumnA'] = lst2
df['ColumnA'] = df['ColumnA'].apply(lambda x: ', '.join(x)).str.replace(' , ', ' ').replace(', $','', regex=True)
df
Out[1]: 
   index                                                      ColumnA
0  0      6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" TAIL          
1  1      5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182                 
2  2      1/2" Pipe, W/VALVE, Broken sides - packaging open, Like NEW
3  3      6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC         
4  4      5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging

Upvotes: 1

Related Questions