Reputation: 201
I have a data frame with columns values having different duplicates in them and I want to remove these duplicates from the entire data frame which consists of thousands of rows.
Data looks like this in the excel file:
index ColumnA
0 6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" CAT TAIL
1 5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182 dog
2 1/2" Pipe, W/VALVE, Broken sides - packaging open, PIPE, Like NEW
3 6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC, Wedding Ring
4 5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging, 5' ladder
I have tried:
def removeduplicates(str):
t=""
for t in str:
if(i in t):
pass
else:
t =t+1
also
df['columnA'].apply(lambda cell: set([c.strip() for c in cell.strip(', ')]))
But both methods don't work for this case.
Required output:
ColumnA
6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" TAIL
5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182
1/2" Pipe, W/VALVE, Broken sides - packaging open, Like NEW
6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC
5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging
Data file: https://1drv.ms/x/s!ArCp0UbnlDoughmn3Io9aOvhNykZ?e=vZQXdC
I have already tried drop duplicates etc. I do not want to drop rows. I do not want to drop columns. I have read this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html but could not find my answer.
Upvotes: 0
Views: 163
Reputation: 16683
,
. The inner list is a list of words per phrase split by a space
. Then, loop through each row and each list of lists to remove words that are duplicated (case-insensitive with using lower()
).lst2.append([' '.join(sl1) for sl1 in lst])
. When we run the that line of code, it joins the words together into phrases again, now with the removed duplicate words, so now we just have a list of lists where the outer list is the rows and the inner list is the phrases for each row.df['ColumnA'] = lst2
and then join the inner list with ,
to join the the phrases all back into one string again. Finally, use .replace
to do some final cleanup that is required from some of the words that were removed.df = pd.DataFrame({'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'ColumnA': {0: '6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" CAT TAIL',
1: "5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182 dog",
2: '1/2" Pipe, W/VALVE, Broken sides - packaging open, PIPE, Like NEW',
3: '6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC, Wedding Ring',
4: "5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging, 5' ladder"}})
df['ColumnA'] = (df['ColumnA'].str.split(', ').apply(lambda x: [y.split() for y in x]))
lst, lst2 = [], []
for i in df['ColumnA']:
for j in i: lst.append([k for k in j if k.lower()
not in [sl2.lower() for sl1 in lst for sl2 in sl1]])
lst2.append([' '.join(sl1) for sl1 in lst])
lst = []
df['ColumnA'] = lst2
df['ColumnA'] = df['ColumnA'].apply(lambda x: ', '.join(x)).str.replace(' , ', ' ').replace(', $','', regex=True)
df
Out[1]:
index ColumnA
0 0 6-1/2" CAT, SMELLS, BAD, XS, A-403 -316L, 4" TAIL
1 1 5-1/2' DOG, ROUND HEAD, SLIM, 60 LB, A-182
2 2 1/2" Pipe, W/VALVE, Broken sides - packaging open, Like NEW
3 3 6" WEDDING RING, 1 ct, RF, 1/2" WIDE, Diamond MISC
4 4 5' Ladder, 50LB, new, 1/2' STEPS, 316L -, with packaging
Upvotes: 1