Reputation: 11
I have a dataframe with 27949 rows and 7 columns; the first few rows look like this: https://i.sstatic.net/1Pipf.png
Task: The dataframe has a 'title' column containing many near-duplicate titles that I want to remove (a duplicate title is one that is almost identical to another, differing by only 1 or 2 words). Pseudo code: I want to compare the 1st row with all the other rows and remove any that are duplicates of it. Then I want to compare the 2nd row with all the other rows and remove any duplicates, and so on for every row, i.e. i = 1st line to last line, j = i+1 to last line. My code:
for i in range(0,27950):
    for j in range(1,27950):
        a = data_sorted['title'].iloc[i].split()
        b = data_sorted['title'].iloc[j].split()
        if len(a)-len(b)<=2:
            data_sorted.drop(b)
            j=j
        else:
            j+=1
    i+=1
Error: IndexError: single positional indexer is out-of-bounds
Can anyone please help me fix my code? Thanks in advance.
Upvotes: 1
Views: 153
Reputation: 154
I would suggest the following approach:
Build a difference matrix of your titles, where element (i, j) is the number of words that differ between the i-th and j-th titles.
Like so:
import numpy as np
from itertools import product

l = list(data_sorted['title'])

def diff_words(text_1, text_2):
    # return the number of different words between two texts
    words_1 = text_1.split()
    words_2 = text_2.split()
    # np.intersect1d returns the unique words common to both texts
    diff = max(len(words_1), len(words_2)) - len(np.intersect1d(words_1, words_2))
    return diff

differences = [diff_words(i, j) for i, j in product(l, l)]
# differences: a flat list of integers; the word difference between titles i and j is at index i * len(l) + j
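To actually remove the rows, one option is to skip the full matrix and keep a title only if it differs by more than 2 words from every title already kept. The following is a minimal sketch under a few assumptions: the helper name `drop_near_duplicates`, the `max_diff` parameter and the sample titles are hypothetical, and `diff_words` is rewritten with Python sets, which should give the same count as the `np.intersect1d` version. It also avoids two likely problems in the question's loop: `range(0, 27950)` reaches position 27949, which does not exist in a 27949-row frame (probably the source of the IndexError), and `DataFrame.drop` expects index labels and returns a new frame rather than modifying in place.

```python
import pandas as pd

def diff_words(text_1, text_2):
    # Length of the longer title minus the number of unique shared words
    words_1 = text_1.split()
    words_2 = text_2.split()
    return max(len(words_1), len(words_2)) - len(set(words_1) & set(words_2))

def drop_near_duplicates(df, column="title", max_diff=2):
    # Hypothetical helper: keeps the first occurrence of each group of near-duplicates
    titles = df[column].tolist()
    kept = []  # positions of the rows kept so far
    for pos, title in enumerate(titles):
        # Keep this row only if it is not a near-duplicate of any kept row
        if all(diff_words(title, titles[k]) > max_diff for k in kept):
            kept.append(pos)
    return df.iloc[kept]

# Made-up sample data, for illustration only
example = pd.DataFrame({"title": [
    "red cotton shirt for men",
    "red cotton shirt for women",
    "blue denim jeans",
    "red cotton shirt for men large",
    "blue denim jeans",
]})
result = drop_near_duplicates(example)
print(result)
```

This is still quadratic in the worst case, so with roughly 28,000 titles it may take a while, but it never indexes past the end of the frame and never builds the full list of pairwise differences.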
Upvotes: 1