Removing duplicate rows in a dataframe in Python

Question

I have a dataframe with 27949 rows and 7 columns. The first few rows look like this: https://i.sstatic.net/1Pipf.png

Task: The dataframe has a 'title' column containing many duplicate titles that I want to remove. By duplicate I mean that two titles are almost identical, differing by only 1 or 2 words.

Pseudocode: I want to compare the 1st row with all other rows and remove any row that duplicates it. Then I want to compare the 2nd row with all other rows and remove any row that duplicates it, and so on for every row. That is, i runs from the first row to the last row, and j runs from i+1 to the last row.

My code:

for i in range(0,27950):
    for j in range(1,27950):
        a = data_sorted['title'].iloc[i].split()
        b = data_sorted['title'].iloc[j].split()
        if len(a)-len(b)<=2:
            data_sorted.drop(b)
            j=j
        else:
            j+=1
    i+=1

Error: IndexError: single positional indexer is out-of-bounds

Can anyone please help me fix my code? Thanks in advance.

Answers (1)
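
The IndexError most likely comes from the loop bounds: range(0, 27950) runs up to position 27949, but a dataframe with 27949 rows only has positions 0 through 27948. A few other things in the loop also work against you: len(a)-len(b) compares only the word counts and not the words themselves, data_sorted.drop(b) passes a list of words where index labels are expected and its result is never assigned back, and reassigning i or j inside a for loop has no effect on the next iteration.

Below is a minimal sketch of one way to do what you describe, not the only way. It assumes that two titles count as duplicates when they differ in at most 2 word positions (my reading of "same except for 1 or 2 words"), and that every title is a string. The helper names is_near_duplicate and drop_near_duplicates and the max_diff parameter are made up for this example, and the small dataframe is invented sample data standing in for your data_sorted.

```python
from itertools import zip_longest

import pandas as pd


def is_near_duplicate(title_a, title_b, max_diff=2):
    """Return True if the titles differ in at most max_diff word positions."""
    words_a = title_a.split()
    words_b = title_b.split()
    # zip_longest pads the shorter title with None, so any extra words
    # in the longer title count as differences.
    diff = sum(1 for wa, wb in zip_longest(words_a, words_b) if wa != wb)
    return diff <= max_diff


def drop_near_duplicates(df, column="title", max_diff=2):
    """Keep the first row of each group of near-duplicate titles."""
    titles = df[column].tolist()
    kept_positions = []
    for pos, title in enumerate(titles):
        # Compare only against rows that were kept, never against dropped ones.
        if not any(is_near_duplicate(title, titles[k], max_diff)
                   for k in kept_positions):
            kept_positions.append(pos)
    # Select the surviving rows by position and renumber the index.
    return df.iloc[kept_positions].reset_index(drop=True)


# Invented sample data standing in for data_sorted (assumption, not real data).
data_sorted = pd.DataFrame({
    "title": [
        "womens cotton shirt blue small",
        "womens cotton shirt blue large",
        "womens cotton shirt red medium",
        "mens leather wallet brown",
    ]
})

deduped = drop_near_duplicates(data_sorted)
print(deduped["title"].tolist())
```

With this sample data the second and third rows are treated as duplicates of the first, so only the first and last titles remain. Each row is compared with every row kept so far, so on 27949 rows this can take a while if few titles are duplicates. A position-by-position comparison also treats a title with one word inserted near the start as very different; if that matters for your data, a different similarity measure would be needed.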
