goldsilvy

Reputation: 41

Python - data frame - cannot remove duplicates

This has been puzzling me for a while. I have the data set shown below as raw_data, and I have run two checks: #1 to identify a sample duplicate, and #2 to remove duplicates with drop_duplicates. Check #1 does identify a duplicate, yet check #2 does not seem to remove any.

import pandas as pd

raw_data = {'link':
           ['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
            'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
            'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
            'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
            'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
            'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
            'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}

df = pd.DataFrame(raw_data, columns = ["link"])

#duplicate check #1

a = df.iloc[12, 0]  # link in row 12
b = df.iloc[13, 0]  # link in row 13

if a == b:
    print("equal")

#duplicate check #2

df.drop_duplicates(['link'], keep='first')

Output:

https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
link
0   https://www.otodom.pl/oferta/mieszkanie-w-spok...
1   https://www.otodom.pl/oferta/mieszkanie-w-spok...
2   https://www.otodom.pl/oferta/mieszkanie-w-spok...
3   https://www.otodom.pl/oferta/mieszkanie-w-spok...
4   https://www.otodom.pl/oferta/zielony-widok-mie...
5   https://www.otodom.pl/oferta/zielony-widok-mie...
6   https://www.otodom.pl/oferta/nowoczesne-osiedl...
7   https://www.otodom.pl/oferta/nowoczesne-osiedl...
8   https://www.otodom.pl/oferta/nowoczesne-osiedl...
9   https://www.otodom.pl/oferta/nowoczesne-osiedl...
10  https://www.otodom.pl/oferta/mieszkanie-56-m-w...
11  https://www.otodom.pl/oferta/mieszkanie-56-m-w...
12  https://www.otodom.pl/oferta/idealny-2pok-apar...
13  https://www.otodom.pl/oferta/idealny-2pok-apar...

Help with understanding why the duplicates are not dropped would be appreciated, thanks!

Upvotes: 1

Views: 37

Answers (2)

Prince Francis

Reputation: 3097

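The links provided are not the same:

https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152

https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152

In one link the ID ends in an uppercase X and in the other a lowercase x, so drop_duplicates treats them as two distinct values.

Also, the variables a and b are always None, so the check prints "equal". Presumably they hold the return value of print(), which would also explain why both links appear in your output.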
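To make both points concrete, here is a minimal sketch using the two links from the question. The variable names and the last step (dropping rows whose links differ only in letter case) are my own additions, and whether a case-insensitive comparison is actually valid for these URLs is an assumption you would need to check.

```python
import pandas as pd

# Two links from the question; they differ only in the case of one letter.
a = "https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152"
b = "https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152"

exact_match = a == b                      # string comparison is case-sensitive
caseless_match = a.lower() == b.lower()   # comparison after lowercasing both

# print() returns None, so comparing two print() results compares None with None.
printed_a = print(a)
printed_b = print(b)
none_match = printed_a == printed_b

# Assumption: links that differ only in case point to the same listing.
# Lowercase a temporary copy of the column and keep the first row of each group.
df = pd.DataFrame({"link": [a, b]})
deduped = df[~df["link"].str.lower().duplicated(keep="first")]
```

The original links are kept as they are; only the comparison key is lowercased.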

Upvotes: 1

James

Reputation: 36623

You have to assign the output of drop_duplicates either back to df or to a new variable. By default it returns a new DataFrame and does not modify df in place.

df2 = df.drop_duplicates(['link'], keep='first')
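A small sketch of the difference, using a toy frame with made-up values rather than the data from the question. It also shows the inplace=True option of drop_duplicates as an alternative to reassigning.

```python
import pandas as pd

# Toy data (hypothetical values): one exact duplicate.
df = pd.DataFrame({"link": ["a", "a", "b"]})

# The result is discarded here, so df itself is unchanged.
df.drop_duplicates(["link"], keep="first")
rows_after_discard = len(df)

# Assigning the result keeps the deduplicated frame.
df2 = df.drop_duplicates(["link"], keep="first")
rows_in_df2 = len(df2)

# Alternatively, ask pandas to modify df directly.
df.drop_duplicates(subset=["link"], keep="first", inplace=True)
rows_after_inplace = len(df)
```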

Upvotes: 1
