Reputation: 41
This has been puzzling mew for a while. I have the following data set denoted under raw data
, and have run two checks, #1 to identify a sample duplicate, and #2 to remove duplicates with drop_duplicates
. The #1 test does identify duplicates, yet #2 does not seem to remove any duplicates.
raw_data = {'link':
['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}
df = pd.DataFrame(raw_data, columns = ["link"])
#duplicate check #1
a = df.iloc[12][0]
b = df.iloc[13][0]
if a == b:
print("equal")
#duplicate check #2
df.drop_duplicates(['link'], keep='first')
Output:
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
link
0 https://www.otodom.pl/oferta/mieszkanie-w-spok...
1 https://www.otodom.pl/oferta/mieszkanie-w-spok...
2 https://www.otodom.pl/oferta/mieszkanie-w-spok...
3 https://www.otodom.pl/oferta/mieszkanie-w-spok...
4 https://www.otodom.pl/oferta/zielony-widok-mie...
5 https://www.otodom.pl/oferta/zielony-widok-mie...
6 https://www.otodom.pl/oferta/nowoczesne-osiedl...
7 https://www.otodom.pl/oferta/nowoczesne-osiedl...
8 https://www.otodom.pl/oferta/nowoczesne-osiedl...
9 https://www.otodom.pl/oferta/nowoczesne-osiedl...
10 https://www.otodom.pl/oferta/mieszkanie-56-m-w...
11 https://www.otodom.pl/oferta/mieszkanie-56-m-w...
12 https://www.otodom.pl/oferta/idealny-2pok-apar...
13 https://www.otodom.pl/oferta/idealny-2pok-apar...
Help would appreciated with reasoning why duplicates do not drop, thanks!
Upvotes: 1
Views: 37
Reputation: 3097
The links provided are not same.
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X
.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x
.html#64f19d3152
In one link it is X
and in other it is x
Also variable a
and b
are always None
so it print equal
Upvotes: 1
Reputation: 36623
You have to reassign the output of drop_duplicates
either to df
or to a new variable. It does not happen in-place.
df2 = df.drop_duplicates(['link'], keep='first')
Upvotes: 1