goldsilvy

Reputation: 41

Pandas DataFrame - duplicated() does not identify duplicate values

EDIT: I have stripped the file down to the parts that are problematic.

import pandas as pd

raw_data = {"link":
           ['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
            'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
            'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
            'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
            'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
            'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
            'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
            'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
            'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}

df = pd.DataFrame(raw_data, columns = ["link"])

#duplicate check #1

a = print(df.iloc[12][0])
b = print(df.iloc[13][0])

if a == b:
    print("equal")

#duplicate check #2

df.duplicated()

For the first test I get the following output, implying there is a duplicate:

https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal

For the second test, it seems there are no duplicates:

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool

Original post:

I am trying to identify duplicate values in the "Link" column of the attached file:

data file

import pandas as pd

data = pd.read_csv(r"...\consolidated.csv", sep=",")

df = pd.DataFrame(data)

del df['Unnamed: 0']

duplicate_rows = df[df.duplicated(["Link"], keep="first")]

pd.DataFrame(duplicate_rows)

#a = print(df.iloc[42657][15])
#b = print(df.iloc[42676][15])

#if a == b:
#    print("equal")

I used the code above, but the result I keep getting is that there are no duplicates. I checked the file in Excel, and there should be seven duplicate instances. I even selected specific cells for a quick check (the part commented out with #s), and the values were reported as equal. Yet duplicated() does not capture them.

I have been scratching my head for a good hour and still have no idea what I'm missing. Help appreciated!

Upvotes: 1

Views: 2484

Answers (2)

LWNirvana

Reputation: 57

I had the same problem, and converting the column of the DataFrame to str helped.

E.g.:

df['link'] = df['link'].astype(str)
duplicate_rows = df[df.duplicated(["link"], keep="first")]
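To see when this conversion makes a difference, here is a minimal sketch on made-up data (the values are hypothetical, not taken from the question's file). It assumes the column holds a mix of types, for example the integer 1 and the string "1", which duplicated() treats as different until everything is cast to str. Note that the cast does not fold case, so "a" and "A" still count as distinct values afterwards.

```python
import pandas as pd

# Hypothetical data: the same ID stored once as an int and once as a str,
# plus two strings that differ only in case.
df = pd.DataFrame({"link": [1, "1", "a", "A"]})

# With mixed types, 1 and "1" are not considered equal.
before = df.duplicated(["link"], keep="first")

# After casting, both become the string "1" and the second one is flagged.
df["link"] = df["link"].astype(str)
after = df.duplicated(["link"], keep="first")
duplicate_rows = df[after]

print(before.tolist())
print(after.tolist())
print(duplicate_rows)
```

If the column already contains only strings, the cast changes nothing, so it is probably worth checking the values themselves as well.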

Upvotes: 1

dvaraujo

Reputation: 46

First, you don't need df = pd.DataFrame(data), as data = pd.read_csv(r"...\consolidated.csv", sep=",") already returns a DataFrame.
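A quick way to confirm this, sketched with an in-memory CSV standing in for consolidated.csv (the column names and rows below are made up, since the real file is not shown):

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of consolidated.csv.
csv_text = "Link,Price\nhttps://example.com/a,100\nhttps://example.com/b,200\n"

# read_csv accepts a file-like object as well as a path.
data = pd.read_csv(io.StringIO(csv_text), sep=",")

# The result is already a DataFrame, so no extra wrapping is needed.
is_dataframe = isinstance(data, pd.DataFrame)
print(type(data).__name__, is_dataframe)
```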

As for the deletion of duplicates, check the drop_duplicates method in the pandas documentation.
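Before dropping anything, it may be worth confirming that the rows really are equal. print() returns None, so in the question's check both a and b are None and a == b is always True, whatever the two cells contain. The sketch below compares the values directly, using two links copied from the question that differ only in the case of the last ID letter. Treating them as the same listing is an assumption on my part, as is the lowercase helper key; drop_duplicates itself matches exactly.

```python
import pandas as pd

# Two links copied from the question; only the last ID letter differs in case.
df = pd.DataFrame({"link": [
    "https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea",
    "https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea",
]})

# Compare the values themselves, not the return value of print().
a = df["link"].iloc[0]
b = df["link"].iloc[1]
same_exact = a == b
same_ignoring_case = a.lower() == b.lower()

# Exact matching: both rows are kept because the strings differ.
exact = df.drop_duplicates(subset=["link"], keep="first")

# Assumption: case should be ignored. Build a lowercase key and keep
# only the first occurrence of each key.
key = df["link"].str.lower()
folded = df[~key.duplicated(keep="first")]

print(same_exact, same_ignoring_case, len(exact), len(folded))
```

If a case-insensitive comparison is what the Excel check was doing, that would explain why it found duplicates where duplicated() did not.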

Hope this helps.

Upvotes: 0
