ChrizZlyBear

Reputation: 193

How to find duplicates in a pandas Dataframe

I want to read a folder with some .csv files in it and find duplicate coordinates. The .csv looks like this:

0 0 0 1 1 
0 1 2 1 1 
0 0 0 1 2
...

Here rows 0 and 2 would be duplicates, as the first 3 columns (the coordinates) are the same.

I thought that sorting the dataframe before comparing it might speed up the code, but I am not sure how to sort it correctly in Python. I would sort by the first column; then, for rows with the same value in the first column, by the second column, and likewise by the third. So the dataframe:

0 1 1 1 1
0 1 0 1 2
2 0 1 0 0
0 0 0 1 1
would look like this: 
0 0 0 1 1
0 1 0 1 2
0 1 1 1 1
2 0 1 0 0
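
In pandas this kind of lexicographic, multi-column sort can be done with sort_values. A minimal sketch on the small frame above (the positional column labels 0, 1, 2 are an assumption; a frame built without explicit column names gets integer labels):

import pandas as pd

# The unsorted example frame from above (integer column labels by default)
df = pd.DataFrame([[0, 1, 1, 1, 1],
                   [0, 1, 0, 1, 2],
                   [2, 0, 1, 0, 0],
                   [0, 0, 0, 1, 1]])

# Sort by the first column, then the second, then the third
df_sorted = df.sort_values(by=[0, 1, 2], ignore_index=True)
print(df_sorted)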

My code so far looks like this:

import pandas as pd
import glob
from tkinter import filedialog

# Let the user pick the folder that contains the coordinate files
path = filedialog.askdirectory(title="Select Coordinate File")
all_files = glob.glob(path + "/*.csv")

# Read every .csv into its own dataframe and collect them
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

# Stack all files into one dataframe with a fresh index
frame = pd.concat(li, axis=0, ignore_index=True)
# Sort frame
# Compare rows
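
One possible way to fill in those two steps, sketched here under the assumption that the coordinates are the first three columns of the combined frame (the real names come from the CSV header):

coord_cols = list(frame.columns[:3])  # assumption: first three columns are the coordinates
frame = frame.sort_values(by=coord_cols, ignore_index=True)
# keep=False marks every row whose coordinate triple occurs more than once
duplicates = frame[frame.duplicated(subset=coord_cols, keep=False)]
print(duplicates)

Note that duplicated works on the unsorted frame as well, so the sort mainly helps when eyeballing the output.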

Upvotes: 0

Views: 482

Answers (1)

jottbe

Reputation: 4521

You could use groupby, like this:

deduplicated_df = df.groupby(['NameCol1', 'NameCol2', 'NameCol3']).aggregate('first')

In this case you get one line per combination of the three columns. The other column values are taken from the first record with the same value combination in the first 3 columns.

This makes the first 3 columns index columns. If you need them as regular columns, just do:

deduplicated_df.reset_index(inplace=True)
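
Applied to the sample data from the question, this might look as follows (the column names x, y, z, a, b are stand-ins, since the real ones come from your CSV header):

import pandas as pd

# Hypothetical column names standing in for the real CSV header
df = pd.DataFrame([[0, 0, 0, 1, 1],
                   [0, 1, 2, 1, 1],
                   [0, 0, 0, 1, 2]],
                  columns=['x', 'y', 'z', 'a', 'b'])

deduplicated_df = df.groupby(['x', 'y', 'z']).aggregate('first')
deduplicated_df.reset_index(inplace=True)
# Rows 0 and 2 share the coordinates (0, 0, 0) and collapse into one line;
# the values of a and b come from the first of the two rows
print(deduplicated_df)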

Oh, I reread the question. I'm not sure if you just want to eliminate the duplicates (that's what the method above does) or print the duplicate coordinates. In the latter case, you can do something similar to the above. I guess you only need the coordinates, right?

In that case you can group again, produce a count column ('NameCol4' can be any existing column of your df), and then select all lines where the count is greater than 1:

deduplicated_df = df.groupby(['NameCol1', 'NameCol2', 'NameCol3']).aggregate({'NameCol4': 'count'})
deduplicated_df.reset_index(inplace=True)
deduplicated_df[deduplicated_df['NameCol4'] > 1]
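
For comparison, a sketch of the same selection done with duplicated instead of groupby (same placeholder column names):

# keep=False flags every occurrence of a repeated coordinate triple
mask = df.duplicated(subset=['NameCol1', 'NameCol2', 'NameCol3'], keep=False)
print(df.loc[mask, ['NameCol1', 'NameCol2', 'NameCol3']].drop_duplicates())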

Upvotes: 1
