Most efficient way to compare two nearly identical CSVs in Python?

Question

I have two CSVs, each with about 1M lines and the same n columns. I want the most efficient way to compare the two files and find where they differ. I would prefer to parse this data with Python rather than use any Excel-related tools.

Hiram Foster · Accepted Answer

Are you using pandas?

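import pandas as pd

# stack both files into one dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat(
    [pd.read_csv('file1.csv'), pd.read_csv('file2.csv')],
    ignore_index=True,
)

# boolean Series indicating which rows are duplicated
df.duplicated()

# dataframe with duplicates dropped (first occurrence of each row kept)
df[~df.duplicated()]

# dataframe with only the duplicate rows (second and later occurrences)
df[df.duplicated()]

# rows that occur exactly once, i.e. the differences between the files
# (assuming no row is repeated within a single file)
df[~df.duplicated(keep=False)]

# number of duplicate rows present
df.duplicated().sum()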
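
The concatenated frame above does not record which file a differing row came from. One possible way to get that, sketched here as an assumption rather than as part of the original answer, is an outer merge with indicator=True, which adds a _merge column marking each row as left_only, right_only, or both. The helper name diff_csv_rows and the in-memory sample data are hypothetical; with real files you would pass 'file1.csv' and 'file2.csv' instead.

```python
import io

import pandas as pd


def diff_csv_rows(left_source, right_source):
    """Return the rows that appear in only one of the two CSV inputs."""
    left = pd.read_csv(left_source)
    right = pd.read_csv(right_source)
    # With no `on` argument, merge joins on every column the frames share,
    # so a row only counts as "both" when all of its values match.
    merged = left.merge(right, how="outer", indicator=True)
    # Keep the rows that are missing from one side.
    return merged[merged["_merge"] != "both"].reset_index(drop=True)


# Hypothetical stand-ins for file1.csv and file2.csv
file1 = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
file2 = io.StringIO("id,name\n1,a\n2,x\n3,c\n")

diff = diff_csv_rows(file1, file2)
print(diff)
```

A changed row shows up twice in the result, once as left_only (the old values) and once as right_only (the new values). Rows repeated within a single file are matched pairwise by the merge, so this sketch is best suited to files whose rows are unique.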
