Jordan Violet

Reputation: 73

Most efficient way to compare two nearly identical CSVs in Python?

I have two CSVs, each with about 1M lines and the same n columns. I want the most efficient way to compare the two files and find where they differ. I would prefer to do this in Python rather than with any Excel-related tools.

Upvotes: 0

Views: 211

Answers (2)

Hiram Foster

Reputation: 128

Are you using pandas?

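import pandas as pd

# stack both files into one frame (DataFrame.append was removed in pandas 2.0)
df = pd.concat([pd.read_csv('file1.csv'), pd.read_csv('file2.csv')], ignore_index=True)

# boolean Series marking rows that repeat an earlier row
df.duplicated()

# dataframe with duplicates dropped (first occurrence of each row is kept)
df[~df.duplicated()]

# dataframe with only the repeated rows
df[df.duplicated()]

# rows that appear exactly once, i.e. the differences between the files
# (assuming neither file repeats a row internally)
df.drop_duplicates(keep=False)

# number of duplicate rows present
df.duplicated().sum()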

Upvotes: 2

v.coder

Reputation: 1932

An efficient way would be to read each line from the first file (the one with fewer lines) and store it in a set or dictionary, where membership lookups take O(1) time on average.

Then read the lines from the second file one at a time and check whether each one exists in the set.
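
A minimal sketch of that idea, using only the standard library. The function name `diff_lines` and its return shape are made up for illustration. It assumes rows can be compared as raw text (so `1.0` and `1` count as different) and it ignores row order and repeated rows.

```python
def diff_lines(path_a, path_b):
    """Return (only_in_a, only_in_b): sets of lines found in just one file.

    Hypothetical helper, not from any library. Lines are compared as raw
    text with the trailing line ending removed.
    """
    # Load the first file into a set for O(1) average-time lookups.
    with open(path_a) as fa:
        lines_a = {line.rstrip("\r\n") for line in fa}

    only_in_b = set()
    matched = set()
    # Stream the second file so it never has to sit fully in a list.
    with open(path_b) as fb:
        for line in fb:
            line = line.rstrip("\r\n")
            if line in lines_a:
                matched.add(line)
            else:
                only_in_b.add(line)

    # Whatever was never matched exists only in the first file.
    only_in_a = lines_a - matched
    return only_in_a, only_in_b
```

Calling `only_a, only_b = diff_lines('file1.csv', 'file2.csv')` gives the lines unique to each file. With about 1M lines per file, the set holding the first file should fit in memory on most machines, but that depends on how wide the rows are.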

Upvotes: 1
