Reputation: 73
I have two CSVs, each with about 1M lines and the same n columns. I want the most efficient way to compare the two files and find where they differ. I would prefer to parse the data with Python rather than use any Excel-related tools.
Upvotes: 0
Views: 211
Reputation: 128
Are you using pandas?
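import pandas as pd
# stack both files into one frame (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([pd.read_csv('file1.csv'), pd.read_csv('file2.csv')], ignore_index=True)
# boolean Series marking rows that repeat an earlier row
df.duplicated()
# dataframe with repeats dropped (the first occurrence of each row is kept)
df[~df.duplicated()]
# dataframe with only the repeated rows
df[df.duplicated()]
# number of duplicate rows present
df.duplicated().sum()
# rows that occur exactly once, i.e. the differences, assuming neither file repeats a row internally
df[~df.duplicated(keep=False)]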
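If you also need to know which file each differing row came from, one option is an outer merge with an indicator column. This is only a sketch: `diff_frames` is a made-up helper name, and it assumes both frames have identical column names and that neither file contains repeated rows (repeated rows would be multiplied by the merge).

```python
import pandas as pd


def diff_frames(left, right):
    """Return the rows found in only one of the two frames.

    Hypothetical helper: merges on all shared columns and keeps the
    rows that pandas tags as 'left_only' or 'right_only'.
    """
    # With no `on` argument, merge joins on every column the frames share.
    merged = left.merge(right, how="outer", indicator=True)
    # The indicator column is named '_merge' by default.
    return merged[merged["_merge"] != "both"]


# Usage with the two files from the question (paths are placeholders):
# diff = diff_frames(pd.read_csv("file1.csv", dtype=str),
#                    pd.read_csv("file2.csv", dtype=str))
```

Reading with `dtype=str` is a design choice here: it compares the text exactly as written in the files and avoids mismatches caused by pandas inferring different column types for each file.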
Upvotes: 2
Reputation: 1932
An efficient approach is to read each line of the first file (the one with fewer lines) and store it in a hash-based container such as a set or dictionary, which gives O(1) average-time lookups.
Then read the lines of the second file and check whether each one exists in the set.
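A minimal sketch of this idea using only the standard library is below. The function name `diff_rows` is made up for illustration, and it assumes row order and repeated rows do not matter, since sets discard both.

```python
import csv


def diff_rows(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sets of row tuples.

    Hypothetical helper: each argument is an open file object or any
    other iterable of CSV lines.
    """
    # Load the first input into a set for O(1) average-time membership tests.
    rows_a = {tuple(row) for row in csv.reader(lines_a)}

    only_in_b = set()
    matched = set()
    # Stream the second input and look each row up in the set.
    for row in csv.reader(lines_b):
        key = tuple(row)
        if key in rows_a:
            matched.add(key)
        else:
            only_in_b.add(key)

    # Rows of the first input that never showed up in the second.
    only_in_a = rows_a - matched
    return only_in_a, only_in_b


# Usage with the two files from the question (paths are placeholders):
# with open("file1.csv", newline="") as fa, open("file2.csv", newline="") as fb:
#     only_in_a, only_in_b = diff_rows(fa, fb)
```

Only the first file is held in memory in full; the second is read one row at a time, which is why loading the smaller file first helps.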
Upvotes: 1