Reputation: 443
I am having two CSV files(approx 4GB each) and I want to check the difference between the entries of these two files.
Suppose Row1 entries in 1.csv doesn't match with row1 of 2.csv but identical to row 100 of 2.csv then it shouldn't show any difference.
Difference should be seen only when there are no identical entries in both of the CSV file. Constraint is can't use any database.
I am using dask.Dataframe to input this files, but I don't see any api's or function to find the difference in Dask documentations.
I even haven't find anyway to convert Dask Dataframes into Panda Dataframes, neither I am able to convert this Dataframes into any of text or CSV file.
Is there any solution for this to compare such huge files and find out the difference.
Please find the sample code I have tried.
import dask.dataframe as dd
import numpy.testing as npt
import pandas as pd
filename1 = '/Users/saikatbhattacharjee/Downloads/2008.csv'
df1 = dd.read_csv(filename1, assume_missing=True)
filename2 = '/Users/saikatbhattacharjee/Downloads/2009.csv'
df2 = dd.read_csv(filename2, assume_missing=True )
def assert_frames_equal(actual, expected, use_close=False):
"""
Compare DataFrame items by index and column and
raise AssertionError if any item is not equal.
Ordering is unimportant, items are compared only by label.
NaN and infinite values are supported.
Parameters
----------
actual : pandas.DataFrame
expected : pandas.DataFrame
use_close : bool, optional
If True, use numpy.testing.assert_allclose instead of
numpy.testing.assert_equal.
"""
if use_close:
comp = npt.assert_allclose
else:
comp = npt.assert_equal
assert (isinstance(actual, pd.DataFrame) and
isinstance(expected, pd.DataFrame)), \
'Inputs must both be pandas DataFrames.'
for i, exp_row in expected.iterrows():
assert i in actual.index, 'Expected row {!r} not
found.'.format(i)
act_row = actual.loc[i]
for j, exp_item in exp_row.iteritems():
assert j in act_row.index, \
'Expected column {!r} not found.'.format(j)
act_item = act_row[j]
if comp(act_item, exp_item):
print("CSV files are identical")
else:
print('The difference in CSV files are'.format(j, i))
actual = pd.DataFrame(df1)
expected = pd.Dataframe(df2)
assert_frames_equal(actual, expected)
I am getting below error on this:
File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/saikatbhattacharjee/.spyder-py3/temp.py", line 52, in <module>
actual = pd.DataFrame(df1)
File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 354, in __init__
raise ValueError('DataFrame constructor not properly called!')
ValueError: DataFrame constructor not properly called!
Upvotes: 3
Views: 1875
Reputation: 28683
A couple of quick answers (and I would say this is pretty well documented)
actual = df1.compute()
(make sure you have enough memory to do this!)
actual.to_csv('myfiles*.csv')
This writes one file per partition, because multiple threads/processes cannot write to a single file simultaneously in any sensible manner; you can concatenate the files (removing the header, or having used header=False
) into one, if you wish.
Upvotes: 1