Saikat
Saikat

Reputation: 443

How can I compare two large CSV files using Dask

I am having two CSV files(approx 4GB each) and I want to check the difference between the entries of these two files.

Suppose Row1 entries in 1.csv doesn't match with row1 of 2.csv but identical to row 100 of 2.csv then it shouldn't show any difference.

Difference should be seen only when there are no identical entries in both of the CSV file. Constraint is can't use any database.

I am using dask.Dataframe to input this files, but I don't see any api's or function to find the difference in Dask documentations.

I even haven't find anyway to convert Dask Dataframes into Panda Dataframes, neither I am able to convert this Dataframes into any of text or CSV file.

Is there any solution for this to compare such huge files and find out the difference.

Please find the sample code I have tried.

import dask.dataframe as dd
import numpy.testing as npt
import pandas as pd

filename1 = '/Users/saikatbhattacharjee/Downloads/2008.csv'
df1 = dd.read_csv(filename1, assume_missing=True)
filename2 = '/Users/saikatbhattacharjee/Downloads/2009.csv'
df2 = dd.read_csv(filename2, assume_missing=True )

def assert_frames_equal(actual, expected, use_close=False):
    """
    Compare DataFrame items by index and column and
    raise AssertionError if any item is not equal.

    Ordering is unimportant, items are compared only by label.
    NaN and infinite values are supported.

    Parameters
    ----------
    actual : pandas.DataFrame
    expected : pandas.DataFrame
    use_close : bool, optional
    If True, use numpy.testing.assert_allclose instead of
    numpy.testing.assert_equal.

    """
    if use_close:
        comp = npt.assert_allclose
    else:
        comp = npt.assert_equal

    assert (isinstance(actual, pd.DataFrame) and
            isinstance(expected, pd.DataFrame)), \
        'Inputs must both be pandas DataFrames.'

    for i, exp_row in expected.iterrows():
        assert i in actual.index, 'Expected row {!r} not 
        found.'.format(i)

        act_row = actual.loc[i]

        for j, exp_item in exp_row.iteritems():
            assert j in act_row.index, \
                'Expected column {!r} not found.'.format(j)

            act_item = act_row[j]

            if comp(act_item, exp_item):
               print("CSV files are identical") 
            else:
                print('The difference in CSV files are'.format(j, i))

actual = pd.DataFrame(df1)
expected = pd.Dataframe(df2)
assert_frames_equal(actual, expected)

I am getting below error on this:

File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/saikatbhattacharjee/.spyder-py3/temp.py", line 52, in <module>
    actual = pd.DataFrame(df1)

  File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 354, in __init__
    raise ValueError('DataFrame constructor not properly called!')

ValueError: DataFrame constructor not properly called!

Upvotes: 3

Views: 1875

Answers (1)

mdurant
mdurant

Reputation: 28683

A couple of quick answers (and I would say this is pretty well documented)

To convert a dask dataframe to a pandas one:

actual = df1.compute()

(make sure you have enough memory to do this!)

To write data out to CSV

actual.to_csv('myfiles*.csv')

This writes one file per partition, because multiple threads/processes cannot write to a single file simultaneously in any sensible manner; you can concatenate the files (removing the header, or having used header=False) into one, if you wish.

Upvotes: 1

Related Questions