Reputation: 683
I'm trying to unit-test a function that deals with csv files with Pytest. While my function works, I feel like there's a lot of code repetition when creating "sample" csv files in my project directory to test the function. The actual csv file that holds the real data has millions of records.
These are not the only csv files I have to test in my module, so it would be immensely helpful to know what's the best way to test functions that work with different file structures.
Right now I'm creating a very short csv file that mimics the actual file schema with a single line of data plus expected dataframe output after the file is processed through the function.
Perhaps mocking is the way to go? But I feel like mocking shouldn't be necessary for this kind of testing.
import csv
import os

import pandas as pd
import pytest
from pandas import testing

@pytest.mark.parametrize('test_file, expected', [
    (r'Path\To\Project\Output\Folder\mock_sales1.csv',
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales'])),
    (r'Path\To\Project\Output\Folder\mock_sales2.csv',
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales']))
])
def test_sales_dataframe(test_file, expected):
    # This part is repetitive: each test needs a separate file written within the test function.
    # Write a sample file to test that files with 7 columns are read correctly.
    mock_sales1 = [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]]
    with open(r'Path\To\Project\Output\Folder\mock_sales1.csv', 'w') as file:
        writer = csv.writer(file)
        writer.writerows(mock_sales1)

    # Write a sample file to test that files with 8 columns are read correctly.
    mock_sales2 = [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]]
    with open(r'Path\To\Project\Output\Folder\mock_sales2.csv', 'w') as file:
        writer = csv.writer(file)
        writer.writerows(mock_sales2)

    sales_df = mks_sales_dataframe(test_file)
    testing.assert_frame_equal(expected, sales_df)

    os.remove(r'Path\To\Project\Output\Folder\mock_sales1.csv')
    os.remove(r'Path\To\Project\Output\Folder\mock_sales2.csv')
def mks_sales_dataframe(file):
    try:
        with open(file, 'r') as f:
            reader = csv.reader(f)
            num_cols = len(next(reader))
            # The number of columns is variable; this list specifies which columns
            # should be read. This is the part I'm testing!
            columns = [1, 2, (num_cols - 1)]
            sales_df = pd.read_csv(file, usecols=columns, names=['Postal_Code', 'Store_Num', 'Sales'])
            return sales_df
    except FileNotFoundError:
        raise FileNotFoundError(file)
The test passes as intended. However, for every test I have to create a sample csv file within the test function and delete it once the test is finished. As you can imagine, that's a lot of repetitive code within a single test function, which feels clunky and wordy, especially when the test is parametrized.
Upvotes: 8
Views: 22053
Reputation: 121
One way to reduce some of the repetition is to use the setUp and tearDown methods of a unittest.TestCase:
import os
import csv
import unittest

test_file = 'test.csv'
rows = [
    ['0a', '0b', '0c'],
    ['1a', '1b', '1c'],
]

class TestCsv(unittest.TestCase):
    def setUp(self):
        # Runs before each test: write the sample csv file.
        with open(test_file, 'w', newline='') as csv_file:
            writer = csv.writer(csv_file, dialect='excel')
            writer.writerows(rows)

    def tearDown(self):
        # Runs after each test: remove the sample csv file.
        os.remove(test_file)

    def test_read_line(self):
        with open(test_file, 'r') as csv_file:
            reader = csv.reader(csv_file, dialect='excel')
            self.assertEqual(next(reader), rows[0])
            self.assertEqual(next(reader), rows[1])

if __name__ == "__main__":
    unittest.main()
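Since the question uses pytest, roughly the same setup/teardown can be written as a fixture. Here is a minimal sketch, assuming pytest's built-in tmp_path fixture so there is no manual cleanup at all; the names csv_file and test_read_line are just illustrative:

import csv

import pytest

rows = [
    ['0a', '0b', '0c'],
    ['1a', '1b', '1c'],
]

@pytest.fixture
def csv_file(tmp_path):
    # tmp_path is a per-test temporary directory provided by pytest;
    # the file is discarded with the directory, so no tearDown is needed.
    path = tmp_path / 'test.csv'
    with open(path, 'w', newline='') as f:
        csv.writer(f, dialect='excel').writerows(rows)
    return path

def test_read_line(csv_file):
    with open(csv_file, 'r', newline='') as f:
        reader = csv.reader(f, dialect='excel')
        assert next(reader) == rows[0]
        assert next(reader) == rows[1]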
Upvotes: 3
Reputation: 3138
I think the problem is that your test input and expected output are strongly tied but live in two different places: one in the parameters and the other in the test body.
If you change one parameter, you also have to change the body of your test, which is not right imo, on top of the duplicated code.
I think the parameters should be (test_data, expected) and the test should write the input data to a temporary file.
Then you call your function and compare the expected and actual output.
@pytest.mark.parametrize('test_data, expected', [
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]],
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales'])),
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]],
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales']))
])
def test_sales_dataframe(test_data, expected):
    # Write the test data to a temporary file
    tmp_file = r'Path\To\Project\Output\Folder\tmp.csv'
    with open(tmp_file, 'w') as file:
        writer = csv.writer(file)
        writer.writerows(test_data)

    # Process the data
    sales_df = mks_sales_dataframe(tmp_file)

    # Compare expected and actual output
    testing.assert_frame_equal(expected, sales_df)

    # Clean up the temporary file
    os.remove(tmp_file)
You could also create the .csv files ahead of time and add them to the project as test resources, but then your input and expected output would live in different locations, which is not that great.
Upvotes: 4