Reputation: 683
I'm trying to unit-test a function that deals with csv files with Pytest. While my function works, I feel like there's a lot of code repetition when creating "sample" csv files in my project directory to test the function. The actual csv file that holds the real data has millions of records.
These are not the only csv files I have to test in my module, so it would be immensely helpful to know what's the best way to test functions that work with different file structures.
Right now I'm creating a very short csv file that mimics the actual file schema with a single line of data plus expected dataframe output after the file is processed through the function.
Perhaps mocking is the way to go? But I feel like mocking shouldn't be necessary for this kind of testing.
import csv
import os

import pandas as pd
import pytest
from pandas import testing

@pytest.mark.parametrize('test_file, expected', [
    (r'Path\To\Project\Output\Folder\mock_sales1.csv',
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales'])),
    (r'Path\To\Project\Output\Folder\mock_sales2.csv',
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales']))
])
def test_sales_dataframe(test_file, expected):
    # This part is repetitive: each test needs a separate file written within the test function.
    # Write a sample file to test that files with 7 columns are read correctly.
    mock_sales1 = [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]]
    with open(r'Path\To\Project\Output\Folder\mock_sales1.csv', 'w') as file:
        writer = csv.writer(file)
        writer.writerows(mock_sales1)

    # Write a sample file to test that files with 8 columns are read correctly.
    mock_sales2 = [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]]
    with open(r'Path\To\Project\Output\Folder\mock_sales2.csv', 'w') as file:
        writer = csv.writer(file)
        writer.writerows(mock_sales2)

    sales_df = mks_sales_dataframe(test_file)
    testing.assert_frame_equal(expected, sales_df)

    os.remove(r'Path\To\Project\Output\Folder\mock_sales1.csv')
    os.remove(r'Path\To\Project\Output\Folder\mock_sales2.csv')
def mks_sales_dataframe(file):
    try:
        with open(file, 'r') as f:
            reader = csv.reader(f)
            num_cols = len(next(reader))
            # The number of columns is variable; this list specifies which columns
            # should be read. This is the part I'm testing!
            columns = [1, 2, (num_cols - 1)]
            sales_df = pd.read_csv(file, usecols=columns, names=['Postal_Code', 'Store_Num', 'Sales'])
            return sales_df
    except FileNotFoundError:
        raise FileNotFoundError(file)
The test passes as intended. However, for every test I have to create a sample csv file within the test function and delete it once the test is finished. As you can imagine, that's a lot of repetitive code within a single test function, which feels clunky and wordy, especially when the test is parametrized.
Upvotes: 8
Views: 22053
Reputation: 121
One way to reduce some of the repetition is to use the setUp and tearDown methods of a unittest.TestCase:
import os
import csv
import unittest

test_file = 'test.csv'
rows = [
    ['0a', '0b', '0c'],
    ['1a', '1b', '1c'],
]

class TestCsv(unittest.TestCase):
    def setUp(self):
        # Runs before each test: write the sample csv file.
        with open(test_file, 'w', newline='') as csv_file:
            writer = csv.writer(csv_file, dialect='excel')
            writer.writerows(rows)

    def tearDown(self):
        # Runs after each test: remove the sample csv file.
        os.remove(test_file)

    def test_read_line(self):
        with open(test_file, 'r') as csv_file:
            reader = csv.reader(csv_file, dialect='excel')
            self.assertEqual(next(reader), rows[0])
            self.assertEqual(next(reader), rows[1])

if __name__ == "__main__":
    unittest.main()
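Since the question uses pytest, roughly the same setup/teardown can be written as a fixture. Here is a minimal sketch, assuming pytest's built-in tmp_path fixture so there is no manual cleanup at all; the names csv_file and test_read_line are just illustrative:

import csv

import pytest

rows = [
    ['0a', '0b', '0c'],
    ['1a', '1b', '1c'],
]

@pytest.fixture
def csv_file(tmp_path):
    # tmp_path is a per-test temporary directory provided by pytest;
    # the file is discarded with the directory, so no tearDown is needed.
    path = tmp_path / 'test.csv'
    with open(path, 'w', newline='') as f:
        csv.writer(f, dialect='excel').writerows(rows)
    return path

def test_read_line(csv_file):
    with open(csv_file, 'r', newline='') as f:
        reader = csv.reader(f, dialect='excel')
        assert next(reader) == rows[0]
        assert next(reader) == rows[1]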
Upvotes: 3
Reputation: 3138
I think the problem is that your test input and expected output are strongly tied but live in two different places: one in the parameters and the other in the test body.
If you change one parameter, you also have to change the body of your test, which is not right imo, on top of the duplicated code.
I think the parameters should be (test_data, expected) and the test should write the input data to a temporary file.
Then you call your function and compare the expected and actual output.
@pytest.mark.parametrize('test_data, expected', [
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]],
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales'])),
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]],
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales']))
])
def test_sales_dataframe(test_data, expected):
    # Write the test data to a temporary file
    tmp_file = r'Path\To\Project\Output\Folder\tmp.csv'
    with open(tmp_file, 'w') as file:
        writer = csv.writer(file)
        writer.writerows(test_data)

    # Process the data
    sales_df = mks_sales_dataframe(tmp_file)

    # Compare expected and actual output
    testing.assert_frame_equal(expected, sales_df)

    # Clean up the temporary file
    os.remove(tmp_file)
You could also create the .csv files ahead of time and add them to the project as test resources, but then your input and expected output would live in different locations, which is not that great.
Upvotes: 4