Reputation: 111
I have a class that does some validates on a pandas dataframe I read. The class looks something like this (simplified some stuff might make no sense)
import pandas as pd
class PandasValidator:
read_kwargs = {'sep'='\t',header=None}
def __init__(self,path_to_data:str,max_rows:int) -> None:
self.path = path
def validate_num_rows(self,threshold: float = 0.1) -> bool:
df_shape = pd.read_csv(self.path,*self.read_kwargs).shape
return df_shape[0]*threshold <= self.max_rows
I want to test the method validate_num_rows, so I would like to patch the first line of the function, I don't to read an actual df when testing it, my test would look something like this (this is not working code, my best attempt).
@patch('df.read_csv') #not sure what goes in here
def test_validate_num_rows(mock) -> None:
mock.shape=(30,30)
result = PandasValidator('dummy-path',30).validate_num_rows(0.1)
assert result == True
To be honest I have no idea what to patch and mock or how to do it. I want to mock the first line of the validate_num_rows method. I know refactoring the code would make testing easier but that's not a choice I have
Upvotes: 1
Views: 619
Reputation: 3082
Your class would be easier to test if it accepted a dataframe instead of reading it itself.
import pandas as pd
class PandasValidator:
def __init__(self, df: pd.DataFrame, max_rows: int) -> None:
self._df = df
self._max_rows = max_rows
def validate_num_rows(self, threshold: float = 0.1) -> bool:
return self._df.shape * threshold <= self._max_rows
Now in your test, you just need to construct a dataframe in memory and pass it to PandasValidator
.
Then you can make another function that reads a dataframe from a file:
import pandas as pd
from pathlib import Path
def read_csv(path: Path) -> pd.DataFrame:
return pd.read_csv(path, sep='\t', header=None)
If you want to test this function, you can use the monkeypatch
fixture from pytest
:
import pandas
import your_module
def test_read_csv(monkeypatch):
expected_dataframe = # create a dataframe somehow
def fake_read_csv(path, **kwargs):
assert path == Path('/foo/bar')
assert kwargs == {'sep': '\t', 'header': None}
return expected_dataframe
monkeypatch.setattr(pandas, "read_csv", fake_read_csv)
actual_dataframe = your_module.read_csv(Path('/foo/bar'))
assert actual_dataframe == expected_dataframe
Another approach is to not use monkeypatching at all, but to test the function against the real file system using temporary files:
from pathlib import Path
import your_module
def test_read_csv(tmp_dir: Path):
csv_path = tmp_dir / "my_fake_csv.csv"
csv_path.write_text(some_predefined_csv)
dataframe = your_module.read_csv(csv_path)
# TODO: assert some properties about your dataframe
Some people would characterize it as an integration test, not a unit test, because it talks to the real file system. Perhaps it's more suitable for this function because it doesn't do much except talk to the file system. Also see "Stop Using Mocks (for a while)" by Harry Percival
Upvotes: 1