Reputation: 1052
I have a large dataset with a defined column structure, for which I have built a script/pipeline that broadly does two things: first, it ingests the data (formatting, cleaning, etc.), and second, it transforms values and creates a new column holding the transformed values (the final result). More or less: raw csv → ingest (clean) → transform (new column) → final csv.
My script is divided into two files (~150 lines of code each) and is composed of many pandas methods: .where, .replace, .map, .apply, etc. Given that pandas allows method chaining and is very flexible, the dataset can be processed without defining any functions (except a few for df.apply(func)). My code reads the csv into a df and starts processing it directly with the mentioned methods (.where, .replace, .map, .apply, etc.), without using any functions or the .pipe method. My project tree looks like:
/project
table.csv
ingest.py (outputs a clean intermediate_table.csv)
transform.py (reads the previous intermediate_table.csv and outputs a final_table.csv)
final_table.csv
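For context, the ingest step currently looks more or less like the sketch below; the column names and the specific cleaning rules are invented, just to illustrate the "chained methods, no functions" style:

# ingest.py (sketch -- columns A/B/C/D and the cleaning rules are placeholders)
import pandas as pd

df = pd.read_csv('table.csv')

# cleaning done directly with chained pandas methods, no user-defined functions
df['A'] = df['A'].str.strip().str.upper()
df['B'] = df['B'].replace({'N/A': None, '': None})
df['C'] = df['C'].where(df['C'] > 0)              # non-positive values become NaN
df['D'] = df['D'].map({'yes': 1, 'no': 0})

df.to_csv('intermediate_table.csv', index=False)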
The thing is, I need to send this code to other people who will run my script on more datasets, so I will need to comment and test it. Given the above, here are my questions in terms of code structure.
For example, should I have multiple small functions, like below?
df = pd.read_csv('file.csv')

def uppercase_column_A(dataframe, col): ...
def clean_column(dataframe, col): ...
def calculate_mean_here(dataframe, col): ...
def transform_values_there(dataframe, col): ...

(df
 .pipe(uppercase_column_A)
 .pipe(clean_column)
 .pipe(calculate_mean_here)
 .pipe(transform_values_there)
 # ...and so on
)
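Each of these would be a plain dataframe-in / dataframe-out function. As an illustration, the body of uppercase_column_A could look like this (the copy-then-return pattern and the default column 'A' are just my assumptions):

def uppercase_column_A(dataframe, col='A'):
    # return a modified copy so the original dataframe is left untouched
    out = dataframe.copy()
    out[col] = out[col].str.upper()
    return out

df = df.pipe(uppercase_column_A)   # same as uppercase_column_A(df)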
Or maybe just two big functions?
df = pd.read_csv('file.csv')

def ingest(df): ...                  # returns intermediate_df
def transform(intermediate_df): ...

(df
 .pipe(ingest)
 .pipe(transform)
)
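In that case each big function would internally bundle several of the smaller steps. A rough sketch of what I mean (column 'A', the value mapping, and the new column name A_code are all made up):

def ingest(df):
    # formatting / cleaning steps
    return (df
            .replace({'N/A': None})
            .assign(A=lambda d: d['A'].str.strip().str.upper()))

def transform(intermediate_df):
    # create the new column holding the transformed values (the final result)
    return intermediate_df.assign(A_code=lambda d: d['A'].map({'FOO': 1, 'BAR': 2}))

final_df = df.pipe(ingest).pipe(transform)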
I know the question is broad, but I think common practices matter as much as the code itself. In academia (my background), this does not matter much, as there is no 'production' side. So, in general, what would be a recommended industry way of building data pipelines in terms of code and structure?
Upvotes: 6
Views: 1782
Reputation: 1745
In my experience, using smaller functions is better for maintenance, since errors are easier to trace when each function is small and does one thing, with fewer levels of abstraction to dig through (which is exactly what two big functions will not give you).
My personal suggestions:
Add as many comments as you can: above functions, above variable names, below a function call, etc.
Be as descriptive as you can with names: calculate_mean_of_columns instead of calc_mean_cols, for example. Avoid, as much as you can, using abbreviations (even standard abbreviations in the DS community) like df or cols.
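For example, something along these lines; the mean calculation is only there to illustrate the naming style, not something from your pipeline:

import pandas as pd

def calculate_mean_of_columns(dataframe, column_names):
    """Add a column with the row-wise mean of the given columns."""
    # work on a copy so the caller's dataframe is left untouched
    result = dataframe.copy()
    result['mean_of_selected_columns'] = result[column_names].mean(axis=1)
    return result

measurements = pd.DataFrame({'height': [1.8, 1.6], 'weight': [80.0, 60.0]})
measurements = calculate_mean_of_columns(measurements, ['height', 'weight'])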
I'd structure my folders differently, honestly. My typical pipelines have had a consistent structure like this:
/project
/code
code_to_transform_dataframe.py
/data
datetimestamp_filename.csv
/output
datetimestamp_output.csv
You can use this as a framework for your own use case, but it's just the structure that has worked for me at a couple of different companies.
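To make that concrete, the script in /code would read the timestamped input from /data and write a timestamped result to /output. The file names and the timestamp format below are just one way of doing it:

# code/code_to_transform_dataframe.py (sketch)
from datetime import datetime
from pathlib import Path

import pandas as pd

project_root = Path(__file__).resolve().parent.parent
input_path = project_root / 'data' / '20240101_filename.csv'   # placeholder file name

dataframe = pd.read_csv(input_path)
# ... transformation steps go here ...

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_path = project_root / 'output' / f'{timestamp}_output.csv'
dataframe.to_csv(output_path, index=False)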
Upvotes: 2