Reputation: 1052
I have a large dataset with a defined column structure, for which I have built a script/pipeline that broadly does two things: first, it ingests the data (formatting, cleaning, etc.), and second, it transforms values and creates a new column holding the transformed values (the final result). More or less: raw csv → ingest (clean) → transform (new column) → final csv.
My script is divided into two files (~150 lines of code each) and is composed of many pandas methods: .where, .replace, .map, .apply, etc. Given that pandas allows method chaining and is very flexible, the dataset can be processed without defining any functions (except a few for df.apply(func)). My code reads the csv into a df and starts processing it directly with the mentioned methods (.where, .replace, .map, .apply, etc.), without using any functions or the .pipe method. My project tree looks like:
/project
table.csv
ingest.py (outputs a clean intermediate_table.csv)
transform.py (reads the previous intermediate_table.csv and outputs a final_table.csv)
final_table.csv
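For context, the ingest step currently looks more or less like the sketch below; the column names and the specific cleaning rules are invented, just to illustrate the "chained methods, no functions" style:

# ingest.py (sketch -- columns A/B/C/D and the cleaning rules are placeholders)
import pandas as pd

df = pd.read_csv('table.csv')

# cleaning done directly with chained pandas methods, no user-defined functions
df['A'] = df['A'].str.strip().str.upper()
df['B'] = df['B'].replace({'N/A': None, '': None})
df['C'] = df['C'].where(df['C'] > 0)              # non-positive values become NaN
df['D'] = df['D'].map({'yes': 1, 'no': 0})

df.to_csv('intermediate_table.csv', index=False)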
The thing is, I need to send this code to other people who will run my script on more datasets, so I will need to comment and test it. Given the above, here are my questions in terms of code structure.
For example, should I have multiple small functions, like below?
df = pd.read_csv('file.csv')

def uppercase_column_A(dataframe, col): ...
def clean_column(dataframe, col): ...
def calculate_mean_here(dataframe, col): ...
def transform_values_there(dataframe, col): ...

(df
 .pipe(uppercase_column_A)
 .pipe(clean_column)
 .pipe(calculate_mean_here)
 .pipe(transform_values_there)
 # ...and so on
)
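Each of these would be a plain dataframe-in / dataframe-out function. As an illustration, the body of uppercase_column_A could look like this (the copy-then-return pattern and the default column 'A' are just my assumptions):

def uppercase_column_A(dataframe, col='A'):
    # return a modified copy so the original dataframe is left untouched
    out = dataframe.copy()
    out[col] = out[col].str.upper()
    return out

df = df.pipe(uppercase_column_A)   # same as uppercase_column_A(df)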
Or maybe just two big functions?
df = pd.read_csv('file.csv')

def ingest(df): ...                  # returns intermediate_df
def transform(intermediate_df): ...

(df
 .pipe(ingest)
 .pipe(transform)
)
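In that case each big function would internally bundle several of the smaller steps. A rough sketch of what I mean (column 'A', the value mapping, and the new column name A_code are all made up):

def ingest(df):
    # formatting / cleaning steps
    return (df
            .replace({'N/A': None})
            .assign(A=lambda d: d['A'].str.strip().str.upper()))

def transform(intermediate_df):
    # create the new column holding the transformed values (the final result)
    return intermediate_df.assign(A_code=lambda d: d['A'].map({'FOO': 1, 'BAR': 2}))

final_df = df.pipe(ingest).pipe(transform)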
I know the question is broad, but I think common practices matter as much as the code itself. In academia (my background), this does not matter much, as there is no 'production' side. So, in general, what would be a recommended industry way of building data pipelines in terms of code and structure?
Upvotes: 6
Views: 1782
Reputation: 1745
In my experience, using smaller functions is better for maintenance, since errors are easier to trace when each function is small and does one thing, with fewer levels of abstraction to dig through (which is exactly what two big functions will not give you).
My personal suggestions:
Add as many comments as you can: above functions, above variable names, below a function call, etc.
Be as descriptive as you can with names: calculate_mean_of_columns instead of calc_mean_cols, for example. Avoid, as much as you can, using abbreviations (even standard abbreviations in the DS community) like df or cols.
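For example, something along these lines; the mean calculation is only there to illustrate the naming style, not something from your pipeline:

import pandas as pd

def calculate_mean_of_columns(dataframe, column_names):
    """Add a column with the row-wise mean of the given columns."""
    # work on a copy so the caller's dataframe is left untouched
    result = dataframe.copy()
    result['mean_of_selected_columns'] = result[column_names].mean(axis=1)
    return result

measurements = pd.DataFrame({'height': [1.8, 1.6], 'weight': [80.0, 60.0]})
measurements = calculate_mean_of_columns(measurements, ['height', 'weight'])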
I'd structure my folders differently, honestly. My typical pipelines have had a consistent structure like this:
/project
/code
code_to_transform_dataframe.py
/data
datetimestamp_filename.csv
/output
datetimestamp_output.csv
You can use this as a framework for your own use case, but it's just the structure that has worked for me at a couple of different companies.
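To make that concrete, the script in /code would read the timestamped input from /data and write a timestamped result to /output. The file names and the timestamp format below are just one way of doing it:

# code/code_to_transform_dataframe.py (sketch)
from datetime import datetime
from pathlib import Path

import pandas as pd

project_root = Path(__file__).resolve().parent.parent
input_path = project_root / 'data' / '20240101_filename.csv'   # placeholder file name

dataframe = pd.read_csv(input_path)
# ... transformation steps go here ...

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_path = project_root / 'output' / f'{timestamp}_output.csv'
dataframe.to_csv(output_path, index=False)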
Upvotes: 2