Reputation: 7703
I am doing data cleaning in Python. I have the workflow below to call all my functions:
if __name__ == "__main__":
    data_file, hash_file, cols = read_file()
    survey_data, cleaned_hash_file = format_files(data_file, hash_file, cols)
    survey_data, cleaned_hash_file = rename_columns(survey_data, cleaned_hash_file)
    survey_data, cleaned_hash_file = data_transformation_stage_1(survey_data, cleaned_hash_file)
    observation, survey_data, cleaned_hash_file = data_transformation_stage_2(survey_data, cleaned_hash_file)
    observation, survey_data, cleaned_hash_file = data_transformation_stage_3(observation, survey_data, cleaned_hash_file)
    observation, survey_data, cleaned_hash_file = observation_date_fill(observation, survey_data, cleaned_hash_file)
    write_file(observation, survey_data, cleaned_hash_file)
So the output (the returned variables) of each function is used as the input to the next function. All the functions return dataframes, so observation, survey_data, cleaned_hash_file, data_file, hash_file, and cols are all dataframes passed from one function to the next.
Is there a better, more elegant way to write this?
Upvotes: 2
Views: 857
Reputation: 3103
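You can extend Python's map to apply several functions in sequence. Note that map calls each function on one item at a time, so this fits only steps that take and return a single dataframe. It goes like this: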
def map_many(iterable, function, *other):
    # Apply `function` to every item, then recurse with the remaining functions.
    if other:
        return map_many(map(function, iterable), *other)
    return map(function, iterable)

inputs = read_file()
dfs_1 = map_many(inputs, format_files, rename_columns, data_transformation_stage_1, data_transformation_stage_2)
dfs_2 = map_many(dfs_1, data_transformation_stage_3, observation_date_fill)
write_file(*dfs_2)
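To see what map_many does with each item, here is a small self-contained sketch. The steps strip_text and to_upper are hypothetical one-argument stand-ins, not the cleaning functions from the question:

```python
def map_many(iterable, function, *other):
    """Apply function, then each function in other, to every item of iterable."""
    if other:
        return map_many(map(function, iterable), *other)
    return map(function, iterable)


# Hypothetical one-argument steps standing in for the real cleaning functions.
def strip_text(value):
    return value.strip()


def to_upper(value):
    return value.upper()


# map is lazy in Python 3, so wrap the result in list() to evaluate it.
cleaned = list(map_many(["  a ", " b"], strip_text, to_upper))
print(cleaned)
```

Each item passes through both steps independently of the others.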
Upvotes: 1
Reputation: 444
Try iterating over your functions. This assumes that each function returns its values in the same order as the next function's parameters:
funcs = [read_file, format_files, rename_columns, data_transformation_stage_1, data_transformation_stage_2, data_transformation_stage_3, observation_date_fill, write_file]
output = []
for func in funcs:
    output = func(*output)
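The same loop can also be written with functools.reduce. This is only a sketch: run_pipeline and the steps load, double_all, and total are hypothetical names used for illustration, and it assumes every step returns a tuple that matches the next step's parameters:

```python
from functools import reduce


def run_pipeline(funcs, initial=()):
    """Call each function with the unpacked return value of the previous one."""
    return reduce(lambda args, func: func(*args), funcs, initial)


# Hypothetical stand-ins; each returns a tuple matching the next step's parameters.
def load():
    return [1, 2, 3], [10, 20]


def double_all(first, second):
    return [x * 2 for x in first], [x * 2 for x in second]


def total(first, second):
    # Wrap the single value in a tuple so a later step could still unpack it.
    return (sum(first) + sum(second),)


result = run_pipeline([load, double_all, total])
print(result)
```

A step that returns a single bare value (rather than a tuple) would break the unpacking in the following step, so single results need wrapping as in total above.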
Upvotes: 6
Reputation: 622
Create this class:
class ProcessingChain:
    def __init__(self, *callables):
        self.operations = callables

    def process(self, *args):
        for operation in self.operations:
            args = operation(*args)
        return args
And use it like this:
processing = ProcessingChain(format_files, rename_columns, data_transformation_stage_1, data_transformation_stage_2, data_transformation_stage_3, observation_date_fill)
data_file, hash_file, cols = read_file()
observation, survey_data, cleaned_hash_file = processing.process(data_file, hash_file, cols)
write_file(observation, survey_data, cleaned_hash_file)
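The class can be tried out without any real data. In this sketch, drop_blanks and add_count are hypothetical stand-ins for the cleaning steps; add_count returns one more value than it receives, much like data_transformation_stage_2 in the question:

```python
class ProcessingChain:
    def __init__(self, *callables):
        self.operations = callables

    def process(self, *args):
        # Feed each operation the unpacked result of the previous one.
        for operation in self.operations:
            args = operation(*args)
        return args


# Hypothetical stand-ins for the cleaning steps.
def drop_blanks(rows, labels):
    return [r for r in rows if r], labels


def add_count(rows, labels):
    # Returns three values from two inputs; the chain passes on whatever comes back.
    return len(rows), rows, labels


chain = ProcessingChain(drop_blanks, add_count)
count, rows, labels = chain.process(["a", "", "b"], ["x"])
print(count, rows, labels)
```

Because the chain only forwards return values, the number of values may change between steps as long as each step's output matches the next step's parameters.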
Upvotes: 1