Reputation: 21
I'm looking for some help with dask dataframe results. I have a dask dataframe built from 144 csv files (one per folder). I'd like to obtain the maximum value from one column of each partition and return it along with the name of the folder it came from. I've been using map_partitions to obtain the result I'm looking for; however, there is no identifier associated with each partition's result, so it is difficult to use the result elsewhere. Any help would be greatly appreciated! Here is a sample of the code I'm using:
ddf = dd.read_csv(f'{dir}/*/name.csv')['column 1'] # dir contains 144 folders, each with name.csv
def get_max(ddf):
    return ddf.max(axis=0)
result = ddf.map_partitions(get_max).compute()
print(result)
result contains the values I want, indexed as 'column 1'. I would like the name of the folder (essentially the * folder) as the index. My end goal is a dataframe with index of folder or directory name and a column of the max values returned from the function.
Upvotes: 2
Views: 481
Reputation: 57301
I believe that you are looking for the include_path_column= keyword of the dask.dataframe.read_csv function.
You can see the documentation for this function here: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
Upvotes: 1