Kenzie

Reputation: 21

get directory names with map_partitions in dask dataframes

I'm looking for some help with dask dataframe results. I have a dask dataframe built from 144 CSV files, one partition per file. I'd like to obtain the maximum value from one column of each partition and return it along with the name of the folder it belongs to. I've been using map_partitions to get the maxima, but there is no identifier associated with each partition's result, so it is difficult to use the results elsewhere. Any help would be greatly appreciated! Here is a sample of the code I'm using:

ddf = dd.read_csv(f'{dir}/*/name.csv')['column 1']  # dir contains 144 folders, each with name.csv

def get_max(partition):
    return partition.max(axis=0)

result = ddf.map_partitions(get_max).compute()
print(result)

result contains the values I want, but every entry is indexed as 'column 1'. I would like the folder name (essentially the * in the glob) as the index instead. My end goal is a dataframe indexed by folder or directory name, with a column of the max values returned from the function.

Upvotes: 2

Views: 481

Answers (1)

MRocklin

Reputation: 57301

I believe that you are looking for the include_path_column= keyword for the dask.dataframe.read_csv function.

You can see the documentation for this function here: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

Upvotes: 1
