Kenzie

Reputation: 21

get directory names with map_partitions in dask dataframes

I'm looking for some help with dask dataframe results. I have a dask dataframe built from 144 CSV files, one partition per file. I'd like to obtain the maximum value from one column of each partition and return it along with the name of the folder it belongs to. I've been using map_partitions to get the maxima, but there is no identifier associated with each partition's result, so it is difficult to use the results elsewhere. Any help would be greatly appreciated! Here is a sample of the code I'm using:

ddf = dd.read_csv(f'{dir}/*/name.csv')['column 1']  # dir contains 144 folders, each with name.csv

def get_max(partition):
    return partition.max(axis=0)

result = ddf.map_partitions(get_max).compute()
print(result)

result contains the values I want, but every entry is indexed as 'column 1'. I would like the folder name (essentially the * in the glob) as the index instead. My end goal is a dataframe indexed by folder or directory name, with a column of the max values returned from the function.

Upvotes: 2

Views: 481

Answers (1)

MRocklin

Reputation: 57301

I believe that you are looking for the include_path_column= keyword for the dask.dataframe.read_csv function.

You can see the documentation for this function here: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

Upvotes: 1
