Reputation: 41
I want to read multiple CSV files into one single dask dataframe. For some reason, a portion of my original data gets lost along the way (no clue why!). I am wondering what the best method is to read them all into dask. I used a for loop, though I am not sure if it is correct:
    for file in os.listdir(dds_glob):
        if file.endswith('issued_processed.txt'):
            ddf = dd.read_fwf(os.path.join(dds_glob, file),
                              colspecs=cols, header=None,
                              dtype=object, names=names)
Or should I use something like this:
    dfs = delayed(pd.read_fwf)('/data/input/*issued_processed.txt',
                               colspecs=cols, header=None,
                               dtype=object, names=names)
    ddf = dd.from_delayed(dfs)
Upvotes: 2
Views: 1124
Reputation: 16561
There are at least two approaches:
1. dask.dataframe with a list of files, so using your first snippet it would look like:

    file_list = [
        os.path.join(dds_glob, file)
        for file in os.listdir(dds_glob)
        if file.endswith('issued_processed.txt')
    ]

    # other options are skipped for convenience
    ddf = dd.read_fwf(file_list)
2. delayed objects, which using your second snippet would look like:

    # other options are skipped, but can be included after the `file`
    dfs = [delayed(pd.read_fwf)(file) for file in file_list]
    ddf = dd.from_delayed(dfs)
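For completeness, here is the second approach with the skipped options filled back in (reusing the cols, names, and file_list variables from above); a minimal sketch:

    import pandas as pd
    from dask import delayed
    import dask.dataframe as dd

    # one delayed pandas read per file; dask stitches them into a
    # single dataframe with one partition per file
    dfs = [
        delayed(pd.read_fwf)(file, colspecs=cols, header=None,
                             dtype=object, names=names)
        for file in file_list
    ]
    ddf = dd.from_delayed(dfs)

Note that dask's file readers also accept glob strings, so the first approach can equally be written as dd.read_fwf(os.path.join(dds_glob, '*issued_processed.txt'), ...) without building file_list by hand.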
The first approach is something that will solve about 82% of the use cases, but for the rest you might need to try the second approach or something more involved, as sketched below.
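As one hypothetical example of "something more involved" (the safe_read helper below is an illustration, not part of dask): if some files are occasionally unreadable, each read can be wrapped so that a bad file yields an empty frame instead of failing the whole computation:

    import pandas as pd
    from dask import delayed
    import dask.dataframe as dd

    def safe_read(path):
        # try to read one fixed-width file; fall back to an empty
        # frame with the same columns so a single bad file does not
        # break the whole dataframe
        try:
            return pd.read_fwf(path, colspecs=cols, header=None,
                               dtype=object, names=names)
        except (OSError, ValueError):
            return pd.DataFrame(columns=names, dtype=object)

    # passing meta avoids dask computing a partition up front just
    # to infer column names and dtypes
    meta = pd.DataFrame(columns=names, dtype=object)
    dfs = [delayed(safe_read)(f) for f in file_list]
    ddf = dd.from_delayed(dfs, meta=meta)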
Upvotes: 2