Reza Mirhossein

Reputation: 41

reading multiple files into dask dataframe

I want to read multiple csv files into one single dask dataframe. For some reason, a portion of my original data gets lost (no clue why!). I am wondering what's the best method to read them all into dask. I used a for loop, though I'm not sure if it's correct.

for file in os.listdir(dds_glob):
    if file.endswith('issued_processed.txt'):
        ddf = dd.read_fwf(os.path.join(dds_glob,file),
                          colspecs=cols,
                          header=None,
                          dtype=object,
                          names=names)

or should I use something like this:

dfs = delayed(pd.read_fwf)('/data/input/*issued_processed.txt',
                           colspecs=cols,
                           header=None,
                           dtype=object,
                           names=names)  
ddf = dd.from_delayed(dfs)

Upvotes: 2

Views: 1124

Answers (1)

SultanOrazbayev

Reputation: 16561

There are at least two approaches:

  1. provide dask.dataframe with a list of files, so using your first snippet it would look like:
file_list = [
    os.path.join(dds_glob,file)
    for file in os.listdir(dds_glob) if file.endswith('issued_processed.txt')
]

# other options are skipped for convenience
ddf = dd.read_fwf(file_list)
  2. construct a dataframe from delayed objects, which using your second snippet would look like:
# other options are skipped, but can be included after the `file`
dfs = [delayed(pd.read_fwf)(file) for file in file_list] 
ddf = dd.from_delayed(dfs)

The first approach will solve the vast majority of use-cases, but for the remaining cases you might need to try the second approach or something more involved.
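Putting the second approach together, here is a minimal runnable sketch. The column specs, names, and throwaway input files are hypothetical stand-ins for the real data; it reads each file eagerly with plain pandas so the per-file step is easy to verify. Wrapping the same `pd.read_fwf` calls in `delayed(...)` and passing the resulting list to `dd.from_delayed` gives the lazy dask version, which concatenates the pieces much like the eager `pd.concat` below:

```python
import os
import tempfile

import pandas as pd

# Hypothetical fixed-width layout: two 5-character columns.
cols = [(0, 5), (5, 10)]
names = ["a", "b"]

# Throwaway input files standing in for the real ones under dds_glob.
tmpdir = tempfile.mkdtemp()
samples = {
    "part0_issued_processed.txt": "AAAAA11111\nBBBBB22222\n",
    "part1_issued_processed.txt": "CCCCC33333\n",
    "ignore_me.csv": "x,y\n1,2\n",  # should be filtered out by the suffix check
}
for fname, body in samples.items():
    with open(os.path.join(tmpdir, fname), "w") as f:
        f.write(body)

# Build the file list exactly as in the first snippet.
file_list = sorted(
    os.path.join(tmpdir, f)
    for f in os.listdir(tmpdir)
    if f.endswith("issued_processed.txt")
)

# Each element here is what one delayed(pd.read_fwf)(...) task would compute;
# dd.from_delayed would stitch them together lazily, as pd.concat does eagerly.
frames = [
    pd.read_fwf(f, colspecs=cols, header=None, dtype=object, names=names)
    for f in file_list
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The suffix filter matters: anything else sitting in the directory (logs, CSV exports) would otherwise end up in the concatenated frame with mismatched columns.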

Upvotes: 2
