Reputation: 313
I am importing 4000+ CSV files, all with the same columns: columns=['Date', 'Datapoint'].
Importing the CSVs into dask is pretty straightforward and is working fine for me:
import dask.dataframe as dd
file_paths = '/root/data/daily/'
df = dd.read_csv(file_paths + '*.csv',
                 delim_whitespace=True,
                 names=['Date', 'Datapoint'])
What I am trying to achieve is to name the 'Datapoint' column after the filename of each .csv. I know you can add a column containing the path using include_path_column=True. But I am wondering if there is a simple way to use that path name as the column name without having to run a separate step down the line.
Upvotes: 3
Views: 2438
Reputation: 3282
It is unclear to me what exactly you are trying to accomplish. If you are just trying to change the name of the column that the file paths are written to, you can set include_path_column='New Column Name'. If you are naming a column based on the path to each file, it seems like you'll get a rather sparse array once the data are concatenated, and I would argue that a groupby would probably work better.
Upvotes: 2
Reputation: 313
I was able to do this fairly straightforwardly using dask's delayed function:
import glob
import os

import pandas as pd
import dask.dataframe as dd
from dask import delayed

path = r'/root/data/daily'  # use your path
file_list = glob.glob(path + "/*.csv")

def read_and_label_csv(filename):
    # read each csv file into a pandas.DataFrame
    df_csv = pd.read_csv(filename,
                         delim_whitespace=True,
                         names=['Date', 'Close'])
    # column label derived from the file path (assumed here: base name without extension)
    path_2_column = os.path.splitext(os.path.basename(filename))[0]
    df_csv.rename(columns={'Close': path_2_column}, inplace=True)
    return df_csv

# create a list of delayed calls, each ready to return a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# using delayed, assemble the pandas.DataFrames into a dask.DataFrame
# (with no meta argument, dask infers the schema from the first partition)
ddf = dd.from_delayed(dfs)
Upvotes: 6