ggcoder

Reputation: 81

Pandas dataframe concat after reading a large number of txt files using glob never finishes

There are some 50k txt files which I am trying to read into a pandas DataFrame with the code below, but the process has already been running for 2 hours. Is there a better way to speed this up?

import glob

import pandas as pd

folder_path = '/drive/My Drive/dataset/train'
file_list = glob.glob(folder_path + "/*.txt")


def read_clean_df(file_name) -> pd.DataFrame:
    # Read the fixed-width file, drop row 19, then transpose so the
    # first row becomes the header
    df = pd.read_fwf(file_name, header=None)
    df = df.drop(df.index[19])
    df = df.T
    df.columns = df.iloc[0]   # promote the first row to column names
    df = df[1:]
    df.reset_index(drop=True, inplace=True)
    return df


train_df = read_clean_df(file_list[0])

# This loop re-concatenates the growing frame on every iteration
for file_name in file_list[1:]:
    df = read_clean_df(file_name)
    train_df = pd.concat([train_df, df], axis=0)

train_df.reset_index(drop=True, inplace=True)
print(train_df.head(30))

Upvotes: 0

Views: 39

Answers (2)

Tom McLean

Reputation: 6349

Do the concatenation once, at the end; each pd.concat inside the loop copies every row accumulated so far, so the running time grows quadratically with the number of files:

dfs = []
for file_name in file_list:
    df = read_clean_df(file_name)
    dfs.append(df)
train_df = pd.concat(dfs, axis=0)

If that is not fast enough, use the datatable package, which can do multithreaded IO when reading files:

import datatable as dt

# iread() reads all files (multithreaded); rbind() stacks the frames
df = dt.rbind(*dt.iread(file_list)).to_pandas()
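
If you would rather stay purely in pandas, parallelizing the per-file parsing is another option. Here is a minimal sketch using a process pool (assuming read_clean_df and file_list are defined at module level, as in the question):

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

# Parse files in parallel across CPU cores, then concatenate once
with ProcessPoolExecutor() as executor:
    dfs = list(executor.map(read_clean_df, file_list, chunksize=100))

train_df = pd.concat(dfs, ignore_index=True)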

Upvotes: 1

ignoring_gravity

Reputation: 10511

Yeah, repeatedly calling concat is slow; this is the reason DataFrame.append was deprecated.

Instead, do:

dfs = []
for file_name in file_list:
    df = read_clean_df(file_name)
    dfs.append(df)

train_df = pd.concat(dfs)
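
Equivalently (assuming the read_clean_df helper from the question), the frames can be built inline and the index reset in the same call:

train_df = pd.concat(
    (read_clean_df(f) for f in file_list),
    ignore_index=True,  # replaces the manual reset_index(drop=True)
)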

Upvotes: 1
