DirkLX

Reputation: 1337

Loading multiple Excel files into Pandas

I have about 5,400 Excel files in multiple (sub)folders and want to load them into a single DataFrame. The files have only 1 sheet each and can have 2,000+ rows apiece. The total number of rows is expected to be 2 million or more.

My computer has an SSD and 8 GB of memory and is pretty fast. Still, it takes hours to complete. Is there anything wrong with my code? I'd appreciate any tips.

%%time
import glob

import pandas as pd

files = glob.glob('asyncDatas/**/*.xlsx', recursive=True)

df = pd.DataFrame()

for num, fname in enumerate(files, start=1):
    print("File #{} | {}".format(num, fname))
    if len(fname) > 0:
        data = pd.read_excel(fname, 'Sheet0', index_col='Time', skiprows=3)
        df = df.append(data)

df.head()

My hunch is that the .append method takes too much time, since it likely re-allocates memory dynamically on every call. Would .concat() be the better approach?

Upvotes: 2

Views: 1374

Answers (2)

jpp

Reputation: 164673

Loading Excel data into Pandas is notoriously slow. Your first option is to use pd.concat once on a list of DataFrames, as described by jezrael.

Otherwise, you have a couple of options:

  1. Convert your Excel files to CSV efficiently outside of Python; for example, see this answer. Pandas reads CSV files much faster than Excel. You may see a further improvement if you convert to csv.gz (gzipped); see the sketch after this list.
  2. Consider categorical data to improve memory usage, chunking, or lazy operations via a library; see this answer for more details.
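As a minimal sketch of option 1, assuming the files were already exported as gzipped CSVs alongside the originals with the same header layout (the 'Sensor' column is a hypothetical low-cardinality column, used only to illustrate categoricals):

import glob

import pandas as pd

# Hypothetical layout: each .xlsx converted to a .csv.gz in the same tree;
# pandas decompresses gzip transparently based on the file extension.
csv_files = glob.glob('asyncDatas/**/*.csv.gz', recursive=True)

dfs = []
for fname in csv_files:
    data = pd.read_csv(
        fname,
        index_col='Time',
        skiprows=3,
        parse_dates=['Time'],          # parse the index as datetimes
        dtype={'Sensor': 'category'},  # categoricals cut memory for repeated strings
    )
    dfs.append(data)

df = pd.concat(dfs)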

If your workflow involves reading the data many times, I strongly advise you to convert from Excel to a format more Pandas-friendly, such as CSV, HDF5, or Pickle; a sketch follows.
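For instance, here is a rough convert-once, read-many sketch using Pickle (the cache path is hypothetical; HDF5 via to_hdf/read_hdf works the same way):

import glob
import os

import pandas as pd

CACHE = 'asyncDatas/combined.pkl'  # hypothetical cache location

if os.path.exists(CACHE):
    # Fast path on every later run: reload the combined frame in seconds.
    df = pd.read_pickle(CACHE)
else:
    # Slow path, paid only once: parse the Excel files, then persist.
    dfs = [pd.read_excel(f, 'Sheet0', index_col='Time', skiprows=3)
           for f in glob.glob('asyncDatas/**/*.xlsx', recursive=True)]
    df = pd.concat(dfs)
    df.to_pickle(CACHE)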

Upvotes: 0

jezrael

Reputation: 862641

First append the DataFrames to a list and call concat only once at the end; I'm still not sure whether 8GB of RAM is enough (but I hope so):

dfs = []

for num, fname in enumerate(files, start=1):
    print("File #{} | {}".format(num, fname))
    if len(fname) > 0:
        data = pd.read_excel(fname, 'Sheet0', index_col='Time', skiprows=3)
        dfs.append(data)  # appending to a plain list is cheap

df = pd.concat(dfs, ignore_index=True)  # one concatenation instead of one per file
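Note that ignore_index=True replaces the Time index set by index_col with a fresh RangeIndex; drop that argument if you want to keep the timestamps as the index.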

Upvotes: 2
