PV8

Reputation: 6260

Loading multiple csv files of a folder into one dataframe

I have multiple CSV files saved in one folder, all with the same column layout, and I want to load them into Python as a single pandas DataFrame.

The question is really similar to this thread.

I am using the following code:

import glob
import pandas as pd

salesdata = pd.DataFrame()
for f in glob.glob(r"TransactionData\Promorelevant\*.csv"):
    appenddata = pd.read_csv(f, header=None, sep=";")
    # each append copies the accumulated data into a new DataFrame
    salesdata = salesdata.append(appenddata, ignore_index=True)

Is there a better solution for this, maybe with another package?

This is taking too much time.

Thanks

Upvotes: 16

Views: 30056

Answers (4)

PascalVKooten

Reputation: 21433

Maybe using bash will be faster:

head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

Or, if running from within a Jupyter notebook:

!head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
!tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

The idea being that you won't need to parse anything.

The first command copies the header line from one of the files; skip it if your files have no header. tail then skips the header line of every file and appends the remaining rows to merged.csv.
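If shell tools aren't an option, roughly the same no-parsing merge can be sketched in plain Python (a minimal sketch, assuming every file has a single header line and the same layout):

import glob

files = sorted(glob.glob("TransactionData/Promorelevant/*.csv"))
with open("merged.csv", "w") as out:
    with open(files[0]) as first:
        out.write(first.readline())   # keep the header from the first file only
    for path in files:
        with open(path) as f:
            f.readline()              # skip this file's header line
            out.writelines(f)         # copy the remaining lines verbatim, no parsing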

Appending in Python is probably more expensive.

Of course, make sure the result still parses correctly with pandas:

pd.read_csv("merged.csv")
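One caveat for the files in the question: they are read with sep=";" and header=None, so the same options would apply when checking the merged file (and the head step could be skipped if there really is no header row), e.g.:

pd.read_csv("merged.csv", sep=";", header=None)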

Curious about your benchmark.

Upvotes: 2

PV8

Reputation: 6260

I checked all of these approaches except the bash one, timing them with the time function (only one run each; also note that the files are on a shared drive).
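For reference, a sketch of how one such run can be timed (time.perf_counter here is an assumption, not necessarily the timing used):

import glob
import time
import pandas as pd

files = glob.glob(r"TransactionData\Promorelevant\*.csv")

start = time.perf_counter()
salesdata = pd.concat(
    [pd.read_csv(f, header=None, sep=";") for f in files],
    ignore_index=True,
)
print("list comprehension + concat:", time.perf_counter() - start)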

Here are the results:

My approach: 1220.49

List comprehension + concat: 1135.53

concat + map + join: 1116.31

I will go for list comprehension + concat, which saves me a few minutes and which I feel quite familiar with.

Thanks for your ideas.

Upvotes: 0

Muhammad Haseeb

Reputation: 644

With help from the link to the actual answer:

This seems to be the best one-liner:

import glob, os
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "*.csv"))))
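If the files match the ones in the question (semicolon-separated, no header row), those options can be passed through map with functools.partial; a small sketch, with the path pattern assumed:

from functools import partial
import glob, os
import pandas as pd

read = partial(pd.read_csv, sep=";", header=None)
files = glob.glob(os.path.join("TransactionData", "Promorelevant", "*.csv"))
df = pd.concat(map(read, files), ignore_index=True)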

Upvotes: 6

jezrael

Reputation: 862441

I suggest using a list comprehension with concat:

import glob
import pandas as pd

files = glob.glob(r"TransactionData\Promorelevant\*.csv")
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]

salesdata = pd.concat(dfs, ignore_index=True)

Upvotes: 17
