parastoo91

Reputation: 151

Applying the same operations on multiple .csv files in pandas

I have six .csv files. Their overall size is approximately 4 GB. I need to clean each one and do some data analysis tasks on each. These operations are the same for all the frames. This is my code for reading them.

#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")

Each time I run the kernel, I uncomment one of the files to be read. I am looking for a more elegant way to do this. I thought about a for loop: making a list of file names and then reading them one after the other. But I don't want to merge them together, so I think another approach must exist. I have been searching for it, but it seems all the questions lead to concatenating the files at the end.

Upvotes: 3

Views: 5856

Answers (5)

Khemraj Sawantmorye

Reputation: 1

# import libraries
import pandas as pd
import glob

# folder containing the csv files
project_folder = r"C:\file_path"

# collect all csv file paths in that folder
all_file_paths = glob.glob(project_folder + "/*.csv")

# read each file into a dataframe and collect them in a list
li = [pd.read_csv(filename, index_col=None, header=0) for filename in all_file_paths]

# concatenate the list into a single pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)

Upvotes: 0

kuropan

Reputation: 896

Use glob.glob to get all files with similar names:

import glob
import pandas as pd

files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f, index=False)  # index=False keeps the index from being written as an extra column

This will match yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv or even yellow_tripdata_*.csv to match all csv files that start with yellow_tripdata.
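For example, each pattern widens the match (a quick sketch, assuming the files sit in the current working directory):

import glob

glob.glob("yellow_tripdata_2018-0?.csv")  # '?' matches exactly one character: months 01-09
glob.glob("yellow_tripdata_2018-0*.csv")  # '*' matches any run of characters after '-0'
glob.glob("yellow_tripdata_*.csv")        # everything starting with 'yellow_tripdata_'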

Note that this also only loads one file at a time.

Upvotes: 2

Jab
Jab

Reputation: 27515

You could use a list to hold all of the dataframes:

number_of_files = 6
dfs = []

for file_num in range(1, number_of_files + 1):
    # f-strings need Python 3.6+; on Python <3.6 use .format() instead
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))

Then to get a certain dataframe use:

df1 = dfs[0]

Edit:

As you are trying to avoid loading all of these into memory at once, I'd resort to streaming them. Try changing the for loop to something like this:

import csv

for file_num in range(1, number_of_files + 1):
    # open in mode 'r' ('wb' would truncate the file); keep the handle open so the reader stays usable
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", 'r', newline='')
    dfs.append(csv.reader(f))

Then just use a for loop over dfs[n] or next(dfs[n]) to read each line into memory.
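For example (a quick sketch, assuming the readers were opened as above):

header = next(dfs[0])     # the first call returns the header row
first_row = next(dfs[0])  # later calls return one data row at a time

for row in dfs[1]:
    pass  # process each row without loading the whole file at once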

P.S.

You may need multi-threading to iterate through each one at the same time.
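A minimal sketch of that idea, using concurrent.futures (clean_file here is a hypothetical per-file routine you would write yourself):

import csv
from concurrent.futures import ThreadPoolExecutor

def clean_file(path):
    # hypothetical cleaning routine; replace the body with your own logic
    with open(path, 'r', newline='') as f:
        for row in csv.reader(f):
            pass  # clean/analyze each row here

paths = [f"yellow_tripdata_2018-0{n}.csv" for n in range(1, 7)]
with ThreadPoolExecutor() as pool:
    pool.map(clean_file, paths)  # each file is processed in its own worker thread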

Loading/Editing/Saving - using the csv module

OK, so I've done a lot of research: Python's csv module does load one line at a time, and how it behaves depends on the mode we open the file in (explained here).

If you don't want to use pandas (though chunking may honestly be the answer; if so, just work it into @seralouk's answer), then yes! The code below would, in my mind, be the best approach; we just need to change a couple of things.

import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    # notice I'm opening the original file as f in mode 'r' for read-only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(str(file_num).zfill(2)), 'r', newline='') as f, \
         open(filename.format(str(file_num).zfill(2) + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" (THIS IS PER-LINE, REMEMBER)
            # then save the cleaned row to the new file
            w.writerow(row)

Note:

You may want to consider using DictReader and/or DictWriter; I'd prefer them over the regular reader/writer as I find them easier to understand.
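For example, the same loop body with the Dict variants would look roughly like this (a sketch; the rows are keyed by whatever the header row contains, so "trip_distance" is an assumption, not the real schema):

import csv

with open("yellow_tripdata_2018-01.csv", 'r', newline='') as f, \
     open("yellow_tripdata_2018-01-new.csv", 'w', newline='') as nf:
    r = csv.DictReader(f)
    w = csv.DictWriter(nf, fieldnames=r.fieldnames)
    w.writeheader()
    for row in r:
        # each row is a dict keyed by column name, e.g. row["trip_distance"]
        w.writerow(row)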

Pandas Approach - using chunks

PLEASE READ this answer - if you'd like to steer away from my csv approach and stick with Pandas :) It literally seems like it's the same issue as yours and the answer is what you're asking for.

Basically, pandas allows you to load a file partially, in chunks, execute any alterations, and then write those chunks out to a new file. The code below is largely from that answer, but I did some more reading up myself in the docs.

number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    first_chunk = True
    for chunk in pd.read_csv(filename.format(str(file_num).zfill(2)), chunksize=chunksize):
        # do your data cleaning on the chunk here
        # append mode builds the new file chunk by chunk; write the header only once
        chunk.to_csv(filename.format(str(file_num).zfill(2) + "-new"),
                     mode='a', header=first_chunk, index=False)
        first_chunk = False

For more info on chunking the data, see here as well; it's good reading for anyone getting headaches over these memory issues.

Upvotes: 3

seralouk

Reputation: 33147

Use a for loop and .format() like this. I use this every single day:

number_of_files = 6

for i in range(1, number_of_files + 1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))

    # your code here; do the analysis, then the loop returns and reads the next dataframe

Upvotes: 3

Pallie

Reputation: 1109

Use os.listdir() to make a list of files you can loop through?

import os

samplefiles = os.listdir(filepath)
for filename in samplefiles:
    df = pd.read_csv(os.path.join(filepath, filename))

where filepath is the directory containing multiple csv's?
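Note that os.listdir returns everything in the directory, so you may want to keep only the .csv files (a small sketch, assuming filepath is defined as above):

import os

samplefiles = [f for f in os.listdir(filepath) if f.endswith(".csv")]
for filename in samplefiles:
    df = pd.read_csv(os.path.join(filepath, filename))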

Or a loop that changes the filename:

for i in range(1, 7):
    df = pd.read_csv("yellow_tripdata_2018-0%s.csv" % i)

Upvotes: 0
