Matan

Reputation: 117

Create Individual DataFrames From a List of csv Files

I have a folder of csv files that I'd like to loop over to create individual DataFrames, each named after the file it comes from. So if I have file_1.csv, file_2.csv, file_3.csv ... I'd like a DataFrame created for each file, named after the file whose data it contains.

Here is what I've tried so far:

# get list of all files
all_files = os.listdir("./Data/")

# get list of only csv files
csv_files = list(filter(lambda f: f.endswith('.csv'), all_files))

# remove file extension to get name only
file_names = []
for i in csv_files:
    file = i[:-4]
    file_names.append(file)
    
# create DataFrames from each file named after the corresponding file
dfs = []
def make_files_dfs():
    for a,b in zip(file_names, csv_files):
        if a == b[:-4]:
            a = pd.read_csv(eval(f"'Data/{b}'"))
            dfs.append(a)

Error code:

ParserError: Error tokenizing data. C error: Expected 70 fields in line 7728, saw 74
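
The ParserError suggests that some rows in one of the files contain extra delimiters (74 fields where the header defines 70). A minimal sketch of one way to get past it while investigating, assuming pandas 1.3+ and a placeholder file name:

import pandas as pd

# Skip rows whose field count doesn't match the header (pandas 1.3+ API).
# 'Data/file_1.csv' is a placeholder, not necessarily the offending file.
df = pd.read_csv("Data/file_1.csv", on_bad_lines="skip")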

Update 1

new attempt 1:

path = "./Data/"
os.chdir(path)

csv_files = glob.glob("*.csv")

dataFrameDict = {}
def make_files_dfs():
    for a in csv_files:
        dataFrameDict[a[:-4] , pd.read_csv(a)]

Error code:

TypeError: unhashable type: 'DataFrame'

I feel like this needs a line to append the dicts to a list; I'll keep messing with it.
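
The TypeError above comes from dataFrameDict[a[:-4], pd.read_csv(a)], which indexes the dict with a (name, DataFrame) tuple, and a DataFrame is not hashable. A minimal sketch of the assignment form, reusing the names from the attempt above:

import glob
import pandas as pd

csv_files = glob.glob("*.csv")

dataFrameDict = {}
for a in csv_files:
    # Assign key = value instead of indexing with a tuple,
    # so the DataFrame never needs to be hashed.
    dataFrameDict[a[:-4]] = pd.read_csv(a)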

new attempt 2:

path = "./Data/"
os.chdir(path)

csv_files = glob.glob("*.csv")

for i in range(len(csv_files)):
    globals()[f"df_{i}"] = pd.read_csv(csv_files[i])

Error code:

ParserError: Error tokenizing data. C error: Expected 70 fields in line 7728, saw 74


Update 2

path = "./Data/"
os.chdir(path)

csv_files = glob.glob("*.csv")

csv_names = []
for i in csv_files:
    name = i[:-4]
    csv_names.append(name)
    
zip_object = zip(csv_names, csv_files)

df_collection = {}
for name, file in zip_object:
    df_collection[name] = pd.read_csv(file, low_memory=False)
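
With that dictionary built, each frame can be pulled out by the base name of the file it came from; a small usage sketch, assuming a file_1.csv as in the example at the top:

# Access one DataFrame by the name of its source file.
print(df_collection["file_1"].head())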

Upvotes: 0

Views: 1414

Answers (2)

Ammar Bin Aamir

Reputation: 115

I'm not sure why your code is so lengthy; this can be done as follows:

import pandas as pd

csv_list = ['file_1.csv', 'file_2.csv', 'file_3.csv']
for i in range(len(csv_list)):
    # create variables df_0, df_1, df_2 in the global namespace
    globals()[f"df_{i}"] = pd.read_csv(csv_list[i])

Output:

Three DataFrames will be created: df_0 holds the first file in the list, df_1 the second, and so on.
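
If the frames should be named after the files rather than an index, a minimal variant of the same idea is sketched below; the os.path.splitext call and the df_<name> naming convention are assumptions, not part of the answer above:

import os
import pandas as pd

csv_list = ['file_1.csv', 'file_2.csv', 'file_3.csv']
for f in csv_list:
    # e.g. file_1.csv -> variable df_file_1
    name = os.path.splitext(os.path.basename(f))[0]
    globals()[f"df_{name}"] = pd.read_csv(f)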

Upvotes: 1

Raman

Reputation: 56

Your code is a bit difficult to understand, and some of the functions are unnecessary. First of all, it is easier to change the working directory with os.chdir(path). Secondly, you can get rid of your lambda function and use glob.glob. Lastly, you cannot name a variable after the contents of another variable like that; your dfs list just holds anonymous DataFrame objects, which won't tell you which file each one came from. It is much better to use a dictionary. Overall, this is how your code could look:

import os
import glob
import pandas as pd

path = "the path to your data"
os.chdir(path)

# get list of only csv files
csv_files = glob.glob("*.csv")

# create a dictionary with the DF name as key and the DataFrame as value
dataFrameDictionary = {}
def make_files_dfs():
    for a in csv_files:
        # strip the .csv extension so the key is the bare file name
        dataFrameDictionary[a[:-4]] = pd.read_csv(a)
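
A short usage sketch of the above; the "file_1" key assumes a file_1.csv in the data folder:

make_files_dfs()
# Each DataFrame is retrieved by its file's base name.
print(dataFrameDictionary["file_1"].shape)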
            

Upvotes: 1
