the phoenix

Reputation: 739

How can I import all CSV files in a folder and group them by name?

I have around 1000 CSV files in a directory. They come in groups of four that share the same name prefix and differ only in the number at the end. For example:

ABC_0.csv 
ABC_1.csv 
ABC_2.csv 
ABC_3.csv  
...
DIJ_0.csv
DIJ_1.csv
DIJ_2.csv
DIJ_3.csv

I can import them all and put each file in its own data frame, so I end up with one list of data frames, LIST_DATA. But I would like to automatically group every 4 data frames by their name instead, so the ABC files end up in a list of data frames LIST_ABC and the DIJ files in a list LIST_DIJ. Or, if possible, maybe add the name as a column so I can filter the data frames on that name column?
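For the name-as-column idea, here is a minimal sketch. The temporary folder and sample files exist only so the snippet is self-contained; in practice, point `folder` at the real directory instead:

```python
import glob
import os
import tempfile

import pandas as pd

# Throwaway sample folder standing in for the real 'D:\folder'.
folder = tempfile.mkdtemp()
for prefix in ("ABC", "DIJ"):
    for i in range(4):
        pd.DataFrame({"x": [i]}).to_csv(
            os.path.join(folder, f"{prefix}_{i}.csv"), index=False
        )

frames = []
for filename in sorted(glob.glob(os.path.join(folder, "*.csv"))):
    df = pd.read_csv(filename)
    # Keep the prefix (e.g. "ABC" from "ABC_0.csv") as a regular column.
    df["name"] = os.path.basename(filename).split("_")[0]
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
abc_only = data[data["name"] == "ABC"]  # filter by the name column
```

With everything in one frame you can use any pandas filter or `groupby("name")` instead of juggling many small lists.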

This is what I have so far:

import glob
import os

import pandas as pd

# Raw string: in a plain string, '\f' would be a form-feed character
path1 = r'D:\folder'

all_files1 = glob.glob(path1 + "/*.csv")
all_files1.sort(key=os.path.getmtime)
List_DATA = []

for filename in all_files1:
    data = pd.read_csv(filename, index_col=None)
    List_DATA.append(data)

Upvotes: 0

Views: 1117

Answers (3)

OneTrickDragon

Reputation: 87

First off, I want to say that my solution is only robust if every group always contains exactly 4 files and none are missing. To make it more robust, filename parsing should be used.

As far as I understand the question, you want the data from every four CSV files with the same string prefix grouped together in a list, with all of those lists embedded in a bigger list covering the 1000 files. I would therefore sort by name rather than timestamp, then simply collect the frames in a sublist that gets appended to the main list and reset after every fourth item. This is my code:

import glob
import os

import pandas as pd

path1 = r'D:\folder'

all_files1 = glob.glob(os.path.join(path1, "*.csv"))
# Sort by name, not timestamp
all_files1.sort()
List_DATA = []
# For storing sublists of data frames
SubList_DATA = []

for idx, filename in enumerate(all_files1):
    data = pd.read_csv(filename, index_col=None)
    SubList_DATA.append(data)
    # Every 4th time, the sublist gets stored in the main list and reset.
    if idx % 4 == 3:
        List_DATA.append(SubList_DATA)
        SubList_DATA = []

EDIT: I just hacked together a version that makes use of the filenames and will work even if a group contains more or fewer files:

import glob
import os

import pandas as pd

path1 = r'D:\folder'

all_files1 = glob.glob(os.path.join(path1, "*.csv"))
# Sort by name, not timestamp
all_files1.sort()
List_DATA = []
# For storing sublists of data frames
SubList_DATA = []
# For keeping track of which sublist is being generated
currentprefix = ""

for idx, filename in enumerate(all_files1):
    # Parse the prefix string from the filename (not the full path)
    prefix, suffix = os.path.basename(filename).split("_")
    # Since the list is sorted, each prefix appears in one contiguous
    # block and never reappears afterwards
    if currentprefix != prefix:
        # Skip this at the very first step
        if idx != 0:
            # Add the sublist to the main list and reset it
            List_DATA.append(SubList_DATA)
            SubList_DATA = []
        # Remember the prefix of the current block of files
        currentprefix = prefix
    # Add the data to the sublist
    data = pd.read_csv(filename, index_col=None)
    SubList_DATA.append(data)

# Finally, add the last sublist
List_DATA.append(SubList_DATA)
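The loop above can also be expressed with `itertools.groupby`, which groups consecutive sorted filenames by their prefix. A sketch, where the throwaway temp folder merely stands in for `D:\folder` so the snippet runs standalone:

```python
import glob
import os
import tempfile
from itertools import groupby

import pandas as pd

# Sample folder with the expected prefix_number.csv naming.
folder = tempfile.mkdtemp()
for prefix in ("ABC", "DIJ"):
    for i in range(4):
        pd.DataFrame({"x": [i]}).to_csv(
            os.path.join(folder, f"{prefix}_{i}.csv"), index=False
        )

def prefix_of(path):
    # "ABC_0.csv" -> "ABC"
    return os.path.basename(path).split("_")[0]

files = sorted(glob.glob(os.path.join(folder, "*.csv")))
# groupby only merges *consecutive* items, hence the sort by name first.
List_DATA = [
    [pd.read_csv(f, index_col=None) for f in group]
    for _, group in groupby(files, key=prefix_of)
]
```

This keeps the "sorted names form contiguous blocks" assumption explicit and removes the manual reset bookkeeping.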

Upvotes: 1

Eitan Rosati

Reputation: 553

You can read all the CSV files this way (the `files` list will contain the paths of the CSV files):

import os

files = []
List_DATA = []
# Change into the csv folder (raw string avoids '\f' being a form feed)
os.chdir(r"D:\folder")
for i in os.listdir(r"D:\folder"):
    if i.endswith('.csv'):
        files.append(i)
files.sort(key=os.path.getmtime)

Then you can continue with your code:

for filename in files:
    data = pd.read_csv(filename, index_col=None)  
    List_DATA.append(data)

Upvotes: 0

Netim

Reputation: 135

If I understand your question correctly, you could try something like this:

First create a dictionary whose keys are the common prefixes (e.g. 'ABC') and whose values are lists containing the dataframe for each CSV that has the key as its prefix.

Then merge them:

import glob
import os

import pandas as pd

path1 = r'D:\folder'

all_files1 = glob.glob(os.path.join(path1, "*.csv"))
all_files1.sort(key=os.path.getmtime)

all_data = {}

for filename in all_files1:
    # Group by the prefix before the underscore: 'ABC_0.csv' -> 'ABC'
    name = os.path.basename(filename).split('_')[0]
    if name in all_data:
        all_data[name].append(pd.read_csv(filename, index_col=None))
    else:
        all_data[name] = [pd.read_csv(filename, index_col=None)]

list_data = []
for k, v in all_data.items():
    # axis=1 places the frames of a group side by side as extra columns
    list_data.append((k, pd.concat(v, axis=1)))
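The if/else bookkeeping can also be collapsed with `collections.defaultdict`. A self-contained sketch of that variant, with a throwaway temp folder standing in for the real path and `axis=0` stacking the rows of each group instead of placing them side by side:

```python
import glob
import os
import tempfile
from collections import defaultdict

import pandas as pd

# Sample folder so the sketch runs standalone; use your real path instead.
folder = tempfile.mkdtemp()
for prefix in ("ABC", "DIJ"):
    for i in range(4):
        pd.DataFrame({"x": [i]}).to_csv(
            os.path.join(folder, f"{prefix}_{i}.csv"), index=False
        )

# Missing keys start as an empty list, so no membership check is needed.
all_data = defaultdict(list)
for filename in glob.glob(os.path.join(folder, "*.csv")):
    name = os.path.basename(filename).split("_")[0]
    all_data[name].append(pd.read_csv(filename, index_col=None))

list_data = [
    (k, pd.concat(v, axis=0, ignore_index=True)) for k, v in all_data.items()
]
```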

Upvotes: 0