Reputation: 739
I have around 1000 csv files in a directory. Every 4 of them share the same name, differing only in the number at the end. For example:
ABC_0.csv
ABC_1.csv
ABC_2.csv
ABC_3.csv
...
DIJ_0.csv
DIJ_1.csv
DIJ_2.csv
DIJ_3.csv
I can import them all and put each file in its own data frame, giving me one big list of data frames LIST_DATA. But I would like to group every 4 data frames based on their names automatically instead, so I would have the ABC files in a list of data frames LIST_ABC and the DIJ files in a list LIST_DIJ. Or, if possible, maybe add their names as a column so I can filter the data frames by that name column?
This is what I have so far:
import os
import glob
import pandas as pd

path1 = r'D:\folder'   # raw string so '\f' is not treated as an escape character
all_files1 = glob.glob(path1 + "/*.csv")
all_files1.sort(key=os.path.getmtime)

List_DATA = []
for filename in all_files1:
    data = pd.read_csv(filename, index_col=None)
    List_DATA.append(data)
Upvotes: 0
Views: 1117
Reputation: 87
First off I want to say that my solution is only robust given that there will always be exactly 4 files that belong together and that none of them are missing. If you want to make it more robust, filename parsing should be used.
As far as I understand the question, you want the data from four csv files with the same string prefix grouped together in a list, and those lists embedded in a bigger list covering all 1000 files. Therefore I would not sort by timestamp but by name, and then simply store the data frames in sublists that get added to the bigger list and reset after every four items. This is my code:
import os
import glob
import pandas as pd

path1 = r'D:\folder'   # raw string so '\f' is not treated as an escape character
all_files1 = glob.glob(os.path.join(path1, "*.csv"))
# Sort by name, not timestamp
all_files1.sort()

List_DATA = []
# For storing sublists of data frames
SubList_DATA = []
for idx, filename in enumerate(all_files1):
    data = pd.read_csv(filename, index_col=None)
    SubList_DATA.append(data)
    # Every 4th file the sublist gets stored in the main list and reset
    if idx % 4 == 3:
        List_DATA.append(SubList_DATA)
        SubList_DATA = []
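Assuming the directory really does contain complete groups of four, a quick check of the result could look like this (index 0 only holds the ABC group if the ABC_* files happen to sort first):
# Roughly 1000 files / 4 per group
print(len(List_DATA))
# Each entry should hold exactly 4 data frames
print(len(List_DATA[0]))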
EDIT: I just hacked together a version that makes use of the filenames and will work even if a group contains more or fewer than four files:
import os
import glob
import pandas as pd

path1 = r'D:\folder'   # raw string so '\f' is not treated as an escape character
all_files1 = glob.glob(os.path.join(path1, "*.csv"))
# Sort by name, not timestamp
all_files1.sort()

List_DATA = []
# For storing sublists of data frames
SubList_DATA = []
# For keeping track of which sublist is currently being built
currentprefix = ""
for idx, filename in enumerate(all_files1):
    # Parse the prefix string from the filename, e.g. 'ABC' from 'ABC_0.csv'
    prefix, suffix = os.path.basename(filename).rsplit("_", 1)
    # Since the list is sorted, a prefix changes only once and never reappears
    if currentprefix != prefix:
        # Skip this at the first step
        if idx != 0:
            # Add the sublist to the main list and reset it
            List_DATA.append(SubList_DATA)
            SubList_DATA = []
        # Set the current prefix to the block of files currently being read
        currentprefix = prefix
    # Add data to the sublist
    data = pd.read_csv(filename, index_col=None)
    SubList_DATA.append(data)
# Finally add the last sublist
List_DATA.append(SubList_DATA)
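If you would rather address each group by its prefix (the LIST_ABC / LIST_DIJ from the question) instead of by its position in List_DATA, the same idea can be written with a dictionary keyed by the prefix. This is just a sketch that assumes every filename follows the PREFIX_NUMBER.csv pattern:
from collections import defaultdict
import glob
import os
import pandas as pd

path1 = r'D:\folder'
groups = defaultdict(list)   # maps prefix -> list of data frames
for filename in sorted(glob.glob(os.path.join(path1, "*.csv"))):
    prefix = os.path.basename(filename).rsplit("_", 1)[0]
    groups[prefix].append(pd.read_csv(filename, index_col=None))

# groups['ABC'] then plays the role of LIST_ABC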
Upvotes: 1
Reputation: 553
You can read all the csv files this way (the files list will contain the paths of all the csv files):
import os
import pandas as pd

files = []
List_DATA = []
# Add the path of the csv folder (raw string so the backslash is not treated as an escape)
os.chdir(r"D:\folder")
for i in os.listdir(r"D:\folder"):
    if i.endswith('.csv'):
        files.append(i)
files.sort(key=os.path.getmtime)
Then you can continue with your code:
for filename in files:
    data = pd.read_csv(filename, index_col=None)
    List_DATA.append(data)
Upvotes: 0
Reputation: 135
If I understand your question correctly, you could try something like this:
First create a dictionary whose keys are the common prefixes (e.g. 'ABC') and whose values are lists containing the data frame for each csv that has that key as a prefix.
Then merge them:
import os
import glob
import pandas as pd

path1 = r'D:\folder'   # raw string so '\f' is not treated as an escape character
all_files1 = glob.glob(path1 + "/*.csv")
all_files1.sort(key=os.path.getmtime)

all_data = {}
for filename in all_files1:
    # Take the common prefix, e.g. 'ABC' from 'ABC_0.csv'
    name = os.path.basename(filename).split('_')[0]
    if name in all_data:
        all_data[name].append(pd.read_csv(filename, index_col=None))
    else:
        all_data[name] = [pd.read_csv(filename, index_col=None)]

list_data = []
for k, v in all_data.items():
    # Concatenate each group's frames column-wise (use axis=0 to stack rows instead)
    list_data.append((k, pd.concat(v, axis=1)))
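If you instead want the prefix as a column so you can filter one big data frame (the second option mentioned in the question), a minimal sketch building on the same all_data dictionary could be the following (the column name 'name' is just an arbitrary choice):
frames = []
for k, v in all_data.items():
    for df in v:
        df = df.copy()
        df['name'] = k   # e.g. 'ABC' or 'DIJ'
        frames.append(df)
big_df = pd.concat(frames, ignore_index=True)

# Filtering by the name column
abc_only = big_df[big_df['name'] == 'ABC']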
Upvotes: 0