Reputation: 676
I have a large list of file paths pointing to CSV files:
...SomeFileIDon'tNeed.csv
\001SMPL_1.csv
\001SMPL_2.csv
\001SMPL_3.csv
\001SMPL_4.csv
\001SMPL_5.csv
\001SMPL_6.csv
\002SMPL_1.csv
\002SMPL_2.csv
\002SMPL_3.csv
\002SMPL_4.csv
\002SMPL_5.csv
\002SMPL_6.csv
I want to split this list into sub-lists, like so:
[["\001SMPL_1.csv","\002SMPL_1.csv"],["\001SMPL_2.csv","\002SMPL_2.csv"],["\001SMPL_3.csv","\002SMPL_3.csv"],...]
This is because I need to combine all similar files into dataframes: all files ending in _1 will become one df, all files ending in _2 another df, and so on.
I wrote code that splits the original list into the sub-lists I want, but it's not very efficient. I'm looking for a better way.
# where f is the original list of file paths
dfs = []
for i in range(len(f)):
    temp = []
    for file in f:
        count = "_" + str(i)
        if count in file:
            temp.append(file)
    dfs.append(temp)
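For context, the combining step I need afterwards is roughly this (a minimal sketch assuming pandas, and that dfs already holds the grouped sub-lists of readable paths):

import pandas as pd

# One DataFrame per suffix group: read every CSV in a sub-list and stack them.
combined = [pd.concat(pd.read_csv(path) for path in group) for group in dfs]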
Upvotes: 2
Views: 135
Reputation: 181
import operator

f = ["\001SMPL_1.csv",
     "\001SMPL_2.csv",
     "\001SMPL_3.csv",
     "\001SMPL_4.csv",
     "\001SMPL_5.csv",
     "\001SMPL_6.csv",
     "\002SMPL_1.csv",
     "\002SMPL_2.csv",
     "\002SMPL_3.csv",
     "\002SMPL_4.csv",
     "\002SMPL_5.csv",
     "\002SMPL_6.csv"]

# Pair each file with its numeric suffix (the part between "_" and ".").
temp = []
for file in f:
    firstdelpos = file.find("_")
    lastdelpos = file.find(".")
    a = int(file[firstdelpos + 1:lastdelpos])
    temp.append([file, a])

# Sort by the suffix so that equal suffixes end up adjacent.
sorted_list = sorted(temp, key=operator.itemgetter(1))

# Walk the sorted pairs and collect each run of equal suffixes.
dfs = []
dupl = []
for i in range(len(sorted_list) - 1):
    if sorted_list[i][1] == sorted_list[i + 1][1]:
        if dupl == []:
            dupl.extend([sorted_list[i][0], sorted_list[i + 1][0]])
        else:
            dupl.append(sorted_list[i + 1][0])
    else:
        if dupl == []:
            dupl.append(sorted_list[i][0])
        dfs.append(dupl)
        dupl = []
    if i == len(sorted_list) - 2:
        if dupl == []:
            dupl.append(sorted_list[i + 1][0])
        dfs.append(dupl)
Upvotes: 0
Reputation: 13069
This could be done using a combination of regexp parsing to extract the index number, followed by sorting 2-tuples of (index, filename) and then using itertools.groupby with a key function that returns the index number (see the lambda function below).
import re
from itertools import groupby

def get_index(filename):
    # Extract the numeric suffix before ".csv", or None if there isn't one.
    match = re.search(r'_(\d+)\.csv$', filename)
    if match:
        return int(match.group(1))
    else:
        return None

def get_filenames_with_index(index_file):
    # Yield an (index, filename) tuple for every line with a numeric suffix.
    with open(index_file) as f:
        for line in f:
            filename = line.rstrip('\n')
            index = get_index(filename)
            if index is not None:
                yield (index, filename)

index_file = 'filelist'
dfs = []
for i, files_with_indexes in groupby(sorted(get_filenames_with_index(index_file)),
                                     lambda t: t[0]):
    dfs.append([file for index, file in files_with_indexes])
print(dfs)
This gives (when filelist contains the list shown in the question):
[['\\001SMPL_1.csv', '\\002SMPL_1.csv'], ['\\001SMPL_2.csv', '\\002SMPL_2.csv'], ['\\001SMPL_3.csv', '\\002SMPL_3.csv'], ['\\001SMPL_4.csv', '\\002SMPL_4.csv'], ['\\001SMPL_5.csv', '\\002SMPL_5.csv'], ['\\001SMPL_6.csv', '\\002SMPL_6.csv']]
(Note: it isn't quite clear here from the question whether you have literal backslashes or escape sequences in your filenames, but whatever you do have will be preserved in the output.)
Upvotes: 0
Reputation: 2124
How about holding them in a dictionary, like so? It will deal with the "_xx" section no matter how long the ID number gets.
paths = ["\001SMPL_1.csv","\001SMPL_2.csv","\001SMPL_3.csv","\001SMPL_4.csv","\001SMPL_4.csv","\001SMPL_4.csv","001SMPL_5.csv"]
split_paths = {}
#iterate paths
for path in paths:
#get key without .csv
loc = path.find("_")
key = path[loc:].replace(".csv","")
#add to dictionary
if key in split_paths.keys():
split_paths[key].append(path)
else:
split_paths[key] = [path]
print(split_paths)
output:
{'_1': ['\x01SMPL_1.csv'], '_2': ['\x01SMPL_2.csv'], '_3': ['\x01SMPL_3.csv'], '_4': ['\x01SMPL_4.csv', '\x01SMPL_4.csv', '\x01SMPL_4.csv'], '_5': ['001SMPL_5.csv']}
Then, if you really need it as a list:
[v for k,v in split_paths.items()]
output:
[['\x01SMPL_1.csv'],
['\x01SMPL_2.csv'],
['\x01SMPL_3.csv'],
['\x01SMPL_4.csv', '\x01SMPL_4.csv', '\x01SMPL_4.csv'],
['001SMPL_5.csv']]
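As a side note, collections.defaultdict from the standard library expresses the same grouping a little more compactly (a sketch of the equivalent approach, same keys and values as above):

from collections import defaultdict

split_paths = defaultdict(list)
for path in paths:
    # Same key as above: everything from "_" onward, minus the extension.
    key = path[path.find("_"):].replace(".csv", "")
    split_paths[key].append(path)

# list(split_paths.values()) then gives the list-of-lists directly.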
Upvotes: 1