Reputation: 676

Split list of filepaths into sub lists based on sub string in filepath

I have a large list of file paths pointing to CSV files.

...SomeFileIDon'tNeed.csv
\001SMPL_1.csv
\001SMPL_2.csv
\001SMPL_3.csv
\001SMPL_4.csv
\001SMPL_5.csv
\001SMPL_6.csv
\002SMPL_1.csv
\002SMPL_2.csv
\002SMPL_3.csv
\002SMPL_4.csv
\002SMPL_5.csv
\002SMPL_6.csv

I want to split this list into sub-lists like so

[["\001SMPL_1.csv","\002SMPL_1.csv"],["\001SMPL_2.csv","\002SMPL_2.csv"],["\001SMPL_3.csv","\002SMPL_3.csv"],...]

This is because I need to combine all similar files into dataframes: all files ending in _1 will become one df, all files ending in _2 will become another df, and so on.

I wrote code that splits the original list into the sub-lists I want, but it's not very efficient. I'm looking for a better way.

#where f is the original list of file paths
dfs=[]
for i in range(len(f)):
    temp=[]
    for file in f:
        count="_"+str(i)
        if count in file:
            temp.append(file)
    dfs.append(temp)

Upvotes: 2

Answers (3)

Chaitu Boggavarapu

Reputation: 181

import operator
# raw strings keep the backslash literal; a plain "\001" is an octal escape
f = [r"\001SMPL_1.csv",
     r"\001SMPL_2.csv",
     r"\001SMPL_3.csv",
     r"\001SMPL_4.csv",
     r"\001SMPL_5.csv",
     r"\001SMPL_6.csv",
     r"\002SMPL_1.csv",
     r"\002SMPL_2.csv",
     r"\002SMPL_3.csv",
     r"\002SMPL_4.csv",
     r"\002SMPL_5.csv",
     r"\002SMPL_6.csv"]

# pair each path with the integer between the first "_" and the first "."
temp = []
for file in f:
    firstdelpos = file.find("_")
    lastdelpos = file.find(".")
    a = int(file[firstdelpos+1:lastdelpos])
    temp.append([file, a])
sorted_list = sorted(temp, key=operator.itemgetter(1))

# walk the sorted list and close a sub-list whenever the index changes
dfs = []
dupl = []
for i in range(len(sorted_list)):
    dupl.append(sorted_list[i][0])
    is_last = i == len(sorted_list) - 1
    if is_last or sorted_list[i][1] != sorted_list[i+1][1]:
        dfs.append(dupl)
        dupl = []

Upvotes: 0

alani

Reputation: 13069

This can be done by using a regular expression to extract the index number, sorting 2-tuples of (index, filename), and then grouping them with itertools.groupby, using a key function that returns the index number (see the lambda function below).

import re
from itertools import groupby

def get_index(filename):
    match = re.search(r'_(\d+)\.csv$', filename)
    if match:
        return int(match.group(1))
    else:
        return None


def get_filenames_with_index(index_file):
    with open(index_file) as f:
        for line in f:
            filename = line.rstrip('\n')
            index = get_index(filename)
            if index is not None:
                yield (index, filename)


index_file = 'filelist'

dfs = []
for i, files_with_indexes in groupby(sorted(get_filenames_with_index(index_file)),
                                     key=lambda t: t[0]):
    dfs.append([file for index, file in files_with_indexes])

print(dfs)

This gives (when filelist contains the list shown in the question):

[['\\001SMPL_1.csv', '\\002SMPL_1.csv'], ['\\001SMPL_2.csv', '\\002SMPL_2.csv'], ['\\001SMPL_3.csv', '\\002SMPL_3.csv'], ['\\001SMPL_4.csv', '\\002SMPL_4.csv'], ['\\001SMPL_5.csv', '\\002SMPL_5.csv'], ['\\001SMPL_6.csv', '\\002SMPL_6.csv']]

(Note: it isn't quite clear here from the question whether you have literal backslashes or escape sequences in your filenames, but whatever you do have will be preserved in the output.)
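To illustrate the distinction in the note above, here is a minimal sketch of how Python treats the two spellings. The variable names are only for illustration. Lines read from a file are never escape-processed, so they behave like the raw-string case.

```python
# In a normal string literal, "\001" is an octal escape for one control character.
escaped = "\001SMPL_1.csv"

# A raw string keeps the backslash and the three digits as four separate characters.
raw = r"\001SMPL_1.csv"

print(len(escaped), repr(escaped))
print(len(raw), repr(raw))

# Both spellings still end in "_1.csv", so index extraction works either way.
print(escaped.endswith("_1.csv"), raw.endswith("_1.csv"))
```

If the paths are typed into source code rather than read from a file, a raw string (or a doubled backslash) is probably what is wanted.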

Upvotes: 0

Lewis Morris

Reputation: 2124

How about collecting them in a dictionary, like so? It will handle the "_xx" section no matter how long the ID number gets.

paths = ["\001SMPL_1.csv","\001SMPL_2.csv","\001SMPL_3.csv","\001SMPL_4.csv","\001SMPL_4.csv","\001SMPL_4.csv","001SMPL_5.csv"]


split_paths = {}

# iterate over the paths
for path in paths:
    # key is everything from the first "_" onwards, without ".csv"
    loc = path.find("_")
    key = path[loc:].replace(".csv", "")
    # add the path to the list for its key, creating the list if needed
    split_paths.setdefault(key, []).append(path)

print(split_paths)

output:

{'_1': ['\x01SMPL_1.csv'], '_2': ['\x01SMPL_2.csv'], '_3': ['\x01SMPL_3.csv'], '_4': ['\x01SMPL_4.csv', '\x01SMPL_4.csv', '\x01SMPL_4.csv'], '_5': ['001SMPL_5.csv']}

Then, if you really need it as a list:

list(split_paths.values())

output:

[['\x01SMPL_1.csv'],
 ['\x01SMPL_2.csv'],
 ['\x01SMPL_3.csv'],
 ['\x01SMPL_4.csv', '\x01SMPL_4.csv', '\x01SMPL_4.csv'],
 ['001SMPL_5.csv']]
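Note that the dictionary keeps its groups in the order the keys were first seen, and the keys are strings. If the sub-lists need to come out in numeric order (so that 10 follows 9 rather than 1), one possible variation is to key on the integer instead. This is only a sketch: the sample paths and the regular expression are assumptions about what the real file names look like.

```python
import re
from collections import defaultdict

# Hypothetical sample data; raw strings keep the backslashes literal.
paths = [r"\001SMPL_10.csv", r"\001SMPL_2.csv", r"\001SMPL_1.csv",
         r"\002SMPL_10.csv", r"\002SMPL_2.csv", r"\002SMPL_1.csv",
         r"\SomeFileIDontNeed.csv"]

groups = defaultdict(list)
for path in paths:
    # Capture the digits in a trailing "_<digits>.csv"; skip names without one.
    match = re.search(r"_(\d+)\.csv$", path)
    if match:
        groups[int(match.group(1))].append(path)

# Sort by the integer ID so the groups come out as 1, 2, 10.
split_list = [groups[key] for key in sorted(groups)]
print(split_list)
```

Files that do not match the pattern are simply left out, which also takes care of entries such as the unwanted file in the question.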

Upvotes: 1
