python regex matching file names

Question

I have the file names below and would like to group them:

Group1:

C7_S6_L001.sorted.bam
C7_S6_L002.sorted.bam
C7_S6_L003.sorted.bam
C7_S6_L004.sorted.bam

Group2:

CL3_S8_L001.sorted.bam
CL3_S8_L002.sorted.bam
CL3_S8_L003.sorted.bam
CL3_S8_L004.sorted.bam

Group3:

CL5-B1_S4_L001.sorted.bam
CL5-B1_S4_L002.sorted.bam
CL5-B1_S4_L003.sorted.bam
CL5-B1_S4_L004.sorted.bam

What would the regular expression for this look like?

Thank you in advance.

alecxe · Accepted Answer

Assuming the grouping key is everything in the file name that comes before the _L and its three digits, you can use the following regular expression, which captures the key in a capturing group:

([A-Z0-9-_]+)_L\d{3}\.sorted\.bam
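To see what the capturing group picks up, here is a minimal sketch that applies the pattern to a single name. The helper name group_key is hypothetical and not part of the answer, and using re.fullmatch (so that names with extra leading or trailing text are rejected) is an assumption about how strict the match should be.

```python
import re

# Same pattern as above; group 1 is the part before "_L" and three digits.
PATTERN = re.compile(r"([A-Z0-9-_]+)_L\d{3}\.sorted\.bam")


def group_key(filename):
    """Return the grouping key, or None if the name does not fit the pattern.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.fullmatch(filename)
    return match.group(1) if match else None


print(group_key("CL5-B1_S4_L001.sorted.bam"))  # CL5-B1_S4
print(group_key("notes.txt"))                  # None
```

The character class is greedy, so it first swallows the whole stem and then backtracks until _L plus three digits can match; that is why the key stops right before the lane suffix.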

Working example using collections.defaultdict:

from collections import defaultdict
from pprint import pprint
import re

filenames = [
    "C7_S6_L001.sorted.bam",
    "C7_S6_L002.sorted.bam",
    "C7_S6_L003.sorted.bam",
    "C7_S6_L004.sorted.bam",

    "CL3_S8_L001.sorted.bam",
    "CL3_S8_L002.sorted.bam",
    "CL3_S8_L003.sorted.bam",
    "CL3_S8_L004.sorted.bam",

    "CL5-B1_S4_L001.sorted.bam",
    "CL5-B1_S4_L002.sorted.bam",
    "CL5-B1_S4_L003.sorted.bam",
    "CL5-B1_S4_L004.sorted.bam"
]

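# Group 1 captures everything before "_L" and the three lane digits.
pattern = re.compile(r"([A-Z0-9-_]+)_L\d{3}\.sorted\.bam")
# Missing keys start out as an empty list, so we can append right away.
grouped = defaultdict(list)

for filename in filenames:
    match = pattern.search(filename)
    if match:  # names that do not fit the pattern are skipped
        key = match.group(1)
        grouped[key].append(filename)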

pprint(grouped)

Prints:

defaultdict(<class 'list'>,
            {'C7_S6': ['C7_S6_L001.sorted.bam',
                       'C7_S6_L002.sorted.bam',
                       'C7_S6_L003.sorted.bam',
                       'C7_S6_L004.sorted.bam'],
             'CL3_S8': ['CL3_S8_L001.sorted.bam',
                        'CL3_S8_L002.sorted.bam',
                        'CL3_S8_L003.sorted.bam',
                        'CL3_S8_L004.sorted.bam'],
             'CL5-B1_S4': ['CL5-B1_S4_L001.sorted.bam',
                           'CL5-B1_S4_L002.sorted.bam',
                           'CL5-B1_S4_L003.sorted.bam',
                           'CL5-B1_S4_L004.sorted.bam']})
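If some names might not follow the scheme, the same loop can be wrapped in a function that also reports what it could not place. This is only a sketch under assumptions: the function name group_by_sample and the separate list of unmatched names are made up for illustration, fullmatch is used instead of search to be stricter, and the hyphen is moved to the end of the character class, which matches the same characters but reads less ambiguously.

```python
from collections import defaultdict
import re

# Same pattern as in the answer, with the literal hyphen placed last.
PATTERN = re.compile(r"([A-Z0-9_-]+)_L\d{3}\.sorted\.bam")


def group_by_sample(filenames):
    """Return (groups, unmatched); hypothetical helper for illustration."""
    grouped = defaultdict(list)
    unmatched = []
    for name in filenames:
        match = PATTERN.fullmatch(name)
        if match:
            grouped[match.group(1)].append(name)
        else:
            unmatched.append(name)
    # Convert to a plain dict so later lookups do not create empty groups.
    return dict(grouped), unmatched


groups, leftovers = group_by_sample([
    "C7_S6_L001.sorted.bam",
    "C7_S6_L002.sorted.bam",
    "CL5-B1_S4_L001.sorted.bam",
    "README.txt",
])
print(groups)
print(leftovers)
```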
