ARJ

Reputation: 2080

Python regular expression to match strings and collect them into a dictionary

I have a few files in a directory, and I want to match them against a list of strings and collect the matches in a dictionary.

The files in dir look like this:

DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz
DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz
DB_DEF_S1_001_MM_R1.faq.gz
DB_DEF_S1_001_MM_R2.faq.gz

The list contains part of each filename:

ABC
DEF

So here is what I tried,

import os
import re

dir='/user/home/files'
list='/user/home/list'
samp1     = {}
samp2     = {}

FH_sample = open(list, 'r')
for line in FH_sample:
    samp1[line.strip().split('\n')[0]] =[]
    samp2[line.strip().split('\n')[0]] =[]
FH_sample.close()
for file in os.listdir(dir):
    m1 =re.search('(.*)_R1', file)
    m2 = re.search('(.*)_R2', file)
    if m1 and m1.group(1) in samp1:
        samp1[m1.group(1)].append(file)
    if m2 and m2.group(1) in samp2:
        samp2[m2.group(1)].append(file)

I wanted the script above to find the matches from m1 and m2 and collect them in the dictionaries samp1 and samp2. However, the conditions in the if statements are never true, so no files are added to samp1 or samp2.

This is what the output should look like for samp1 and samp2:

{'ABC': ['DB_ABC_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'DB_ABC_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'DEF': ['DB_DEF_S1_001_MM_R1.faq.gz', 'DB_DEF_S1_001_MM_R2.faq.gz']}

Any help would be greatly appreciated

Upvotes: 1

Views: 114

Answers (2)

Jack Moody

Reputation: 1771

You probably don't need most of this code. You could just check whether each substring from list occurs in each filename from dir.

The code below hard-codes the data as lists. You seem to have already read yours in, so it is simply a matter of replacing files with the filenames you read from dir and replacing my_strings with the substrings from list (which you shouldn't use as a variable name, since it shadows Python's built-in list type).

files = ["BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz",
         "BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz"]
my_strings = ["MOHUA", "MSJLF"]

res = {s: [] for s in my_strings}
for k in my_strings:
    for file in files:
        if k in file:
            res[k].append(file)
print(res)
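
As for why the original attempt finds nothing: the greedy group in '(.*)_R1' captures everything before _R1 (for example DB_ABC_2_T_bR_r1_v1_0_S1), and that is never a key such as ABC. If you still want separate dictionaries for R1 and R2, here is a minimal sketch. It assumes the read tag always appears as _R1 or _R2 followed by an underscore or a dot, and group_by_read is a hypothetical helper name, not something from the question.

```python
import re

def group_by_read(files, sample_ids):
    """Return (samp1, samp2): filenames per sample ID for R1 and R2 reads."""
    samp1 = {s: [] for s in sample_ids}
    samp2 = {s: [] for s in sample_ids}
    for name in files:
        for s in sample_ids:
            if s not in name:
                continue
            # Assumption: the read tag is followed by '_' or '.'.
            if re.search(r'_R1(_|\.)', name):
                samp1[s].append(name)
            elif re.search(r'_R2(_|\.)', name):
                samp2[s].append(name)
    return samp1, samp2

# What the original pattern captures: the whole prefix, not the sample ID.
m = re.search('(.*)_R1', 'DB_DEF_S1_001_MM_R1.faq.gz')
print(m.group(1))  # DB_DEF_S1_001_MM
```

Calling group_by_read(files, my_strings) with the lists above should give one dictionary per read direction, each keyed by the substrings.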

Upvotes: 2

d_kennetz

Reputation: 5359

You can pass the Python script the directory path and provide id_list, then use the entries of id_list as dict keys and append each fastq whose filename contains the key:

import os
import sys

dir_path = sys.argv[1]

fastqs=[]
for x in os.listdir(dir_path):
    if x.endswith(".faq.gz"):
        fastqs.append(x)

id_list = ['MOHUA', 'MSJLF']

sample_dict = dict((sample,[]) for sample in id_list)
print(sample_dict)
for k in sample_dict:
    for z in fastqs:
        if k in z:
            sample_dict[k].append(z)

print(sample_dict)

To run:

python3.6 fq_finder.py /path/to/fastqs

Output from the above, to show what is going on:

{'MOHUA': [], 'MSJLF': []} # first print creates dict with empty list as vals for keys
{'MOHUA': ['BSSE_QGF_1987_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R1_001_MM_1.faq.gz', 'BSSE_QGF_1967_HJUS_1_MOHUA_2_T_bR_r1_v1_0_S1_R2_001_MM_1.faq.gz'], 'MSJLF': ['BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R2_001_MM_1.faq.gz', 'BSSE_QGF_18565_H33HLAFXY_1_MSJLF_T_bulk_RNA_S1_R1_001_MM_1.faq.gz']}
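
One caveat with a plain substring test such as k in z: a short ID can also match inside a longer token (for example ABC inside ABCD). If that matters for your filenames, a hedged variant compares whole underscore-delimited tokens instead. It assumes the IDs never contain underscores or dots, and group_by_token is a hypothetical helper name.

```python
def group_by_token(filenames, id_list):
    """Map each ID to the filenames that contain it as a whole '_'-separated token."""
    sample_dict = {sample: [] for sample in id_list}
    for name in filenames:
        # Drop everything from the first dot onward (the extension),
        # then split the remaining stem on underscores.
        tokens = set(name.split('.', 1)[0].split('_'))
        for sample in id_list:
            if sample in tokens:
                sample_dict[sample].append(name)
    return sample_dict
```

In the script above, this would replace the nested loop: sample_dict = group_by_token(fastqs, id_list).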

Upvotes: 1
