Reputation: 1547
I have a file with lines that look like this:
chr5 153584000 153599999 D16073_orphan_reads.fa;709[F18|R11] unkn 1 unkn 2509
chr7 153764000 153775999 D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16] unkn 1 unkn 2510
chr3 127848000 127871999 B15971_orphan_reads.fa;172[F35|R6],D16157-14_orphan_reads.fa;183[F6|R13],14892_orphan_reads.fa;229[F19|R16],USP19283_orphan_reads.fa;336[F10|R6],D15927-14_orphan_reads.fa;176[F11|R10],1007,1007 46 1007 1658
(...)
I want to create a regex that extracts the FASTA file (.fa) names from each line (sometimes there is more than one file per line).
I would like to end up with something like:
D16073_orphan_reads.fa
D16073_orphan_reads.fa, 14892_orphan_reads.fa
B15971_orphan_reads.fa, D16157-14_orphan_reads.fa, 14892_orphan_reads.fa, USP19283_orphan_reads.fa, D15927-14_orphan_reads.fa
I tried:
pattern= re.search(".+.[.fa]", line)
The problem is that the file names are very irregular. The only clues are:
- they end with .fa
- they start after a comma
Thanks
Upvotes: 3
Views: 433
Reputation: 829
Try this pattern: ((?=\w+)[\w-]+\.fa)
See the demo here: https://regex101.com/r/uJ0vD4/3
Explanation
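(?=\w+): a lookahead that checks that one or more word characters follow at the current position, without consuming them. Because of it, a match cannot start with a hyphen.
[\w-]+: this is what is actually matched once the lookahead succeeds: one or more word characters or hyphens.
\.fa: a literal .fa, which must follow immediately. The whole file name is captured by the outer group.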
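To try the pattern outside regex101, here is a minimal sketch in Python. The helper name `extract_fa_names` and the constant `FA_PATTERN` are made up for illustration; only `re.compile` and `findall` come from the standard library.

```python
import re

# Pattern from this answer, written as a raw string so the backslashes
# reach the regex engine unchanged.
FA_PATTERN = re.compile(r"((?=\w+)[\w-]+\.fa)")

def extract_fa_names(line):
    """Return every .fa file name found in one line (hypothetical helper)."""
    # With a single capturing group, findall returns the captured strings.
    return FA_PATTERN.findall(line)

line = ("chr7 153764000 153775999 D16073_orphan_reads.fa;710[F9|R21],"
        "14892_orphan_reads.fa;229[F19|R16] unkn 1 unkn 2510")
print(", ".join(extract_fa_names(line)))
```

For the sample line this should print the two file names separated by a comma and a space.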
Upvotes: 0
Reputation: 5875
The regex ([\w-]+\.fa); used in an re.findall() call will accomplish this.
import re
data = '''chr5 153584000 153599999 D16073_orphan_reads.fa;709[F18|R11] unkn 1 unkn 2509
chr7 153764000 153775999 D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16] unkn 1 unkn 2510
chr3 127848000 127871999 B15971_orphan_reads.fa;172[F35|R6],D16157-14_orphan_reads.fa;183[F6|R13],14892_orphan_reads.fa;229[F19|R16],USP19283_orphan_reads.fa;336[F10|R6],D15927-14_orphan_reads.fa;176[F11|R10],1007,1007 46 1007 1658
'''
for line in data.splitlines():
    # The trailing ';' is matched but not captured, so only the name is returned.
    filenames = re.findall(r'([\w-]+\.fa);', line)
    if filenames:
        print(', '.join(filenames))
Output:
D16073_orphan_reads.fa
D16073_orphan_reads.fa, 14892_orphan_reads.fa
B15971_orphan_reads.fa, D16157-14_orphan_reads.fa, 14892_orphan_reads.fa, USP19283_orphan_reads.fa, D15927-14_orphan_reads.fa
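Since the question mentions that the data lives in a file, the same idea could be applied line by line while reading it. This is only a sketch: the function name `fa_names_per_line` and the file name in the usage comment are assumptions, not something from the question.

```python
import re

# Same regex as above, compiled once. The trailing ';' ties each name to the
# ';<number>[...]' suffix that follows it in the sample lines.
FA_NAME = re.compile(r"([\w-]+\.fa);")

def fa_names_per_line(path):
    """Yield the list of .fa names for each line that contains any (hypothetical helper)."""
    with open(path) as handle:
        for line in handle:
            names = FA_NAME.findall(line)
            if names:
                yield names

# Usage, assuming the input file is called "regions.txt" (made-up name):
# for names in fa_names_per_line("regions.txt"):
#     print(", ".join(names))
```

Using a generator keeps memory use flat for large files, because only one line is held at a time.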
Upvotes: 1