I need some help with a Python program. I've tried many things for hours, but nothing works.
Can anyone help me?
This is what I need:
>sp|
, and the species (second line, between the [])
....
Human AAA111
Mouse BBB222
Fruit fly CCC333
What I have so far:
import re
def main():
    ReadFile()
    file = open ("file.txt", "r")
    FilterOnRegEx(file)
def ReadFile():
    try:
        file = open ("file.txt", "r")
    except IOError:
        print ("File not found!")
    except:
        print ("Something went wrong.")
def FilterOnRegEx(file):
    f = ("[AG].{4}GK[ST]")
    for line in file:
        if f in line:
            print (line)
main()
You're a hero if you help me out!
Upvotes: 1
Views: 2790
Reputation: 15300
This seems to work on your sample text. I don't know if you can have more than one extract per file, and I'm out of time here, so you'll have to extend it if needed:
#!python3
import re
Extract = {}
def match_notes(line):
    global _State
    pattern = r"^\s+(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'notes' not in Extract:
            Extract['notes'] = []
        Extract['notes'].append(m.group(1))
        return True
    else:
        _State = match_sp
        return False
def match_pattern(line):
    global _State
    pattern = r"^\s+Pattern: (.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern'] = m.group(1)
        _State = match_notes
        return True
    return False
def match_sp(line):
    global _State
    pattern = r">sp\|([^|]+)\|(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'sp' not in Extract:
            Extract['sp'] = []
        spinfo = {
            'accession code': m.group(1),
            'other code': m.group(2),
        }
        Extract['sp'].append(spinfo)
        _State = match_sp_note
        return True
    return False
def match_sp_note(line):
    """Second line of >sp paragraph"""
    global _State
    # Free text, then the species between square brackets
    pattern = r"^([^[]*)\[([^]]+)\]"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['note'] = m.group(1).strip()
        spinfo['species'] = m.group(2).strip()
        spinfo['sequence'] = ''
        _State = match_sp_sequence
        return True
    return False
def match_sp_range(line):
    """Last line of >sp paragraph"""
    global _State
    pattern = r"^\s+(\d+) - (\d+):\s+(.*)"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['range'] = (m.group(1), m.group(2))
        spinfo['flags'] = m.group(3)
        _State = match_sp
        return True
    return False
def match_sp_sequence(line):
    """Middle block of >sp paragraph"""
    global _State
    spinfo = Extract['sp'][-1]
    if re.match(r"^\s", line):
        # End of sequence. Check for pattern, reset state for sp
        # re.search finds the motif anywhere in the sequence;
        # re.match would only test the very start of it.
        if re.search(r"[AG].{4}GK[ST]", spinfo['sequence']):
            spinfo['ag_4gkst'] = True
        else:
            spinfo['ag_4gkst'] = False
        _State = match_sp_range
        return False
    spinfo['sequence'] += line.rstrip()
    return True
def match_start(line):
    """Start of outer item"""
    global _State
    pattern = r"^Hits for ([A-Z]+\d+)|([^:]+) : (?:\[occurs (\w+)\])?"
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern_id'] = m.group(1)
        Extract['title'] = m.group(2)
        Extract['occurrence'] = m.group(3)
        _State = match_pattern
        return True
    return False
_State = match_start
def process_line(line):
    while True:
        state = _State
        if state(line):
            return True
        if _State is not state:
            continue
        if len(line) == 0:
            return False
        print("Unexpected line:", line)
        print("State was:", _State)
        return False
def process_file(filename):
    with open(filename, "r") as infile:
        for line in infile:
            process_line(line.rstrip())
process_file("ploop.fa")
import pprint
pprint.pprint(Extract)
Upvotes: 0
Reputation: 12213
My first recommendation is to use a with statement when opening files:
with open("ploop.fa", "r") as file:
    FilterOnRegEx(file)
The problem with your FilterOnRegEx function is this test: if ploop in line. The in operator, with string arguments, searches the string line for the exact text in ploop; it does not interpret that text as a regular expression.
Instead, you need to compile the pattern text into an re object and then search each line for matches:
def FilterOnRegEx(file):
    ploop = ("[AG].{4}GK[ST]")
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            print (line)
This will help you to move forward.
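To make the difference between the two tests concrete, here is a small sketch that runs both on a single line. The sample string is made up for illustration and is not taken from your file:

```python
import re

ploop = "[AG].{4}GK[ST]"
line = "MTEYKLVVVGAGGVGKSALTIQ"  # made-up sequence fragment, for illustration only

# 'in' looks for the literal characters "[AG].{4}GK[ST]" inside the line.
literal_hit = ploop in line

# re.search interprets the same text as a regular expression.
match = re.compile(ploop).search(line)

print(literal_hit)         # False: the line does not contain the pattern text itself
print(match is not None)   # True: the line contains something the pattern describes
if match is not None:
    print(match.group(0))  # the part of the line that matched
```

The match object also tells you what matched and where (match.start(), match.end()), which is useful once you want more than a yes/no answer.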
As a next step, I would suggest learning about generators. Printing the lines that match is great, but it doesn't let you do anything further with them. I might change print to yield so that I could then process the data further, such as extracting the parts you want and reformatting them for output.
As a simple demonstration:
def FilterOnRegEx(file):
    ploop = ("[AG].{4}GK[ST]")
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            yield line
with open("ploop.fa", "r") as file:
    for line in FilterOnRegEx(file):
        print(line)
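As one possible illustration of "processing the data further", the generator could yield the extracted piece instead of the whole line. This is only a sketch: the name find_motifs and the sample lines are my own inventions, and a plain list stands in for the open file.

```python
import re

PLOOP = re.compile(r"[AG].{4}GK[ST]")

def find_motifs(lines):
    """Yield (line_number, matched_text) for every line containing the motif."""
    for number, line in enumerate(lines, start=1):
        match = PLOOP.search(line)
        if match is not None:
            yield number, match.group(0)

# Made-up lines standing in for an open file; any iterable of strings works.
sample = [
    "MTEYKLVVVGAGGVGKSALTIQ\n",
    "no motif here\n",
    "XXAPSGSGKTTLLXX\n",
]

hits = list(find_motifs(sample))
for number, motif in hits:
    print(f"line {number}: {motif}")
```

Because find_motifs accepts any iterable of strings, the same function can be handed the file object from a with block without changes.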
Addendum: I ran the code I posted above on the sample data from your question, and it prints some lines and not others. In other words, the regular expression matched some of the lines and did not match others. So far so good. However, the data you need is not all on one line of the input! That means that filtering individual lines on the pattern is insufficient (unless, of course, the question does not show the correct line breaks). With the data laid out as in the question, you'll need to implement a more robust parser that keeps state, so that it knows when a record begins, when a record ends, and what role any given line plays within a record.
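A minimal sketch of such a parser follows. It assumes a layout I am inferring rather than one I can verify: a header line starting with >sp|, then a description line with the species in square brackets, then unindented sequence lines, with a blank or indented line ending the record. The sample text, the function name parse_records, and the state names are all made up for illustration.

```python
import re

PLOOP = re.compile(r"[AG].{4}GK[ST]")
HEADER = re.compile(r"^>sp\|([^|]+)\|")   # accession code between the first two pipes
SPECIES = re.compile(r"\[([^\]]+)\]")     # text between square brackets

def parse_records(lines):
    """Yield (species, accession, sequence) for each record in the input."""
    state = "idle"  # one of: idle, description, sequence
    accession = species = None
    sequence = []
    for raw in lines:
        line = raw.rstrip()
        header = HEADER.match(line)
        if header:
            if state == "sequence":  # a new header closes an open record
                yield species, accession, "".join(sequence)
            accession, species, sequence = header.group(1), None, []
            state = "description"
        elif state == "description":
            found = SPECIES.search(line)
            species = found.group(1) if found else None
            state = "sequence"
        elif state == "sequence":
            if not line or line[0].isspace():  # blank or indented line ends the record
                yield species, accession, "".join(sequence)
                state = "idle"
            else:
                sequence.append(line)
    if state == "sequence":  # input ended while a record was still open
        yield species, accession, "".join(sequence)

# Made-up sample data following the assumed layout.
SAMPLE = """\
>sp|AAA111|EXAMPLE_1
Example protein one [Human]
MTEYKLVVVGAGGVGKSALTIQ
LIQNHFVDEYDPTIEDSY
      10 - 17:  GAGGVGKS
>sp|BBB222|EXAMPLE_2
Example protein two [Mouse]
MKKLLPTAAAGLLLLAAQPAMA
"""

records = list(parse_records(SAMPLE.splitlines()))
results = [(sp, acc) for sp, acc, seq in records if PLOOP.search(seq)]
for sp, acc in results:
    print(sp, acc)
```

Joining the sequence lines before searching matters here: a motif that straddles a line break would be missed by any line-by-line filter.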
Upvotes: 3