user5520815

How to extract data from a text file based on a regular expression pattern

I need some help with a Python program. I've tried many things over several hours, but nothing works.

Anyone who can help me?

This is what I need:

....

Human                         AAA111
Mouse                         BBB222
Fruit fly                     CCC333

What I have so far:

import re

def main():
    ReadFile()
    file = open ("file.txt", "r")
    FilterOnRegEx(file)

def ReadFile():
    try:
        file = open ("file.txt", "r")
    except IOError:
        print ("File not found!")
    except:
        print ("Something went wrong.")

def FilterOnRegEx(file):
    f = ("[AG].{4}GK[ST]")
    for line in file:
        if f in line:
            print (line)


main()

You're a hero if you help me out!

Upvotes: 1

Views: 2790

Answers (2)

aghast

Reputation: 15300

This seems to work on your sample text. I don't know if you can have more than one extract per file, and I'm out of time here, so you'll have to extend it if needed:

#!python3
import re

Extract = {}

def match_notes(line):
    global _State
    pattern = r"^\s+(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'notes' not in Extract:
            Extract['notes'] = []

        Extract['notes'].append(m.group(1))
        return True
    else:
        _State = match_sp
        return False

def match_pattern(line):
    global _State
    pattern = r"^\s+Pattern: (.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern'] = m.group(1)
        _State = match_notes
        return True
    return False

def match_sp(line):
    global _State
    pattern = r">sp\|([^|]+)\|(.*)$"
    m = re.match(pattern, line.rstrip())
    if m:
        if 'sp' not in Extract:
            Extract['sp'] = []
        spinfo = {
            'accession code': m.group(1),
            'other code': m.group(2),
        }
        Extract['sp'].append(spinfo)
        _State = match_sp_note
        return True
    return False

def match_sp_note(line):
    """Second line of >sp paragraph"""
    global _State
    pattern = r"^([^[]*)\[([^]]+)\]"  # text before the brackets, then the bracketed species
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['note'] = m.group(1).strip()
        spinfo['species'] = m.group(2).strip()
        spinfo['sequence'] = ''
        _State = match_sp_sequence
        return True
    return False

def match_sp_range(line):
    """Last line of >sp paragraph"""
    global _State
    pattern = r"^\s+(\d+) - (\d+):\s+(.*)"
    m = re.match(pattern, line.rstrip())
    if m:
        spinfo = Extract['sp'][-1]
        spinfo['range'] = (m.group(1), m.group(2))
        spinfo['flags'] = m.group(3)
        _State = match_sp
        return True
    return False

def match_sp_sequence(line):
    """Middle block of >sp paragraph"""
    global _State

    spinfo = Extract['sp'][-1]

    if re.match(r"^\s", line):
        # End of sequence. Check for pattern, reset state for sp.
        # re.search scans the whole sequence; re.match would only test its start.
        if re.search(r"[AG].{4}GK[ST]", spinfo['sequence']):
            spinfo['ag_4gkst'] = True
        else:
            spinfo['ag_4gkst'] = False

        _State = match_sp_range
        return False

    spinfo['sequence'] += line.rstrip()
    return True

def match_start(line):
    """Start of outer item"""
    global _State
    pattern = r"^Hits for ([A-Z]+\d+)\|([^:]+) : (?:\[occurs (\w+)\])?"  # \| is a literal pipe, not alternation
    m = re.match(pattern, line.rstrip())
    if m:
        Extract['pattern_id'] = m.group(1)
        Extract['title'] = m.group(2)
        Extract['occurrence'] = m.group(3)
        _State = match_pattern
        return True
    return False

_State = match_start

def process_line(line):
    while True:
        state = _State
        if state(line):
            return True

        if _State is not state:
            continue

        if len(line) == 0:
            return False

        print("Unexpected line:", line)
        print("State was:", _State)
        return False

def process_file(filename):
    with open(filename, "r") as infile:
        for line in infile:
            process_line(line.rstrip())

process_file("ploop.fa")
import pprint
pprint.pprint(Extract)
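
If you then want the two-column listing from your question, something along these lines might work. This is only a sketch: I'm assuming the columns are the species and the accession code, the helper name format_hits is made up, and the sample dictionary is invented to mirror the shape of Extract rather than taken from your file.

```python
# Invented sample shaped like the Extract dictionary built by the script above
extract = {
    'sp': [
        {'accession code': 'AAA111', 'species': 'Human', 'ag_4gkst': True},
        {'accession code': 'ZZZ999', 'species': 'Yeast', 'ag_4gkst': False},
        {'accession code': 'BBB222', 'species': 'Mouse', 'ag_4gkst': True},
    ]
}

def format_hits(extract):
    """Return one 'species  accession' line per entry whose sequence matched."""
    return [
        f"{sp['species']:<30}{sp['accession code']}"  # species padded to 30 columns
        for sp in extract.get('sp', [])
        if sp.get('ag_4gkst')
    ]

lines = format_hits(extract)
print("\n".join(lines))
```

Entries whose sequence did not contain the motif are skipped, so only the Human and Mouse rows are printed here.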

Upvotes: 0

dsh

Reputation: 12213

My first recommendation is to use a with statement when opening files:

with open("ploop.fa", "r") as file:
    FilterOnRegEx(file)

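The problem with your FilterOnRegEx function is the test if f in line (I call the pattern string ploop in my code). With string operands, the in operator searches line for the exact text of the pattern string, so the pattern is never interpreted as a regular expression.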
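
To make the difference concrete, here is a small sketch. The sequence fragment is made up for illustration and does not come from your file.

```python
import re

ploop = "[AG].{4}GK[ST]"
line = "MSTAGPSGSGKSTLLK"  # made-up fragment, not from the question's data

# `in` looks for the literal characters "[AG].{4}GK[ST]" inside the line
literal_hit = ploop in line

# re.search interprets the same text as a regular expression
regex_hit = re.search(ploop, line)

print(literal_hit)                                  # the literal text is not there
print(regex_hit.group(0) if regex_hit else None)    # the part of the line that matched
```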

Instead, you need to compile the pattern text into a regular expression object and then search each line for matches:

def FilterOnRegEx(file):
    ploop = ("[AG].{4}GK[ST]")
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            print (line)

This will help you to move forward.

As a next step, I would suggest learning about generators. Printing the matching lines is fine, but it doesn't let you do anything further with them. I would change print to yield so that the caller can process the data further, for example by extracting the parts you want and reformatting them for output.

As a simple demonstration:

def FilterOnRegEx(file):
    ploop = ("[AG].{4}GK[ST]")
    pattern = re.compile(ploop)
    for line in file:
        match = pattern.search(line)
        if match is not None:
            yield line

with open("ploop.fa", "r") as file:
    for line in FilterOnRegEx(file):
        print(line)
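
Going one step further, a generator can yield the matched text itself so that later steps can work with it. This is a sketch only: the name find_motifs and the in-memory sample lines are invented here for illustration.

```python
import re

PLOOP = re.compile(r"[AG].{4}GK[ST]")

def find_motifs(lines):
    """Yield (line_number, matched_text) for every line containing the motif."""
    for number, line in enumerate(lines, start=1):
        match = PLOOP.search(line)
        if match is not None:
            yield number, match.group(0)

# Made-up lines; any iterable of strings works, including an open file
sample = ["MKVLAAGIVG", "TTGAPGSGKTTL", "no motif here"]
hits = list(find_motifs(sample))
for number, text in hits:
    print(f"line {number}: {text}")
```

Because find_motifs accepts any iterable of strings, the same function can be fed an open file object instead of the sample list.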


Addendum: I ran the code I posted above on the sample data from your question, and it prints some lines and not others. In other words, the regular expression matches some lines and not others. So far so good. However, the data you need is not all on one line of the input, so filtering individual lines on the pattern is insufficient (unless the line breaks in the question are not displayed correctly). With the data laid out as in the question, you'll need a more robust, stateful parser that knows when a record begins, when it ends, and what role any given line plays within a record.
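
A rough sketch of that idea follows. It rests on an assumption I can't confirm from the question: that each record starts with a header line beginning with ">" and that every following non-empty line, up to the next header, is a piece of one sequence. Your real file likely has extra lines per record that need their own handling. The function names and sample lines are made up.

```python
import re

PLOOP = re.compile(r"[AG].{4}GK[ST]")

def read_records(lines):
    """Group lines into (header, sequence) pairs.

    Assumption: a line starting with '>' opens a record, and all
    following non-empty lines belong to that record's sequence.
    """
    header, parts = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)  # previous record is complete
            header, parts = line[1:], []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)  # flush the last record

def records_with_motif(lines):
    """Yield the header of every record whose joined sequence has the motif."""
    for header, sequence in read_records(lines):
        if PLOOP.search(sequence):
            yield header

sample = [
    ">rec1\n",
    "MKTTGAPG\n",
    "SGKTTLAA\n",  # the motif straddles the line break above
    ">rec2\n",
    "MKVLLLPP\n",
]
matching = list(records_with_motif(sample))
print(matching)
```

In the sample, neither line of rec1 contains the motif on its own, but the joined sequence does, which is exactly the case that line-by-line filtering misses.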

Upvotes: 3
