Reputation: 45

Python: How to skip lines that have extra characters while using Regular Expressions?

When using Regular Expressions, how do you select only text from lines that do not have extra text after your text of interest?

For the following input text, I'd like to select only string1 through string10 and skip strings that have "blah" on the same line.

Input text file:

[random lines of text]
DATE/USER: 07/01/15   string1
[random lines of text]
DATE/USER: 07/12/15   string2
[random lines of text]
DATE/USER: 07/04/15   string3
[random lines of text]
DATE/USER: 07/12/15   string4
[random lines of text]
DATE/USER: 07/05/15   string5      * blah1 *
[random lines of text]
DATE/USER: 07/02/15   string6
[random lines of text]
DATE/USER: 07/08/15   string7
[random lines of text]
DATE/USER: 07/11/15   string8      * blah2 *
[random lines of text]
DATE/USER: 07/03/15   string9
[random lines of text]
DATE/USER: 07/10/15   string10      * blah3 *
[random lines of text]

My current code:

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d).+', line)
if rphfind:
    print(rphfind[0].strip())

Output:

string1
string2
string3
string4
string5      * blah1 *
string6
string7
string8      * blah2 *
string9
string10      * blah3 *

Again, I'm only trying to grab the strings and skip those that have "blah" on the same line. My output should exclude string5, string8, and string10.

Edit: Apologies. I made a couple of edits to refine what I'm trying to achieve.

Upvotes: 2

Answers (3)

Padraic Cunningham

Reputation: 180512

Based on your edit you can definitely split:

with open("in.txt") as f:
    for line in f:
        if line.startswith("DATE/USER:"):
            spl = line.split()
            if len(spl) == 3:
                print(spl[2])

Output:

string1
string2
string3
string4
string6
string7
string9

Using re:

import re

# Anchored at both ends, so anything after the value makes the match fail
r = re.compile(r'^DATE/USER:\s+\d+/\d+/\d+\s+(\w+)$')
with open("in.txt") as f:
    for line in f:
        match = r.search(line)
        if match:
            print(match.group(1))

Output:

string1
string2
string3
string4
string6
string7
string9
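To try the regex approach without creating in.txt, here is a minimal sketch that runs a similar anchored pattern against an in-memory sample. The names extract_clean and SAMPLE are hypothetical, and the sample is a shortened version of the input from the question.

```python
import re

# Anchored at both ends: any text after the value makes the match fail.
PATTERN = re.compile(r'^DATE/USER:\s+\d+/\d+/\d+\s+(\w+)$')

def extract_clean(lines):
    """Return values from DATE/USER lines that have nothing after the value."""
    results = []
    for line in lines:
        # rstrip() drops the newline and any trailing spaces before matching
        match = PATTERN.search(line.rstrip())
        if match:
            results.append(match.group(1))
    return results

SAMPLE = """\
some random text
DATE/USER: 07/01/15   string1
DATE/USER: 07/05/15   string5      * blah1 *
DATE/USER: 07/02/15   string6
"""

print(extract_clean(SAMPLE.splitlines()))  # ['string1', 'string6']
```

The same function also accepts an open file object, since iterating over a file yields lines.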

Upvotes: 3

rkh

Reputation: 1791

The '$' below will exclude any line that has * blah * after the value:

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])$', line)

This will only match A, B, C, D, F, G, I.

Without the '$', the capture group ([A-Z]) still grabs just that single capital letter, but every line is allowed to match (prints A through J in your example):

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])', line)

I'm not exactly sure which version you were looking for.
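For values longer than one letter, such as string1 in the edited question, the same '$' idea should carry over if the capture group is widened. A minimal sketch, assuming the values consist only of word characters; the find_value helper is a hypothetical name:

```python
import re

# Fixed-width lookbehind for the prefix, then the value, then end of line.
# Assumption: values are made of word characters (letters, digits, underscore).
pattern = re.compile(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+(\w+)$')

def find_value(line):
    """Return the value if nothing follows it on the line, else None."""
    found = pattern.findall(line.rstrip('\n'))
    return found[0] if found else None

print(find_value("DATE/USER: 07/01/15   string1"))
print(find_value("DATE/USER: 07/05/15   string5      * blah1 *"))
```

The first call should return the value and the second should return None, because the trailing text keeps '$' from matching.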

Upvotes: 1

Joran Beasley

Reputation: 114068

re.findall(r'DATE/USER: \d\d/\d\d/\d\d\s+([A-Z])', line)

Upvotes: 2
