Reputation: 45
When using regular expressions, how do you select text only from lines that have no extra text after the text of interest?
For the following input text, I'd like to select only string1 through string10 and skip strings that have "blah" on the same line.
Input text file:
[random lines of text]
DATE/USER: 07/01/15 string1
[random lines of text]
DATE/USER: 07/12/15 string2
[random lines of text]
DATE/USER: 07/04/15 string3
[random lines of text]
DATE/USER: 07/12/15 string4
[random lines of text]
DATE/USER: 07/05/15 string5 * blah1 *
[random lines of text]
DATE/USER: 07/02/15 string6
[random lines of text]
DATE/USER: 07/08/15 string7
[random lines of text]
DATE/USER: 07/11/15 string8 * blah2 *
[random lines of text]
DATE/USER: 07/03/15 string9
[random lines of text]
DATE/USER: 07/10/15 string10 * blah3 *
[random lines of text]
My current code:
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d).+', line)
if rphfind:
    print(rphfind[0].strip())
Output:
string1
string2
string3
string4
string5 * blah1 *
string6
string7
string8 * blah2 *
string9
string10 * blah3 *
Again, I'm only trying to grab the strings and skip those that have "blah" on the same line. My output should exclude string5, string8, and string10.
Edit: Apologies. Made a couple of edits to refine what I'm trying to achieve.
Upvotes: 2
Views: 602
Reputation: 180512
Based on your edit you can definitely split:
with open("in.txt") as f:
    for line in f:
        if line.startswith("DATE/USER:"):
            spl = line.split()
            if len(spl) == 3:
                print(spl[2])
Output:
string1
string2
string3
string4
string6
string7
string9
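The same split logic can be tried without a file. This is only a sketch: the function name users_without_notes and the in-memory sample are made up for illustration, and it assumes the string of interest never contains whitespace.

```python
def users_without_notes(lines):
    """Yield the third field of DATE/USER lines that have exactly three fields."""
    for line in lines:
        if line.startswith("DATE/USER:"):
            parts = line.split()
            # "DATE/USER:", the date, and the string: any extra field means trailing text.
            if len(parts) == 3:
                yield parts[2]


# Hypothetical sample mirroring the layout of the question's input.
sample = """\
[random lines of text]
DATE/USER: 07/01/15 string1
DATE/USER: 07/05/15 string5 * blah1 *
[random lines of text]
DATE/USER: 07/02/15 string6
"""

result = list(users_without_notes(sample.splitlines()))
print(result)
```

The len(parts) == 3 check is what drops the "blah" lines, since those split into more than three fields.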
Using re:
import re

r = re.compile(r'(^DATE/USER:\s+\d+/\d+/\d+\s+(\w+$))')
with open("in.txt") as f:
    for line in f:
        match = r.search(line)
        if match:
            print(match.group(2))
Output:
string1
string2
string3
string4
string6
string7
string9
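If the whole file is already in memory, a variation is to scan it in one pass with re.MULTILINE, so that ^ and $ anchor at every line. This is a sketch rather than part of the answer above: the sample text is made up, and [ \t] is used instead of \s so that a match cannot spill over a newline into the next line.

```python
import re

# With re.MULTILINE, ^ and $ match at the start and end of every line.
pattern = re.compile(
    r"^DATE/USER:[ \t]+\d\d/\d\d/\d\d[ \t]+(\w+)[ \t]*$",
    re.MULTILINE,
)

# Hypothetical sample mirroring the layout of the question's input.
text = """\
[random lines of text]
DATE/USER: 07/01/15 string1
DATE/USER: 07/05/15 string5 * blah1 *
[random lines of text]
DATE/USER: 07/02/15 string6
DATE/USER: 07/10/15 string10 * blah3 *
"""

# With a single capture group, findall returns the captured strings.
matches = pattern.findall(text)
print(matches)
```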
Upvotes: 3
Reputation: 1791
The '$' anchor in the pattern below excludes any line that has * blah * after the text of interest:
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])$', line)
This will only match A, B, C, D, F, G, and I.
Without the '$', the capture group ([A-Z]) still grabs just that single capital letter, but any line is allowed to match (this prints A through J in your example):
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])', line)
I'm not exactly sure which version you were looking for.
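For the string1 through string10 input shown in the question, [A-Z] would only ever match one capital letter, so it would need to become something like \w+. A minimal sketch, assuming the values consist only of word characters; the sample lines are made up to mirror the question's layout:

```python
import re

# Fixed-width lookbehind as in the question; \w+ captures multi-character
# values such as string3, and $ rejects lines with trailing text.
pattern = re.compile(r"(?<=DATE/USER: \d\d/\d\d/\d\d)\s+(\w+)$")

lines = [
    "DATE/USER: 07/04/15 string3\n",
    "DATE/USER: 07/11/15 string8 * blah2 *\n",
    "DATE/USER: 07/03/15 string9\n",
]

kept = []
for line in lines:
    found = pattern.findall(line)
    if found:
        kept.append(found[0])
print(kept)
```

Without re.MULTILINE, '$' still matches just before a single trailing newline, so the lines do not need to be stripped first.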
Upvotes: 1