Ryflex

Reputation: 5769

Regex pattern to only accept what's left on that line

I have the following data:

'DOMA A\r\nName: Ryan\r\nBest: 1\r\nAlias: 3K\r\nLocation: Eng\r\nGame Wins: 51\r\nTime: 09:10:50'

I'm having some problems when using regex patterns to find everything...

pattern1 = re.compile('DOMA: (.*)\r\n')
pattern2 = re.compile('Name: (.*)\r\n')
pattern3 = re.compile('Best: (.*)\r\n')
pattern4 = re.compile('Location: (.*)\r\n')
pattern5 = re.compile('Game Wins: (.*)\r\n')
pattern6 = re.compile('Time: (.*)')

All of the above work; however, sometimes my data looks like: 'DOMA A\r\nName: Ryan\r\nBest: 1\r\nAlias: 3K\r\nLocation: Eng\r\nGame Wins: 51\r\nTime: 09:10:50\r\nREF: Yes'

pattern6 returns the wrong result because it doesn't end with \r\n. How can I get around this so that it only returns what's left on its current line?

Is pattern 6 supposed to be like:

pattern6 = re.compile(r'Time: (.*)')

or

pattern6 = re.compile('Time: (.*?)')

or

pattern6 = re.compile(r'Time: (.*?)')

Thanks in advance - Hyflex

Upvotes: 0

Views: 99

Answers (2)

Mike Housky

Reputation: 4069

This is the sort of problem that re.MULTILINE (re.M for short) was made for. Compile the pattern as:

pattern6 = re.compile(r"Time: .*$", flags=re.M)

You can make that more specific by using r"^Time: .*$", requiring "Time: " to start a line, or add some leading space tolerance with r"^\s*Time: .*$".
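To see what the anchors change, here is a small sketch. The sample string, including the "Lap Time" field and the indented line, is made up for illustration and is not taken from the question's data.

```python
import re

# Hypothetical sample: "Lap Time" and the indented "Time" line are invented
# here only to show how each variant of the pattern behaves.
sample = "Lap Time: 00:01:30\nTime: 09:10:50\n  Time: 11:22:33\nREF: Yes"

unanchored = re.compile(r"Time: .*$", flags=re.M)
anchored = re.compile(r"^Time: .*$", flags=re.M)
lenient = re.compile(r"^\s*Time: .*$", flags=re.M)

# Matches "Time: " anywhere in a line, including inside "Lap Time: ".
unanchored_hits = unanchored.findall(sample)

# Matches only lines that begin with "Time: ".
anchored_hits = anchored.findall(sample)

# Also accepts leading whitespace, which becomes part of the match.
lenient_hits = lenient.findall(sample)

print("unanchored: %r" % (unanchored_hits,))
print("anchored:   %r" % (anchored_hits,))
print("lenient:    %r" % (lenient_hits,))
```

Note that \s can also match a newline, so the lenient form may start its match on a preceding blank line; if that matters, [ \t]* is a narrower choice.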

Maybe this is paranoid, but the first thing I'd do before searching is normalize the \r\n newlines to \n. I didn't seem to need this on Windows Python 2.7, but I don't see a guarantee in the docs that all environments will treat \r\n and \n equivalently, and since . matches "\r" while $ only matches just before a "\n", a stray "\r" can otherwise end up at the end of the match. The easy way to do that is re.sub("\r\n", "\n", s), which replaces every "\r\n" in s with a "\n". [Note: The easier way is to use s.replace("\r\n", "\n"), but as I said in the comments, this works.]

s1 = 'DOMA A\r\nName: Ryan\r\nBest: 1\r\nAlias: 3K\r\nLocation: Eng\r\nGame Wins: 51\r\nTime: 09:10:50'
s2 = 'DOMA A\r\nName: Ryan\r\nBest: 1\r\nAlias: 3K\r\nLocation: Eng\r\nGame Wins: 51\r\nTime: 09:10:50\r\nREF: Yes'

print "s1: ", pattern6.findall( re.sub('\r\n', '\n', s1) )
print "s2: ", pattern6.findall( re.sub('\r\n', '\n', s2) )

Output:

s1:  ['Time: 09:10:50']
s2:  ['Time: 09:10:50']

Another advantage here is that ^ and $ don't capture anything, so you don't end up with the \r\n being part of the match, and you don't need to add parentheses to make that happen.
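If you want to check what happens when the newline normalization is skipped, a minimal sketch along these lines should do it. It reuses s2 and pattern6 from above; the alternative pattern at the end (time_only, using a [^\r\n]* character class) is my own variation, not something the question asked for.

```python
import re

s2 = ('DOMA A\r\nName: Ryan\r\nBest: 1\r\nAlias: 3K\r\nLocation: Eng\r\n'
      'Game Wins: 51\r\nTime: 09:10:50\r\nREF: Yes')

pattern6 = re.compile(r"Time: .*$", flags=re.M)

# Without normalization: "." matches "\r", so the carriage return is kept.
raw_hits = pattern6.findall(s2)

# With "\r\n" replaced by "\n" first, the match stops cleanly at the line end.
clean_hits = pattern6.findall(s2.replace('\r\n', '\n'))

# Alternative (an assumption, not from the original answer): exclude both
# line-ending characters explicitly, so no normalization or flag is needed.
time_only = re.compile(r"Time: ([^\r\n]*)")
alt_hits = time_only.findall(s2)

print("raw:   %r" % (raw_hits,))
print("clean: %r" % (clean_hits,))
print("alt:   %r" % (alt_hits,))
```

The character-class version returns just the captured value because of the parentheses; drop them if you want the "Time: " prefix included.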

Upvotes: 1

Jon Clements

Reputation: 142116

Make the delimiter either \r\n or $ (which means "end of string" in a regex). Also, instead of multiple patterns, use one generic pattern, put the results in a dictionary, and look up the named parts afterwards:

s = 'DOMA A\r\nName: Ryan\r\nBest: 1\r\nAlias: 3K\r\nLocation: Eng\r\nGame Wins: 51\r\nTime: 09:10:50'
import re
res = dict(re.findall(r'(.*?): (.*?)(?:\r\n|$)', s))
# {'Name': 'Ryan', 'Alias': '3K', 'Location': 'Eng', 'Time': '09:10:50', 'Game Wins': '51', 'Best': '1'}
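As a quick check that the same pattern copes with the longer input from the question (the one with the trailing REF line), here is a sketch. The variable names fields, time_value and ref_value are just illustrative.

```python
import re

s2 = ('DOMA A\r\nName: Ryan\r\nBest: 1\r\nAlias: 3K\r\nLocation: Eng\r\n'
      'Game Wins: 51\r\nTime: 09:10:50\r\nREF: Yes')

# Each "key: value" pair ends either at "\r\n" or at the end of the string.
fields = dict(re.findall(r'(.*?): (.*?)(?:\r\n|$)', s2))

# Look up individual values; .get() returns None if a key is absent.
time_value = fields.get('Time')
ref_value = fields.get('REF')

print("Time: %r" % (time_value,))
print("REF:  %r" % (ref_value,))
```

The first line, 'DOMA A', has no ": " separator, so it does not show up in the dictionary and would need its own handling.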

Upvotes: 3
