Reputation: 5534
I have a text file with two types of lines. One type looks like:
'6-digit-primary-id','6-digit-secondary-id',subject,author,text
The other is just words with no specific pattern. In the former case, I want to know the primary id along with the text and in the latter I want to get the words. What I've tried:
PATTERN = r'[1-9]{6},[1-9]{6},?*,?*,*'
match = re.match(PATTERN,input_line)
if match:
    primary_id = match.group()[0]
    text = match.group()[7]
else:
    text = input_line
But obviously I'm doing something wrong (getting 'invalid syntax').
Can anyone please point me in the right direction?
Upvotes: 1
Views: 84
Reputation: 880877
? has a special meaning in regex patterns: it (greedily) matches 0 or 1 of the preceding regex. So ,? matches a comma or no comma, while ,?* raises a sre_compile.error.
Perhaps you intended the dot (.) instead of ?. The dot matches any character except a newline (unless the re.DOTALL flag is specified).
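import re

# \d{6} matches a 6-digit id; .*? matches subject and author non-greedily
PATTERN = r'(\d{6}),(\d{6}),(.*?),(.*?),(.*)'
match = re.match(PATTERN, input_line)
if match:
    primary_id = match.group(1)  # first capture group
    text = match.group(5)        # fifth capture group
else:
    text = input_line            # free-form line: keep it whole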
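To see the corrected pattern working end to end, here is a minimal sketch. The helper name parse_line and the sample lines are made up for illustration, and it assumes the ids appear as bare 6-digit numbers; if your file really has quotes around the ids, the pattern would need to account for them.

```python
import re

# Pattern from the answer; assumes ids are bare 6-digit numbers (no quotes).
PATTERN = re.compile(r'(\d{6}),(\d{6}),(.*?),(.*?),(.*)')

def parse_line(input_line):
    """Return (primary_id, text); primary_id is None for free-form lines."""
    match = PATTERN.match(input_line)
    if match:
        return match.group(1), match.group(5)
    return None, input_line

# Hypothetical sample lines, invented for this example.
samples = [
    "123456,654321,Greetings,Alice,Hi, how are you?",
    "just some words with no pattern",
]
results = [parse_line(line) for line in samples]
for primary_id, text in results:
    print(primary_id, text)
```

Returning None for the id on free-form lines is just one way to tell the two kinds of line apart; a separate flag would work equally well.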
Some other suggestions:
- Use \d to specify the character pattern [0-9]. Note that this adds 0 to your character class (I assume that is okay). If not, you can stick with [1-9]{6}.
- Use match.group(num) instead of match.group()[num]. (It looks like you want match.group(5) rather than match.group()[7].)
- .* matches as many characters as possible, whereas .*? matches non-greedily. You need to match non-greedily for the subject and author patterns, lest they expand to match the remainder of the entire line.
- An alternative to .*? here would be [^,]*, which matches zero or more characters other than a comma:
PATTERN = r'(\d{6}),(\d{6}),([^,]*),([^,]*),(.*)'
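The difference between the greedy and non-greedy forms only shows up when the text field itself contains commas. A small comparison, using a made-up sample line, might look like this:

```python
import re

# Hypothetical line whose text field contains extra commas.
line = "123456,654321,Greetings,Alice,Hi, how are you, friend"

greedy = re.match(r'(\d{6}),(\d{6}),(.*),(.*),(.*)', line)
lazy = re.match(r'(\d{6}),(\d{6}),(.*?),(.*?),(.*)', line)
no_comma = re.match(r'(\d{6}),(\d{6}),([^,]*),([^,]*),(.*)', line)

# Keep only subject, author, text (groups 3 to 5) for comparison.
greedy_fields = greedy.groups()[2:]
lazy_fields = lazy.groups()[2:]
no_comma_fields = no_comma.groups()[2:]

print(greedy_fields)    # subject swallows part of the line
print(lazy_fields)      # subject and author stop at the first comma
print(no_comma_fields)  # same split as the lazy version
```

With greedy groups the subject grabs everything up to the second-to-last comma, so the text field ends up holding only the tail of the line.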
Upvotes: 2
Reputation: 8501
In regular expressions, * means zero or more occurrences of the preceding element and ? means zero or one occurrence of the preceding element. So ?* is not a valid expression. You are probably confusing it with .*?, which means "any character, zero or more times, but matching as few as possible" (non-greedy).
You probably want:
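You can check this directly by compiling the pattern from the question. This is only a quick sketch; the variable names are arbitrary, and the exact error message may differ between Python versions.

```python
import re

# The pattern as posted in the question.
BROKEN = r'[1-9]{6},[1-9]{6},?*,?*,*'

try:
    re.compile(BROKEN)
    compiled_ok = True
except re.error as exc:
    # The regex engine rejects a quantifier applied to another quantifier.
    compiled_ok = False
    print("regex rejected:", exc)
```

Note that this failure comes from the regex compiler at run time, so it is raised as re.error rather than as a Python SyntaxError.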
PATTERN = r'[1-9]{6},[1-9]{6},.*?,.*?,.*'
Upvotes: 1