Shmoopy

Reputation: 5534

Simple regular expression in python

I have a text file with two types of lines. One type looks like:

'6-digit-primary-id','6-digit-secondary-id',subject,author,text

The other is just words with no specific pattern. In the former case, I want the primary id along with the text; in the latter, I want the words themselves. What I've tried:

PATTERN = r'[1-9]{6},[1-9]{6},?*,?*,*'
match = re.match(PATTERN,input_line)
if match:
    primary_id = match.group()[0]
    text = match.group()[7]
else:
    text = input_line

But obviously I'm doing something wrong (I'm getting 'invalid syntax').

Can anyone please point me in the right direction?

Upvotes: 1

Views: 84

Answers (2)

unutbu

Reputation: 880877

? has a special meaning in regex patterns. It (greedily) matches 0 or 1 of the preceding regex. So ,? matches a comma or no comma. ,?* applies a second quantifier directly to the first, so compiling it raises re.error ("multiple repeat").

Perhaps you intended . instead of ?. It matches any character except a newline (unless the re.DOTALL flag is specified).

PATTERN = r'(\d{6}),(\d{6}),(.*?),(.*?),(.*)'
match = re.match(PATTERN, input_line)
if match:
    primary_id = match.group(1)
    text = match.group(5)
else:
    text = input_line
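
To see the compile error for yourself, here is a minimal sketch. The helper name compiles is made up for illustration; the two patterns are the one from the question and the one above.

```python
import re

def compiles(pattern):
    """Return True if `pattern` is a valid regular expression (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for malformed patterns, e.g. a quantifier applied to a quantifier.
        return False

broken = r'[1-9]{6},[1-9]{6},?*,?*,*'          # pattern from the question
fixed = r'(\d{6}),(\d{6}),(.*?),(.*?),(.*)'    # pattern suggested above

print(compiles(broken))  # False
print(compiles(fixed))   # True
```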

Some other suggestions:

  • You can use \d to specify the character pattern [0-9]. Note that this is adding 0 to your character class. (I assume that is okay). If not you can stick with [1-9]{6}.
  • If you put groups in your regex pattern, then you can specify the parts using match.group(num) instead of match.group()[num]. (And it looks like you want match.group(5) rather than match.group()[7].)
  • The pattern .* matches as many characters as possible. .*? matches non-greedily, i.e. as few characters as possible. You need to match non-greedily for the subject and author patterns, lest they expand across commas and swallow part of the text whenever the text itself contains a comma.
  • An alternative to .*? here would be [^,]*. This matches 0-or-more characters other than a comma.

    PATTERN = r'(\d{6}),(\d{6}),([^,]*),([^,]*),(.*)'
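
The suggestions above can be compared side by side with a short sketch. The sample line, the constant names and the helpers fields and parse_line are made up for illustration, and the ids are assumed to be unquoted, as in the patterns above.

```python
import re

# Hypothetical sample line whose text field contains a comma.
LINE = '123456,654321,greeting,alice,hello, world'

GREEDY = r'(\d{6}),(\d{6}),(.*),(.*),(.*)'
LAZY = r'(\d{6}),(\d{6}),(.*?),(.*?),(.*)'
NO_COMMA = r'(\d{6}),(\d{6}),([^,]*),([^,]*),(.*)'

def fields(pattern, line):
    """Return (subject, author, text) for a matching line, else None."""
    match = re.match(pattern, line)
    return match.groups()[2:] if match else None

print(fields(GREEDY, LINE))    # ('greeting,alice', 'hello', ' world')
print(fields(LAZY, LINE))      # ('greeting', 'alice', 'hello, world')
print(fields(NO_COMMA, LINE))  # ('greeting', 'alice', 'hello, world')

def parse_line(line, pattern=NO_COMMA):
    """Return (primary_id, text); primary_id is None for free-form lines."""
    match = re.match(pattern, line)
    if match:
        return match.group(1), match.group(5)
    return None, line

print(parse_line(LINE))               # ('123456', 'hello, world')
print(parse_line('just some words'))  # (None, 'just some words')
```

With the greedy version, the subject group runs across a comma and the text is cut down to its last comma-separated piece, which is why the other two variants are preferable here.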
    

Upvotes: 2

Cilyan

Reputation: 8501

In regular expressions, * means zero or more occurrences of the preceding element and ? means zero or one occurrence of the preceding element. So ?* is not a valid expression. You are probably confusing it with .*?, which means "any character, zero or more times, but matching as few as possible" (non-greedy).

You probably want

PATTERN = r'[1-9]{6},[1-9]{6},.*?,.*?,.*'
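
Note that this pattern has no capturing groups, so it only tells you whether a line has the expected shape. A small sketch (the sample line is made up) shows what match.group() returns in that case and why indexing it, as in the question, yields single characters rather than fields:

```python
import re

PATTERN = r'[1-9]{6},[1-9]{6},.*?,.*?,.*'
line = '123456,654321,greeting,alice,hello'  # hypothetical input

match = re.match(PATTERN, line)

whole = match.group()          # the entire matched string
first_char = match.group()[0]  # indexing a string gives one character
groups = match.groups()        # empty: the pattern has no parentheses

print(whole)       # 123456,654321,greeting,alice,hello
print(first_char)  # 1
print(groups)      # ()
```

To pull out the primary id and the text, wrap the parts you need in parentheses and use match.group(1), match.group(5) and so on.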

Upvotes: 1
