samthegolden

Reputation: 1490

How to group inside "or" matching in a regex?

I have two kinds of documents to parse:

1545994641 INFO: ...

and

'{"deliveryDate":"1545994641","error"..."}'

I want to extract the timestamp 1545994641 from each of them.

So, I decided to write a regex to match both cases:

(\d{10}\s|\"\d{10}\")

In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):

>>> import re
>>> regex = re.compile(r"(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '

(So far so good.)

However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":

>>> regex = re.compile(r"(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'

What I tried:

I decided to use a non-capturing group for the quotation marks:

(\d{10}\s|(?:\")\d{10}(?:\"))

but it doesn't work as the outer group catches them.

I also removed the outer group, but the result is the same.

Unwanted ways to solve:

Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?

EDIT: As @Amit Bhardwaj noticed, the first case also returns a space after the timestamp. That is another problem (which I haven't figured out), probably with the same solution!

Upvotes: 3

Views: 79

Answers (2)

Wiktor Stribiżew

Reputation: 627100

You may use lookarounds if your code can only access the whole match:

^\d{10}(?=\s)|(?<=")\d{10}(?=")

See the regex demo.

In Python, declare it as

rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'

Pattern details

  • ^\d{10}(?=\s):
    • ^ - string start
    • \d{10} - ten digits
    • (?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
  • | - or
  • (?<=")\d{10}(?="):
    • (?<=") - a positive lookbehind that requires a " char immediately to the left of the current location
    • \d{10} - ten digits
    • (?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
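As a quick check of the pattern details above, here is a minimal sketch that runs the pattern against the two sample documents from the question. The names `samples` and `timestamps` are only illustrative.

```python
import re

# Lookarounds assert the surrounding context without consuming it,
# so the whole match (group 0) is just the ten digits.
rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

# Sample inputs copied from the question.
samples = [
    '1545994641 INFO: ...',
    '{"deliveryDate":"1545994641","error"..."}',
]

timestamps = [rx.search(s).group(0) for s in samples]
print(timestamps)  # ['1545994641', '1545994641']
```

Because nothing but the digits is consumed, this should work even when the calling code can only see the full match.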

Upvotes: 1

dquijada

Reputation: 1699

You could use lookarounds, but I think this solution is simpler, if you can just get the group:

"?(\d{10})(?:\"|\s)
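A minimal sketch of how this could be used from Python, assuming you can read group 1 rather than the whole match. The helper name `extract_timestamp` is hypothetical.

```python
import re

# The digits sit in capture group 1; the optional leading quote and the
# trailing quote or whitespace are matched but stay outside that group.
rx = re.compile(r'"?(\d{10})(?:"|\s)')

def extract_timestamp(text):
    """Return the 10-digit timestamp from text, or None if there is none."""
    m = rx.search(text)
    return m.group(1) if m else None

print(extract_timestamp('1545994641 INFO: ...'))
print(extract_timestamp('{"deliveryDate":"1545994641","error"..."}'))
```

Note that `group(0)` would still include the quotes or the trailing space; only `group(1)` is the bare timestamp.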

EDIT:

Considering that if there is an opening " there must also be a closing ", try this:

(^\d{10}\s|(?<=\")\d{10}(?=\"))

EDIT 2:

To also drop the trailing space from the first case, use a lookahead there too:

(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))
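To see what the lookahead changes, this small sketch compares the two versions on the first kind of document. The variable names are only illustrative.

```python
import re

msg = '1545994641 INFO: ...'

# First version: \s is consumed, so the match includes the trailing space.
with_space = re.compile(r'(^\d{10}\s|(?<=")\d{10}(?="))')
# Second version: (?=\s) only checks for the space without consuming it.
without_space = re.compile(r'(^\d{10}(?=\s)|(?<=")\d{10}(?="))')

first = with_space.search(msg).group(0)
second = without_space.search(msg).group(0)
print(repr(first))   # '1545994641 '
print(repr(second))  # '1545994641'
```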

Upvotes: 1
