Reputation: 1490
I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first alternative of the "or" above, \d{10}\s:
>>> import re
>>> regex = re.compile(r"(\d{10}\s|\"\d{10}\")")
>>> msg = "1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second alternative of the "or", \"\d{10}\", it matches the timestamp together with the quotation marks and groups them. But I just want the timestamp, not the "":
>>> regex = re.compile(r"(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
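A minimal sketch of why the non-capturing attempt changes nothing, using the same sample message as above: group(0) is always the whole match, and the outer capturing group spans the quotes too, so both still include them.

```python
import re

msg = '{"deliveryDate":"1545994641","error"..."}'

# The attempted pattern: non-capturing groups around the quotes,
# but still wrapped in one outer capturing group.
attempt = re.compile(r'(\d{10}\s|(?:")\d{10}(?:"))')
m = attempt.search(msg)

whole = m.group(0)   # the entire match, quotes included
outer = m.group(1)   # the outer group covers the same span
print(whole, outer)  # "1545994641" "1545994641"
```

A non-capturing group only stops its own contents from being numbered; it does not remove them from the overall match.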
Unwanted ways to solve:
I could drop the "" from the regex, but that would also match a timestamp in the middle of the message. I want to capture the timestamp only when it is the value of a key or sits at the beginning of the document, followed by a space.
Is there a way I can match both cases above but, when the second case matches, return only the timestamp? Or is it impossible?
EDIT: As noticed by @Amit Bhardwaj, the first case also returns a space after the timestamp. It's another problem (one I hadn't figured out), probably with the same solution!
Upvotes: 3
Views: 79
Reputation: 627100
You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
See the regex demo.
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
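As a quick check, here is a small sketch that runs this pattern against the two sample messages from the question. The third sample is a made-up line with a timestamp in the middle of the message, added only to illustrate that it is not matched.

```python
import re

rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

samples = [
    '1545994641 INFO: ...',                       # first kind of document
    '{"deliveryDate":"1545994641","error"..."}',  # second kind of document
    'INFO: retry at 1545994641 failed',           # hypothetical: mid-message timestamp
]

results = []
for s in samples:
    m = rx.search(s)
    # group(0) is the whole match; lookarounds consume no characters,
    # so neither the space nor the quotes end up in it.
    results.append(m.group(0) if m else None)

print(results)  # ['1545994641', '1545994641', None]
```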
Pattern details
^\d{10}(?=\s):
  ^ - string start
  \d{10} - ten digits
  (?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
  (?<=") - a positive lookbehind that requires a " char immediately to the left of the current location
  \d{10} - ten digits
  (?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
Upvotes: 1
Reputation: 1699
You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
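A short sketch of how this pattern could be used, reading group(1) instead of the whole match. The sample strings come from the question, except `mid`, which is a hypothetical line added to show a caveat: this first pattern is not anchored, so it also accepts a timestamp in the middle of a message.

```python
import re

rx = re.compile(r'"?(\d{10})(?:"|\s)')

first = rx.search('1545994641 INFO: ...')
second = rx.search('{"deliveryDate":"1545994641","error"..."}')
mid = rx.search('INFO: retry at 1545994641 failed')  # hypothetical input

# group(0) still carries the space or the quotes; group(1) is digits only.
print(repr(first.group(0)), first.group(1))    # '1545994641 ' 1545994641
print(repr(second.group(0)), second.group(1))  # '"1545994641"' 1545994641
print(mid.group(1))                            # 1545994641 (caveat: unanchored match)
```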
EDIT:
If an opening " must always be paired with a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also drop the trailing space from the match, use a lookahead there too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))
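To illustrate the difference between the two edits, this sketch runs both patterns on the first sample message from the question. The variable names are chosen here for illustration only.

```python
import re

msg = '1545994641 INFO: ...'

# EDIT: \s is consumed, so the space is part of the match.
with_space = re.compile(r'(^\d{10}\s|(?<=")\d{10}(?="))')
# EDIT 2: (?=\s) only checks for the space without consuming it.
without_space = re.compile(r'(^\d{10}(?=\s)|(?<=")\d{10}(?="))')

a = with_space.search(msg).group(1)
b = without_space.search(msg).group(1)

print(repr(a))  # '1545994641 '
print(repr(b))  # '1545994641'
print(int(b))   # 1545994641
```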
Upvotes: 1