John Rene

Reputation: 33

Match first parenthesis with Python

From a string such as

70849   mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30

I want to get the content of the first pair of parentheses: linux;u;android4.2.1;zh-cn.

My code looks like this:

import re
s = r'70849   mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
re.search(r"(\d+)\s.+\((\S+)\)", s).group(2)

but the result is the content of the last pair of parentheses: khtml,likegecko.

How can I solve this?

Upvotes: 2

Views: 1517

Answers (2)

Wiktor Stribiżew

Reputation: 626747

The main issue is the greedy .+ pattern. It first grabs the rest of the string and then backtracks, giving up one character at a time from the right until the subsequent subpatterns can match. As a result, it matches the last pair of parentheses.
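To make the backtracking behaviour concrete, here is a minimal sketch that runs the question's pattern next to a lazy variant. The lazy variant (.+? and \S+?) is not part of this answer; it is included only as an assumption-labelled contrast, and the negated character class shown further down is usually the more robust choice.

```python
import re

s = ("70849   mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30"
     "(khtml,likegecko)version/4.0mobilesafari/534.30")

# Pattern from the question: the greedy .+ runs to the end of the string,
# then backtracks only as far as the LAST "(" that lets the rest match.
greedy = re.search(r"(\d+)\s.+\((\S+)\)", s)

# Illustrative lazy variant (not from the answer): .+? and \S+? expand
# one character at a time, so they stop at the FIRST "(" and ")".
lazy = re.search(r"(\d+)\s.+?\((\S+?)\)", s)

print(greedy.group(2))  # khtml,likegecko
print(lazy.group(2))    # linux;u;android4.2.1;zh-cn
```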

You can use

^(\d+)\s[^(]+\(([^()]+)\)

Here, [^(]+ restricts matching to characters other than (, so it cannot run to the end of the line and instead stops right before the first pair of parentheses.

Pattern explanation:

  • ^ - start of the string (NOTE: if the number does not appear at the start of the string, remove this ^ anchor)
  • (\d+) - Group 1: 1 or more digits
  • \s - a whitespace character (if it is not required, it can be removed, since the subsequent negated character class will also match the space)
  • [^(]+ - 1 or more characters other than (
  • \( - a literal (
  • ([^()]+) - Group 2: 1 or more characters other than ( and )
  • \) - a literal closing ).
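The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of the same regex in a different layout; the variable names are chosen here for illustration.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so that each part
# carries its own comment.
pattern = re.compile(r"""
    ^           # start of the string
    (\d+)       # group 1: one or more digits
    \s          # one whitespace character
    [^(]+       # one or more characters other than "("
    \(          # a literal opening parenthesis
    ([^()]+)    # group 2: one or more characters other than "(" and ")"
    \)          # a literal closing parenthesis
""", re.VERBOSE)

s = ("70849   mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30"
     "(khtml,likegecko)version/4.0mobilesafari/534.30")

m = pattern.search(s)
if m:
    print(m.groups())  # ('70849', 'linux;u;android4.2.1;zh-cn')
```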

Here is the IDEONE demo:

import re
p = re.compile(r'^(\d+)\s[^(]+\(([^()]+)\)')
test_str = "70849   mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30"
print(p.findall(test_str))
# or using re.search if the number is not at the beginning of the string
m = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', test_str)
if m:
    print("Number: {0}\nString: {1}".format(m.group(1), m.group(2)))
# [('70849', 'linux;u;android4.2.1;zh-cn')]
# Number: 70849
# String: linux;u;android4.2.1;zh-cn

Upvotes: 2

anubhava

Reputation: 785058

You can use a negated character class, \(([^)]*)\), to match anything between ( and ):

>>> import re
>>> s = r'70849  mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
>>> m = re.search(r"(\d+)[^(]*\(([^)]*)\)", s)
>>> print(m.group(1))
70849
>>> print(m.group(2))
linux;u;android4.2.1;zh-cn
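One difference from the [^()]+ pattern in the other answer may matter in practice: [^)]* also accepts an empty pair of parentheses, while [^()]+ requires at least one character and skips ahead to the next non-empty pair. A small sketch, using a made-up input string purely for illustration:

```python
import re

sample = "a()b(c)"  # hypothetical input whose first pair is empty

# [^)]* allows zero characters, so the empty first pair matches.
star = re.search(r"\(([^)]*)\)", sample)

# [^()]+ needs at least one character, so the empty pair is skipped.
plus = re.search(r"\(([^()]+)\)", sample)

print(repr(star.group(1)))  # ''
print(repr(plus.group(1)))  # 'c'
```

Which behaviour is preferable depends on whether an empty () should count as the first match.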

Upvotes: 1
