Whitespace followed by brackets (non-lazy) in Python using regex

I am trying to do the following: for each string in a list, extract everything before the first occurrence (there may be more than one) of a whitespace followed by a round bracket "(".

I have tried the following:

re.findall(r"(.*)\s\(", line)

but it gives the wrong results for, e.g., the following strings:

Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]

Thanks in advance

Upvotes: 1

Views: 171

Answers (2)

Wiktor Stribiżew

Reputation: 627103

To extract anything before the first occurrence of a whitespace char followed by a round bracket (, you may use re.search (this method returns only the first match) with either of these patterns:

re.search(r'^(.*?)\s\(', text, re.S).group(1)
re.search(r'^\S*(?:\s(?!\()\S*)*', text).group()

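See regex #1 demo and regex #2 demo. Note that the second one, though longer, is much more efficient since it follows the unroll-the-loop principle: it consumes whole runs of non-whitespace chars at once instead of expanding the lazy .*? one char at a time.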

Details

  • ^ - start of string
  • (.*?) - Group 1: any 0+ chars, as few as possible (including line breaks, due to re.S)
  • \s\( - a whitespace and ( char.
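The difference between the greedy pattern from the question and the lazy one above can be checked with a minimal sketch. The sample string is made up for illustration; it only needs to contain more than one whitespace followed by (.

```python
import re

# Hypothetical sample with two " (" occurrences, for illustration only
text = "Carrollton (University of West Georgia) and Dahlonega (North Georgia)"

# Greedy .* grabs the whole string first, then backtracks to the LAST " ("
greedy = re.search(r'^(.*)\s\(', text, re.S).group(1)

# Lazy .*? expands one char at a time and stops at the FIRST " ("
lazy = re.search(r'^(.*?)\s\(', text, re.S).group(1)

print(greedy)  # Carrollton (University of West Georgia) and Dahlonega
print(lazy)    # Carrollton
```

The only change between the two patterns is the ? after .*, which is what the question was missing.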

Or, better:

  • ^\S* - start of string and then 0+ non-whitespace chars
  • (?:\s(?!\()\S*)* - 0 or more occurrences of
    • \s(?!\() - a whitespace char not followed by (
    • \S* - 0+ non-whitespace chars

See Python demo:

import re
strs = ['Isla Vista (University of California, Santa Barbara)[2]','Carrollton (University of West Georgia)[2]','Dahlonega (North Georgia College & State University)[2]']
rx = re.compile(r'^\S*(?:\s(?!\()\S*)*')  # no . in the pattern, so re.S is not needed
for s in strs:
    m = rx.search(s)
    if m:
        print('{} => {}'.format(s, m.group()))
    else:
        print("{}: No match!".format(s))
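One behavioural difference between the two patterns may be worth keeping in mind: when the input contains no whitespace followed by (, regex #1 finds no match at all, while regex #2 still matches and returns the whole string. A small sketch, using a made-up input:

```python
import re

text = "Carrollton"  # hypothetical input without a whitespace + "("

m1 = re.search(r'^(.*?)\s\(', text, re.S)
m2 = re.search(r'^\S*(?:\s(?!\()\S*)*', text)

print(m1)          # None, so calling .group(1) on it would raise AttributeError
print(m2.group())  # Carrollton
```

So with regex #1 the result should be checked for None before calling .group(1), whereas with regex #2 the "no match" branch is effectively never taken.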

Upvotes: 1

Tshiteej

Reputation: 121

You can use a lookahead for this. Try this regex:

[a-z A-Z]+(?=[ ]+[\(]+)
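A possible way to apply this pattern, sketched with re.search so that only the first match per string is returned (the variable names are illustrative, and the sample strings are taken from the question):

```python
import re

# Letters and spaces, followed (via lookahead) by one or more spaces and "("
rx = re.compile(r'[a-z A-Z]+(?=[ ]+[\(]+)')

strs = [
    'Carrollton (University of West Georgia)[2]',
    'Dahlonega (North Georgia College & State University)[2]',
]

results = []
for s in strs:
    m = rx.search(s)  # first match only
    results.append(m.group() if m else None)

print(results)  # ['Carrollton', 'Dahlonega']
```

Note that the character class only covers ASCII letters and spaces, so a name containing digits, punctuation, or accented letters would be matched only partially.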

Upvotes: 1
