Reputation: 2831

Match any word in string except those preceded by a curly brace in python

I have a string like

line = u'I need to match the whole line except for {thisword for example'

I am having difficulty doing this. Here is what I've tried, which doesn't work:

# in general case there will be Unicode characters in the pattern
matchobj = re.search(ur'[^\{].+', line) 

matchobj = re.search(ur'(?!\{).+', line)

Could you please help me figure out what's wrong and how to do it right?

P.S. I don't think I need to substitute "{thisword" with empty string

Upvotes: 1

Answers (3)

Grijesh Chauhan

Reputation: 58271

I am not entirely clear about what you need. From the question title, it looks like you want to find all words in a string (e.g. line) that do not start with {, but you are using the re.search() function, which confuses me.

`re.search()` and `re.findall()`

The function re.search() returns a corresponding MatchObject instance. re.search() is usually used to find and return the first occurrence of a pattern in a long string; it doesn't return all possible matches. See the simple example below:

>>> re.search('a', 'aaa').group(0) # only first match
'a'
>>> re.search('a', 'aaa').group(1) # group(1) is a parenthesized group, not a second match; this pattern has none
Traceback (most recent call last):
  File "<console>", line 1, in <module>
IndexError: no such group

With the regex 'a', re.search() returns only the first match 'a' in the string 'aaa'; it does not return all possible matches.
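A minimal sketch of what the argument to group() selects, written for Python 3 (an assumption; the session above is Python 2). The number picks a parenthesized group inside a single match, not the n-th match in the string. The pattern (a)(b) is only an illustration.

```python
import re

# group(0) is the whole match; group(n) is the n-th parenthesized group.
m = re.search(r'(a)(b)', 'abab')
whole = m.group(0)
first = m.group(1)
second = m.group(2)

# A pattern without parentheses has no group 1, hence the IndexError.
try:
    re.search(r'a', 'aaa').group(1)
    has_group_one = True
except IndexError:
    has_group_one = False

print(whole, first, second, has_group_one)
```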

If your objective is to find all words in a string that do not start with {, you should use the re.findall() function, which matches all occurrences of a pattern, not just the first one as re.search() does. See the example:

>>> re.findall('a', 'aaa')
['a', 'a', 'a']

Edit: Based on a comment, here is one more example demonstrating the use of re.search() and re.findall():

>>> re.search('a+', 'not itnot baaal laaaaaaall ').group()
'aaa'                 # returns ^^^   ^^^^^ doesn't 
>>> re.findall('a+', 'not itnot baaal laaaaaaall ')
['aaa', 'aaaaaaa']    #          ^^^   ^^^^^^^ match both

Here is a good tutorial for Python re module: re – Regular Expressions

Additionally, there is the concept of a group in Python regex: a sub-pattern enclosed in parentheses. If the pattern contains exactly one group, re.findall() returns a list of that group's matches; if it contains more than one group, it returns a list of tuples. See below:

>>> re.findall('(a(b))', 'abab') # 2 groups according to 2 pair of ( )
[('ab', 'b'), ('ab', 'b')] # list of tuples of groups captured

In Python, the regex (a(b)) contains two groups, one per pair of parentheses (this is unlike regular expressions in formal language theory; regexes are not exactly the same as formal regular expressions, but that is a different matter).
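As a rough illustration (Python 3 assumed), the shape of the list returned by re.findall() depends on how many capturing groups the pattern has. The variable names below are hypothetical.

```python
import re

text = 'abab'

no_groups = re.findall(r'ab', text)            # whole matches
one_group = re.findall(r'a(b)', text)          # only the single group
two_groups = re.findall(r'(a(b))', text)       # tuples, one item per group
non_capturing = re.findall(r'(?:a(b))', text)  # (?:...) is not counted as a group

print(no_groups, one_group, two_groups, non_capturing)
```

Using (?:...) for parentheses that are only there for grouping keeps the result a flat list.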

Answer: The words in the sentence line are separated by spaces (or appear at the start of the string), so the regex should be:

ur"(^|\s)(\w+)"

Regex description:

(^|\s) means: the word is either at the start of the string or follows a whitespace character.
(\w+): matches one or more alphanumeric characters, including "_".

On applying regex r to your line:

>>> import pprint    # for pretty-printing; you can ignore these two lines
>>> pp = pprint.PrettyPrinter(indent=4)

>>> r = ur"(^|\s)(\w+)"
>>> L = re.findall(r, line)
>>> pp.pprint(L)
[   (u'', u'I'),
    (u' ', u'need'),
    (u' ', u'to'),
    (u' ', u'match'),
    (u' ', u'the'),
    (u' ', u'whole'),
    (u' ', u'line'),
    (u' ', u'except'),
    (u' ', u'for'),   # notice 'for' after 'for'
    (u' ', u'for'),   # '{thisword' is not included
    (u' ', u'example')]
>>>

To find all words in a single line use:

>>> [t[1] for t in re.findall(r, line)]

Note: this skips { and any other special character in line, because \w only matches alphanumeric characters and '_'.
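A sketch of the same idea in Python 3 (an assumption; the code above uses Python 2's ur"" prefix). There, \w is Unicode-aware by default for str patterns, whereas Python 2 needs the re.UNICODE flag for that. The separator is put in a non-capturing group so that findall returns plain strings. The Cyrillic sample string is hypothetical.

```python
import re

line = 'I need to match the whole line except for {thisword for example'

# (?:...) keeps the separator out of the results.
words = re.findall(r'(?:^|\s)(\w+)', line)

# Hypothetical non-ASCII input: the braced word is skipped here as well.
unicode_words = re.findall(r'(?:^|\s)(\w+)', 'привет {мир снова')

print(words)
print(unicode_words)
```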

If you specifically want to skip a word only when { appears at its start (a { in the middle is allowed), use the regex: r = ur"(^|\s+)(?P<word>[^{]\S*)".

To understand the difference between this regex and the previous one, check this example:

>>> r = ur"(^|\s+)(?P<word>[^{]\S*)"
>>> [t[1] for t in re.findall(r, "I am {not yes{ what")]
['I', 'am', 'yes{', 'what']
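One caveat, sketched in Python 3 (an assumption): [^{] also matches a space, so with two spaces in front of a braced word the regex can backtrack and capture the second space together with the braced word. The stricter pattern below, which also excludes whitespace from the first character, is a hypothetical variant and not part of the answer above.

```python
import re

original = r'(^|\s+)(?P<word>[^{]\S*)'
stricter = r'(^|\s+)(?P<word>[^{\s]\S*)'  # hypothetical variant

text = 'a  {b c'  # two spaces before '{b'

with_original = [t[1] for t in re.findall(original, text)]
with_stricter = [t[1] for t in re.findall(stricter, text)]

print(with_original)
print(with_stricter)
```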

Without Regex:

You could achieve the same thing without any regex, as follows:

>>> [w for w in line.split() if w[0] != '{']

re.sub() to replace pattern

If you just want to remove one or more words that start with {, use re.sub() to replace the pattern with the empty string "". Check the following code:

>>> r = ur"{\w+"
>>> re.findall(r, line)
[u'{thisword']
>>> re.sub(r, "", line)
u'I need to match the whole line except for  for example'
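The result above keeps a double space where the word was removed. A possible variant, sketched in Python 3 (an assumption), also consumes the whitespace in front of the braced word. It would still leave a leading space if the braced word were the first word in the line.

```python
import re

line = 'I need to match the whole line except for {thisword for example'

# \s* also removes the whitespace in front of the braced word.
cleaned = re.sub(r'\s*\{\w+', '', line)

print(cleaned)
```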

Edit: Reply to a comment:

(?P<name>...) is a Python regex extension. It is similar to regular parentheses in that it creates a group, but the group is named and can be accessed via its symbolic group name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. Example 1:

>>> r = "(?P<capture_all_A>A+)"
>>> mo = re.search(r, "aaaAAAAAAbbbaaaaa")
>>> mo.group('capture_all_A')
'AAAAAA'

Example 2: suppose you want to extract the name from a line that may also contain a title (e.g. mr). Use the regex: name_re = "(?P<title>(mr|ms)\.?)? ?(?P<name>[a-z ]*)"

We can then read the name from the input string using group('name'):

>>> re.search(name_re, "mr grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "ms. xyz").group('name')
'xyz'
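As a small sketch (Python 3 assumed), all named groups can also be read at once with the match object's groupdict() method. It reuses the name_re pattern from above; the other variable names are hypothetical.

```python
import re

name_re = r'(?P<title>(mr|ms)\.?)? ?(?P<name>[a-z ]*)'

# groupdict() maps each named group to the text it matched.
fields = re.search(name_re, 'ms. xyz').groupdict()

# A named group that did not take part in the match is reported as None.
no_title = re.search(name_re, 'grijesh chauhan').groupdict()

print(fields)
print(no_title)
```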

Upvotes: 2

l'L'l

Reputation: 47189

Try this pattern:

(.*)(?:\{\w+)\s(.*)

Code:

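import re

p = re.compile(r'(.*)(?:\{\w+)\s(.*)')
# named 'line' rather than 'str' to avoid shadowing the built-in type
line = "I need to match the whole line except for {thisword for example"

m = p.match(line)
# group(1) is the text before the braced word, group(2) the text after it
print(m.group(1) + m.group(2))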

Example:

http://regex101.com/r/wR8eP6

Upvotes: 0

sshashank124

Reputation: 32189

You can simply do:

(?<!{)(\b\w+\b) with the g flag enabled (all matches); Python has no g flag, so use re.findall() to get all matches

Demo: http://regex101.com/r/zA0sL6
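A minimal sketch of how this pattern could be applied in Python 3 (an assumption), where re.findall() plays the role of the g flag. The brace is escaped here only for readability.

```python
import re

line = 'I need to match the whole line except for {thisword for example'

# The lookbehind rejects a word that starts right after '{'; the \b anchors
# stop the match from starting in the middle of that word.
words = re.findall(r'(?<!\{)(\b\w+\b)', line)

print(words)
```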

Upvotes: 1
