Reputation: 1307
I am doing the challenges on pythonchallenge.com and I am having trouble with general regex.
For example, if we have the following text:
hello world
<!--
%%$@_$^__#)^)&!_+]!*@&^}@[@%]()%+$&[(_@%+%$*^@$^!+]!&_#)_*}{}}!}_]$[%}@[{_@#_^{*
@##&{#&{&)*%(]{{([*}@[@&]+!!*{)!}{%+{))])[!^})+)$]#{*+^((@^@}$[*a*$&^{$!@#$%)!@(&bc
And I want to get the characters a and b and c into the string (from the above string) (but not hello world) how can I do this?
I understand I can do the following in python:
x = "".join(re.findall("regex", data))
However, I am having problems with the regex expression. I am testing it out on regex tester, and it doesn't seem to be doing what I want it to do
Here is my regex expression
<!--[a-z]*
From my understanding, (after reading regex-expression.info tutorials) this expression should find all characters after the specified string: outputting abc
However, this does not work. It is my understanding that this is not a special character either, as it is not either of [\^$.|?*+().
How can I make this regex expression work for how I want it to? To include abc but not hello world?
Upvotes: 1
Views: 147
Reputation: 27585
import re
su = '''hello world
xxxx hello world yyyy
<!--
_+]!yuyu*@&^}@?!hello world[@%]^@}$[*a*$&^!@(&bc??,=hello'''
print su
pat = '([a-z]+)(?![a-z])(?<!world)'
print "\nexcluding all the words 'world'\n%s" % pat
print re.findall(pat,su)
pat = '([a-z]+)(?![a-z])(?<!\Ahello world)'
print "\nexcluding the word 'world' of the starting string 'hello world'\n%s" % pat
print re.findall(pat,su)
pat = '([a-z]+)(?![a-z])(?<!hello world)'
print "\nexcluding all the words 'world' of a string 'hello world'\n%s" % pat
print re.findall(pat,su)
print '\n-----------'
pat = '([a-z]+)(?![a-z])(?<!hello)'
print "\nexcluding all the words 'hello'\n%s" % pat
print re.findall(pat,su)
pat = '([a-z]+)(?![a-z])(?<!\Ahello)'
print "\nexcluding the starting word 'hello'\n%s" % pat
print re.findall(pat,su)
pat = '([a-z]+)(?![a-z])(?<!hello(?= world))'
print "\nexcluding all the words 'hello' of a string 'hello world'\n%s" % pat
print re.findall(pat,su)
print '\n-----------'
pat = '([a-z]+)(?![a-z])(?<!hello|world)'
print "\nexcluding all the words 'hello' and 'world'\n%s" % pat
print re.findall(pat,su)
pat = '([a-z]+)(?![a-z])(?<!hello(?= world))(?<!hello world)'
print "\nexcluding all the words of a string 'hello world'\n%s" % pat
print re.findall(pat,su)
pat = '([a-z]+)(?![a-z])(?<!\Ahello(?= world))(?<!\Ahello world)'
print "\nexcluding all the words of the starting string 'hello world'\n%s" % pat
print re.findall(pat,su)
result
hello world
xxxx hello world yyyy
<!--
_+]!yuyu*@&^}@?!hello world[@%]^@}$[*a*$&^!@(&bc??,=hello
excluding all the words 'world'
([a-z]+)(?![a-z])(?<!world)
['hello', 'xxxx', 'hello', 'yyyy', 'yuyu', 'hello', 'a', 'bc', 'hello']
excluding the word 'world' of the starting string 'hello world'
([a-z]+)(?![a-z])(?<!\Ahello world)
['hello', 'xxxx', 'hello', 'world', 'yyyy', 'yuyu', 'hello', 'world', 'a', 'bc', 'hello']
excluding all the words 'world' of a string 'hello world'
([a-z]+)(?![a-z])(?<!hello world)
['hello', 'xxxx', 'hello', 'yyyy', 'yuyu', 'hello', 'a', 'bc', 'hello']
-----------
excluding all the words 'hello'
([a-z]+)(?![a-z])(?<!hello)
['world', 'xxxx', 'world', 'yyyy', 'yuyu', 'world', 'a', 'bc']
excluding the starting word 'hello'
([a-z]+)(?![a-z])(?<!\Ahello)
['world', 'xxxx', 'hello', 'world', 'yyyy', 'yuyu', 'hello', 'world', 'a', 'bc', 'hello']
excluding all the words 'hello' of a string 'hello world'
([a-z]+)(?![a-z])(?<!hello(?= world))
['world', 'xxxx', 'world', 'yyyy', 'yuyu', 'world', 'a', 'bc', 'hello']
-----------
excluding all the words 'hello' and 'world'
([a-z]+)(?![a-z])(?<!hello|world)
['xxxx', 'yyyy', 'yuyu', 'a', 'bc']
excluding all the words of a string 'hello world'
([a-z]+)(?![a-z])(?<!hello(?= world))(?<!hello world)
['xxxx', 'yyyy', 'yuyu', 'a', 'bc', 'hello']
excluding all the words of the starting string 'hello world'
([a-z]+)(?![a-z])(?<!\Ahello(?= world))(?<!\Ahello world)
['xxxx', 'hello', 'world', 'yyyy', 'yuyu', 'hello', 'world', 'a', 'bc', 'hello']
And if you want to catch only after a certain pattern in the analyzed string:
print su
print "\ncatching all the lettered strings after <!--"
print "re.compile('^.+?<!--|([a-z]+)',re.DOTALL)"
rgx = re.compile('^.+?<!--|([a-z]+)',re.DOTALL)
print [x.group(1) for x in rgx.finditer(su) if x.group(1)]
print ("\ncatching all the lettered strings after <!--\n"
"excluding all the words 'world'")
print "re.compile('^.+?<!--|([a-z]+)(?<!world)',re.DOTALL)"
rgx = re.compile('^.+?<!--|([a-z]+)(?![a-z])(?<!world)',re.DOTALL)
print [x.group(1) for x in rgx.finditer(su) if x.group(1)]
print ("\ncatching all the lettered strings after <!--\n"
"excluding all the words 'hello'")
print "re.compile('^.+?<!--|([a-z]+)(?<!hello)',re.DOTALL)"
rgx = re.compile('^.+?<!--|([a-z]+)(?![a-z])(?<!hello)',re.DOTALL)
print [x.group(1) for x in rgx.finditer(su) if x.group(1)]
print ("\ncatching all the lettered strings after <!--\n"
"excluding all the words 'hello' belonging to a string 'hello world'")
print "re.compile('^.+?<!--|([a-z]+)(?<!hello(?= world))',re.DOTALL)"
rgx = re.compile('^.+?<!--|([a-z]+)(?![a-z])(?<!hello(?= world))',re.DOTALL)
print [x.group(1) for x in rgx.finditer(su) if x.group(1)]
result
hello world
xxxx hello world yyyy
<!--
_+]!yuyu*@&^}@?!hello world[@%]^@}$[*a*$& <!-- ^!@(&bc??,=hello
catching all the lettered strings after first <!--
re.compile('.+?<!--|([a-z]+)',re.DOTALL)
['yuyu', 'hello', 'world', 'a', 'bc', 'hello']
catching all the lettered strings after first <!--
excluding all the words 'world'
re.compile('.+?<!--|([a-z]+)(?<!world)',re.DOTALL)
['yuyu', 'hello', 'a', 'bc', 'hello']
catching all the lettered strings after first <!--
excluding all the words 'hello'
re.compile('.+?<!--|([a-z]+)(?<!hello)',re.DOTALL)
['yuyu', 'world', 'a', 'bc']
catching all the lettered strings after first <!--
excluding all the words 'hello' belonging to a string 'hello world'
re.compile('.+?<!--|([a-z]+)(?<!hello(?= world))',re.DOTALL)
['yuyu', 'world', 'a', 'bc', 'hello']
Upvotes: 2
Reputation: 251011
>>> import re
>>> print strs = """hello world
<!--
%%$@_$^__#)^)&!_+]!*@&^}@[@%]()%+$&[(_@%+%$*^@$^!+]!&_#)_*}{}}!}_]$[%}@[{_@#_^{*
@##&{#&{&)*%(]{{([*}@[@&]+!!*{)!}{%+{))])[!^})+)$]#{*+^((@^@}$[*a*$&^{$!@#$%)!@(&bc"""
>>> re.findall(r'[a-zA-Z]+',strs.split('<!--')[-1])
['a', 'bc']
Upvotes: 1