Reputation: 9604
I have this piece of code that finds words that begin with @ or #,
p = re.findall(r'@\w+|#\w+', str)
Now what irks me about this is repeating \w+. I am sure there is a way to do something like
p = re.findall(r'(@|#)\w+', str)
That will produce the same result, but it doesn't; it instead returns only # and @. How can that regex be changed so that I am not repeating the \w+? This code comes close,
p = re.findall(r'((@|#)\w+)', str)
But it returns [('@many', '@'), ('@this', '@'), ('#tweet', '#')]
(notice the extra '@', '@', and '#').
Also, if I'm repeating this re.findall code 500,000 times, can this be compiled into a pattern so that it runs faster?
Upvotes: 2
Views: 265
Reputation: 383746
You have two options:
A non-capturing group: (?:@|#)\w+
A character class: [@#]\w+
The problem you were having is due to how findall returns matches depending on how many capturing groups are present.
Let's take a closer look at this pattern (annotated to show the groups):
((@|#)\w+)
|\___/   |
|group 2 |    # Read about groups to understand
\________/    # how they're defined and numbered/named
 group 1
Capturing groups let us save what each subpattern matched, in addition to the overall match.
import re

p = re.compile(r'((@|#)\w+)')
m = p.match('@tweet')
print(m.group(1))  # group 1: the whole tag
# @tweet
print(m.group(2))  # group 2: only the leading symbol
# @
Now let's take a look at the Python documentation for the re module:
findall: Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This explains why you're getting the following:
str = 'lala @tweet boo #this &that @foo#bar'
print(re.findall(r'((@|#)\w+)', str))
# [('@tweet', '@'), ('#this', '#'), ('@foo', '@'), ('#bar', '#')]
As specified, since the pattern has more than one group, findall returns a list of tuples, one for each match. Each tuple gives you what was captured by each group for that match.
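If you did want to keep both groups for some reason, the whole tag is simply the first element of each tuple. A minimal sketch, reusing the sample string from above (the variable names `text`, `pairs`, and `tags` are mine, chosen to avoid shadowing the built-in `str`):

```python
import re

# Same sample string as above, renamed so it does not shadow the built-in str.
text = 'lala @tweet boo #this &that @foo#bar'

# Two capturing groups, so findall returns one (group 1, group 2) tuple per match.
pairs = re.findall(r'((@|#)\w+)', text)

# Keep only group 1 (the whole tag) and discard group 2 (the leading symbol).
tags = [whole for whole, symbol in pairs]
print(tags)
# ['@tweet', '#this', '@foo', '#bar']
```

This works, but it does extra capturing only to throw the result away, so dropping the capturing groups is usually the cleaner fix.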
The documentation also explains why you're getting the following:
print(re.findall(r'(@|#)\w+', str))
# ['@', '#', '@', '#']
Now the pattern only has one group, and findall returns a list of matches for that group.
In contrast, the patterns given above as solutions don't have any capturing groups, which is why they work according to your expectation:
print(re.findall(r'(?:@|#)\w+', str))
# ['@tweet', '#this', '@foo', '#bar']
print(re.findall(r'[@#]\w+', str))
# ['@tweet', '#this', '@foo', '#bar']
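As for running this 500,000 times: yes, the pattern can be compiled once with re.compile and the resulting object reused. Here is a rough sketch; the names `TAG_RE` and `find_tags` are hypothetical, made up for illustration. Keep in mind that the re module already caches recently compiled patterns, so the gain is probably modest; I'd measure it with timeit on your real data before relying on it.

```python
import re

# Compile once, outside the hot loop. TAG_RE is a hypothetical name.
TAG_RE = re.compile(r'[@#]\w+')

def find_tags(text):
    """Return every word in text that starts with @ or #."""
    # The compiled pattern object has its own findall method.
    return TAG_RE.findall(text)

print(find_tags('lala @tweet boo #this &that @foo#bar'))
# ['@tweet', '#this', '@foo', '#bar']
```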
Upvotes: 10