tipu

Reputation: 9604

regex in python, can this be improved upon?

I have this piece of code that finds words that begin with @ or #,

p = re.findall(r'@\w+|#\w+', str)

Now what irks me about this is repeating \w+. I am sure there is a way to do something like

p = re.findall(r'(@|#)\w+', str)

I expected that to produce the same result, but it doesn't; it returns only the # and @ characters. How can the regex be changed so that I am not repeating the \w+? This code comes close,

p = re.findall(r'((@|#)\w+)', str)

But it returns [('@many', '@'), ('@this', '@'), ('#tweet', '#')] (notice the extra '@', '@', and '#').

Also, if I'm running this re.findall call 500,000 times, can the regex be compiled into a pattern object to make it faster?

Upvotes: 2

Views: 265

Answers (1)

polygenelubricants

Reputation: 383746

The solution

You have two options:

  • Use non-capturing group: (?:@|#)\w+
  • Or even better, a character class: [@#]\w+

Understanding findall

The problem you were having is due to how findall returns matches, which depends on how many capturing groups are present in the pattern.

Let's take a closer look at this pattern (annotated to show the groups):

((@|#)\w+)
|\___/   |
|group 2 |     # Read about groups to understand
\________/     # how they're defined and numbered/named
 group 1

Capturing groups let us save what each subpattern matched, in addition to the overall match.

import re

p = re.compile(r'((@|#)\w+)')
m = p.match('@tweet')
print(m.group(1))
# @tweet
print(m.group(2))
# @

Now let's take a look at the Python documentation for the re module:

findall: Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

This explains why you're getting the following:

str = 'lala @tweet boo #this &that @foo#bar'

print(re.findall(r'((@|#)\w+)', str))
# [('@tweet', '@'), ('#this', '#'), ('@foo', '@'), ('#bar', '#')]

As specified, since the pattern has more than one group, findall returns a list of tuples, one for each match. Each tuple gives you what was captured by each group for that match.
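If you do want to keep the capturing groups, the full match can still be recovered. The following is only a sketch of two ways to do that, not something from the original question: the variable names (text, pairs, tags, tags_via_iter) are made up for illustration, and the sample string is the one used above, renamed so it does not shadow the built-in str.

```python
import re

# Same sample text as above, under a name that does not shadow the built-in str.
text = 'lala @tweet boo #this &that @foo#bar'

# With two groups, findall yields (group 1, group 2) tuples.
pairs = re.findall(r'((@|#)\w+)', text)

# Keep group 1 (the whole tag) and drop group 2 (the prefix character).
tags = [whole for whole, prefix in pairs]
print(tags)

# finditer yields match objects; group(0) is always the entire match,
# no matter how many capturing groups the pattern has.
tags_via_iter = [m.group(0) for m in re.finditer(r'(@|#)\w+', text)]
print(tags_via_iter)
```

Both lists should contain the whole tags rather than the prefixes alone.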

The documentation also explains why you're getting the following:

print(re.findall(r'(@|#)\w+', str))
# ['@', '#', '@', '#']

Now the pattern has only one group, so findall returns a list of what that single group captured.

In contrast, the patterns given above as solutions don't have any capturing groups, which is why they work the way you expected:

print(re.findall(r'(?:@|#)\w+', str))
# ['@tweet', '#this', '@foo', '#bar']

print(re.findall(r'[@#]\w+', str))
# ['@tweet', '#this', '@foo', '#bar']
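As for the question about running this 500,000 times: a pattern can be compiled once with re.compile and then reused. Below is a minimal sketch, assuming the character-class pattern from above; TAG and find_tags are hypothetical names chosen for illustration. Note that the re module already keeps an internal cache of recently compiled patterns, so the saving is mostly the per-call lookup, and it is worth measuring (for example with timeit) rather than assuming a large speedup.

```python
import re

# Compile once, outside any loop. TAG is a hypothetical name.
TAG = re.compile(r'[@#]\w+')

def find_tags(text):
    """Return every @word or #word found in text."""
    # Pattern objects have their own findall, so no pattern argument is needed.
    return TAG.findall(text)

print(find_tags('lala @tweet boo #this &that @foo#bar'))
```

Calling the method on the compiled pattern returns the same list as the re.findall calls above.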


Upvotes: 10
