Reputation: 568
i'm working on a regex that will extract retweet keywords and user names from tweets. here's an example, with a rather terrible regex to do the job:
tweet='foobar RT@one, @two: @three barfoo'
m=re.search(r'(RT|retweet|from|via)\b\W*@(\w+)\b\W*@(\w+)\b\W*@(\w+)\b\W*',tweet)
m.groups()
('RT', 'one', 'two', 'three')
what i'd like is to condense the repeated \b\W*@(\w+)\b\W* patterns and make them of a variable number, so that if @four were added after @three, it would also be extracted. i've tried many permutations to repeat this with a +, unsuccessfully.
i'd also like this to work for something like
tweet='foobar RT@one, RT @two: RT @three barfoo';
which can be achieved with a re.finditer if the patterns don't overlap. (i have a version where the patterns do overlap, and so only the first RT gets picked up.)
any help is greatly appreciated. thanks.
Upvotes: 1
Views: 3325
Reputation:
A regex that matches most retweet/repost/modified-post markers is the following (apply it case-insensitively, since the alternatives are written in lowercase):
\brt\b|#retweet|#modifiedpost|via @|\bmp\b|@\w*:|#rt|#mp|(rt|mt):? @\w*:?
Something like that should work to catch them.
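As a rough illustration of how such a pattern might be used from Python, here is a minimal sketch. The names RETWEET_MARKER and has_retweet_marker are made up for this example, and re.IGNORECASE is an assumption on my part so that the lowercase alternatives also match "RT".

```python
import re

# Alternation of common retweet markers. There is deliberately no empty
# alternative at the end: a trailing "|" would match the empty string
# at any position and make every tweet look like a retweet.
RETWEET_MARKER = re.compile(
    r'\brt\b|#retweet|#modifiedpost|via @|\bmp\b|@\w*:|#rt|#mp|(rt|mt):? @\w*:?',
    re.IGNORECASE,  # assumption: markers should match regardless of case
)

def has_retweet_marker(tweet):
    """Return True if the tweet contains any of the markers above."""
    return RETWEET_MARKER.search(tweet) is not None

print(has_retweet_marker('foobar RT@one, @two: @three barfoo'))  # True
print(has_retweet_marker('just a plain tweet'))                  # False
```

This only detects that a marker is present; it does not extract the user names.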
Other relevant posts are "Trying to find Twitter RT's with Regular Expressions and PHP" and "How to strip off beginning of retweet?".
Edit: added other responses
Upvotes: 0
Reputation: 21950
Try
(RT|retweet|from|via)(?:\b\W*@(\w+))+
Enclosing the \b\W*@(\w+) in (?:...) allows you to group the terms for repetition without capturing the aggregate.
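One thing to be aware of when trying this: in Python's re module, a capturing group inside a repetition keeps only the text from its last iteration. A quick check with the tweet from the question shows the effect (the variable names here are just for illustration):

```python
import re

tweet = 'foobar RT@one, @two: @three barfoo'

# The non-capturing group (?:...) is repeated with +, so the whole
# run of handles is matched, but the inner (\w+) is overwritten on
# each iteration and ends up holding only the final handle.
pattern = r'(RT|retweet|from|via)(?:\b\W*@(\w+))+'

m = re.search(pattern, tweet)
print(m.group(0))  # RT@one, @two: @three
print(m.groups())  # ('RT', 'three')
```

So the repetition itself works, but the individual handles still have to be pulled out in a second step.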
I'm not sure I'm following the second part of your question, but I think you may be looking for something involving a construct like:
(?:(?!RT|@).)
which will match any character that isn't an "@" or the start of "RT", again without capturing it.
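To make that construct concrete, here is a small sketch that repeats it with * to grab the text leading up to the next "RT" or "@". The helper name text_before_marker is hypothetical, and whether this is what the second part of the question needs is only a guess.

```python
import re

# Consume one character at a time, but only while the upcoming text
# is neither "RT" nor "@" (checked by the negative lookahead).
NOT_RT_OR_AT = re.compile(r'(?:(?!RT|@).)*')

def text_before_marker(tweet):
    """Return the leading part of the tweet before the first "RT" or "@"."""
    return NOT_RT_OR_AT.match(tweet).group(0)

print(repr(text_before_marker('foobar RT@one, @two')))  # 'foobar '
print(repr(text_before_marker('hello @x')))             # 'hello '
```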
In that case, how about:
(RT|retweet|from|via)((?:\b\W*@\w+)+)
and then post-process the second group with
re.split(r'@(\w+)' ,m.groups()[1])
to get the individual handles?
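One caveat with the split-based post-processing: because the split pattern contains a capturing group, re.split returns the captured handles interleaved with the text between them, so the list also contains separators and empty strings. A minimal sketch that uses re.findall for the second step instead, combined with re.finditer so that several retweet keywords in one tweet are handled, could look like this (extract_retweets and RETWEET are names invented for the example):

```python
import re

# Group 1: the retweet keyword. Group 2: the whole run of @handles
# that follows it, captured as one string.
RETWEET = re.compile(r'(RT|retweet|from|via)((?:\b\W*@\w+)+)')

def extract_retweets(tweet):
    """Return a list of (keyword, [handles]) pairs found in the tweet."""
    results = []
    for m in RETWEET.finditer(tweet):
        # Second pass: pull each handle out of the captured run.
        handles = re.findall(r'@(\w+)', m.group(2))
        results.append((m.group(1), handles))
    return results

print(extract_retweets('foobar RT@one, @two: @three barfoo'))
# [('RT', ['one', 'two', 'three'])]
print(extract_retweets('foobar RT@one, RT @two: RT @three barfoo'))
# [('RT', ['one']), ('RT', ['two']), ('RT', ['three'])]
```

Adding @four after @three in the first tweet needs no change to the pattern, since the run of handles is matched by the repeated group and split up afterwards.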
Upvotes: 3