jhofman

Reputation: 568

python regular expression for retweets

i'm working on a regex that will extract retweet keywords and user names from tweets. here's an example, with a rather terrible regex to do the job:

import re
tweet='foobar RT@one, @two: @three barfoo'
m=re.search(r'(RT|retweet|from|via)\b\W*@(\w+)\b\W*@(\w+)\b\W*@(\w+)\b\W*',tweet)
m.groups()
('RT', 'one', 'two', 'three')

what i'd like is to condense the repeated \b\W*@(\w+)\b\W* patterns into a single one that matches a variable number of user names, so that if @four were added after @three, it would also be extracted. i've tried many ways of repeating the group with a +, without success.

i'd also like this to work for something like

tweet='foobar RT@one, RT @two: RT @three barfoo'

which can be achieved with a re.finditer if the patterns don't overlap. (i have a version where the patterns do overlap, and so only the first RT gets picked up.)

any help is greatly appreciated. thanks.

Upvotes: 1

Views: 3325

Answers (2)

user23236521


A regex that matches most retweet/repost/modified-post markers is the following:

\brt\b|#retweet|#modifiedpost|via @|\bmp\b|@\w*:|#rt|#mp|(rt|mt):? @\w*:?

Something like that should work to catch them. Other relevant posts are "Trying to find Twitter RT's with Regular Expressions and PHP" and "How to strip off beginning of retweet?"

Edit: added other responses
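For what it's worth, here is a rough sketch of how a pattern like this could be used from Python. The function name looks_like_retweet is made up for illustration, and the re.IGNORECASE flag is an assumption on my part, since the alternatives are written in lowercase. The pattern is written without a trailing |, because an empty alternative would match at every position.

```python
import re

# Assumption: case-insensitive matching, since the alternatives are lowercase.
RETWEET_MARKER = re.compile(
    r'\brt\b|#retweet|#modifiedpost|via @|\bmp\b|@\w*:|#rt|#mp|(rt|mt):? @\w*:?',
    re.IGNORECASE,
)

def looks_like_retweet(tweet):
    """Return True if the tweet contains any of the retweet-style markers."""
    return RETWEET_MARKER.search(tweet) is not None

print(looks_like_retweet('foobar RT@one, @two: @three barfoo'))
print(looks_like_retweet('just a normal tweet'))
```

Note that this only detects a marker; it does not pull out the user names the question asks about.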

Upvotes: 0

MarkusQ

Reputation: 21950

Try

(RT|retweet|from|via)(?:\b\W*@(\w+))+

Enclosing the \b\W*@(\w+) in (?:...) allows you to group the terms for repetition without capturing the aggregate.
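One thing worth checking when trying this in Python: the whole run of handles is matched, but as far as I know Python's re module keeps only the last repetition of a capturing group that sits inside a repeated group. A small sketch using the example tweet from the question:

```python
import re

tweet = 'foobar RT@one, @two: @three barfoo'
pattern = r'(RT|retweet|from|via)(?:\b\W*@(\w+))+'

m = re.search(pattern, tweet)
print(m.group(0))  # whole matched span: RT@one, @two: @three
print(m.groups())  # ('RT', 'three'): group 2 holds only the last handle
```

So the repetition itself works, but the individual handles need a second step to recover.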

I'm not sure I'm following the second part of your question, but I think you may be looking for something involving a construct like:

(?:(?!RT|@).)

which will match any character that isn't an "@" or the start of "RT", again without capturing it.
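As a quick illustration of what that construct does when repeated, here is a minimal sketch (the name filler is just for this example). The negative lookahead is tested before each character, so the repetition stops right before an "@" or an "RT":

```python
import re

# Repeat the construct: consume characters until "RT" or "@" is next.
filler = re.compile(r'(?:(?!RT|@).)*')

print(repr(filler.match('foobar RT@one, @two').group()))  # 'foobar '
print(repr(filler.match('hello @world').group()))         # 'hello '
```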

In that case, how about:

(RT|retweet|from|via)((?:\b\W*@\w+)+)

and then post-process with

re.split(r'@(\w+)', m.groups()[1])

to get the individual handles?
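Here is a sketch of that two-step idea, combined with re.finditer so that several retweet markers in one tweet are handled too. Two caveats: the function name extract_retweets is hypothetical, and I have swapped re.split for re.findall in the second step, because re.split with a capturing group also returns the text between the handles (for '@one, @two: @three' it gives ['', 'one', ', ', 'two', ': ', 'three', '']).

```python
import re

# Step 1: keyword plus the whole run of handles as a single group.
RETWEET = re.compile(r'(RT|retweet|from|via)((?:\b\W*@\w+)+)')
# Step 2: pull the individual handles out of that run.
HANDLE = re.compile(r'@(\w+)')

def extract_retweets(tweet):
    """Return a list of (keyword, [handles]) pairs, one per marker found."""
    return [
        (m.group(1), HANDLE.findall(m.group(2)))
        for m in RETWEET.finditer(tweet)
    ]

print(extract_retweets('foobar RT@one, @two: @three barfoo'))
# [('RT', ['one', 'two', 'three'])]
print(extract_retweets('foobar RT@one, RT @two: RT @three barfoo'))
# [('RT', ['one']), ('RT', ['two']), ('RT', ['three'])]
```

Because each match stops at the last handle it can reach, the matches do not overlap, which is what finditer needs for the second example.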

Upvotes: 3
