Reputation: 13522

How can I find twitter profile links with a regex?

I want to parse HTML documents for links to Twitter profiles using a regex and preg_match_all() in PHP. The Twitter links are in this form:

http(s)://twitter.com/#!/twitter_name

I only want to grab links that point purely to the profile page (e.g., nothing after the twitter_name).

I would like to handle both http and https (both are common in these links).

I would also like to handle //www.twitter.com and //twitter.com (also common).

How should I structure my regex?

Upvotes: 1

Answers (4)

Reputation: 91608

How about something like:

(https?:)?\/\/(www\.)?twitter\.com\/#!\/([A-Za-z0-9_]+)

I'm not sure which characters are valid in a Twitter handle, but I'm assuming 0-9, letters and underscores.

Probably best to run it in case-insensitive mode and get rid of the A-Z as well.
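A minimal sketch of the case-insensitive variant in PHP, assuming handles consist only of letters, digits and underscores as guessed above. The sample HTML and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample HTML; the handles and links are invented for this sketch.
$html = '<a href="http://twitter.com/#!/alice">a</a> '
      . '<a href="https://www.twitter.com/#!/Bob_99">b</a> '
      . '<a href="//twitter.com/#!/carol">c</a>';

// "~" as the delimiter avoids escaping "/".
// The "i" flag lets [a-z] match upper-case letters too, so A-Z is not needed.
// The scheme and the "www." prefix are both optional.
$pattern = '~(?:https?:)?//(?:www\.)?twitter\.com/#!/([a-z0-9_]+)~i';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured handles, in document order.
print_r($matches[1]);
```

Note that this still matches the start of a longer URL such as .../#!/alice/status/1, so it does not by itself enforce the "nothing after the name" requirement.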

Upvotes: 2

Reputation: 42458

Try the following:

preg_match_all('~https?://(?:www\.)?twitter\.com/#!/([a-z0-9_]+)~i', $html, $matches);

$matches[1] contains the matching user names.

EDIT: For more information on what characters can appear in the user name, see this answer and for more general info see this Twitter Engineering page.
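The question also asks for links with nothing after the user name. One possible way to approximate that is a negative lookahead after the capture group. This is only a sketch under the assumption that a profile link is never followed directly by "/", "?" or further handle characters; the sample HTML is hypothetical.

```php
<?php
// Hypothetical input: one plain profile link and one link to a status page.
$html = '<a href="https://twitter.com/#!/alice">profile</a> '
      . '<a href="https://twitter.com/#!/bob/status/123">status</a>';

// (?![a-z0-9_/?]) rejects a candidate that is followed by more handle
// characters, a further path segment, or a query string. Including the
// handle characters stops the engine from backtracking to a shorter name.
$pattern = '~(?:https?:)?//(?:www\.)?twitter\.com/#!/([a-z0-9_]+)(?![a-z0-9_/?])~i';

preg_match_all($pattern, $html, $matches);

// Only the plain profile link should be captured.
print_r($matches[1]);
```

A side effect of this choice is that a profile link with a trailing slash is rejected as well; whether that is acceptable depends on the documents being parsed.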

Upvotes: 1

Reputation: 23770

Most general regex (that stops at "/" or space):

(https?:)?\/\/(www\.)?twitter\.com\/(#!\/)?([^\/ ]+)
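A rough PHP sketch of how a pattern of this kind behaves, assuming the intent is "one or more characters other than / and space" for the name. The input text is invented, and the name lands in the fourth capture group because the scheme, "www." and "#!/" parts each use a capturing group.

```php
<?php
// Hypothetical plain-text input with both the "#!" form and the bare form.
$text = 'see http://twitter.com/#!/alice and //www.twitter.com/bob/lists today';

// The name is everything up to the next "/" or space.
$pattern = '~(https?:)?//(www\.)?twitter\.com/(#!/)?([^/ ]+)~';

preg_match_all($pattern, $text, $matches);

// Group 4 holds the names; note that "bob" is captured even though
// the link continues with "/lists".
print_r($matches[4]);
```

Because the name class only excludes "/" and space, running this over raw HTML would also swallow a closing quote or angle bracket, so it suits plain text better than markup.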

Upvotes: 2

Reputation: 3181

Try

preg_match_all('|https?://(?:www\.)?twitter\.com/#!/([a-z0-9_]+)|i', $text, $matched);

I don't know exactly which characters are allowed in a Twitter username, so I assumed [a-z0-9_]+. $matched[1] contains the usernames.

Upvotes: 1