Reputation: 13522
I want to parse HTML documents for links to Twitter profiles using a regex and preg_match_all() in PHP. The Twitter links are in this form:
http(s)://twitter.com/#!/twitter_name
I only want to grab links that point purely to the profile page (e.g. nothing after the twitter_name).
I would like to handle both http and https (both are common in these links).
I would also like to handle //www.twitter.com and //twitter.com (also common).
How should I structure my regex?
Upvotes: 1
Views: 1174
Reputation: 91608
How about something like:
(https?:)?\/\/(www\.)?twitter\.com\/#!\/([A-Za-z0-9_]+)
I'm not sure which characters are valid in a Twitter handle, but I'm assuming letters, 0-9 and underscores.
Probably best to run it in case-insensitive mode and get rid of the A-Z as well.
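A minimal sketch of how this could look with preg_match_all() in case-insensitive mode. The sample HTML and handles are made up for illustration, and the trailing lookahead is my own assumption about how to enforce "nothing after the twitter_name"; it is not something confirmed by Twitter's rules.

```php
<?php
// Hypothetical sample input; the URLs and handles are invented.
$html = '<a href="https://twitter.com/#!/alice">a</a> '
      . '<a href="//www.twitter.com/#!/Bob_42">b</a> '
      . '<a href="http://twitter.com/#!/carol/status/1">c</a>';

// The "i" flag lets [a-z] match upper case too, so A-Z can be dropped.
// (?:https?:)? makes the scheme optional, which covers //twitter.com links.
// Assumption: (?![\w/]) rejects a link that continues with more name
// characters or with a further path segment after the handle.
$pattern = '~(?:https?:)?//(?:www\.)?twitter\.com/#!/([a-z0-9_]+)(?![\w/])~i';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured handles: alice and Bob_42.
print_r($matches[1]);
```

The third link is skipped because the handle is followed by "/". Note that this also skips a profile link written with a trailing slash, so loosen the lookahead if those should count.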
Upvotes: 2
Reputation: 42458
Try the following:
preg_match_all('~https?://(?:www\.)?twitter\.com/#!/([a-z0-9_]+)~im', $html, $matches);
$matches[1] contains the matching user names.
EDIT: For more information on what characters can appear in the user name, see this answer and for more general info see this Twitter Engineering page.
Upvotes: 1
Reputation: 23770
Most general regex (that stops at "/" or space):
(https?:)?\/\/(www\.)?twitter\.com\/(#!\/)?([^\/ ]+)
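A rough sketch of this idea in PHP, with the character class repeated as a whole so the name really ends at the first "/" or space. The input text and handles are hypothetical, and I use "~" as the delimiter so the slashes need no escaping.

```php
<?php
// Hypothetical plain-text input; the handles are invented for illustration.
$text = 'see //twitter.com/alice and https://www.twitter.com/#!/bob/lists today';

// [^/ ]+ repeats the whole character class, so the capture ends at the
// first "/" or space. Groups: 1 = scheme, 2 = "www.", 3 = "#!/", 4 = name.
$pattern = '~(https?:)?//(www\.)?twitter\.com/(#!/)?([^/ ]+)~';

preg_match_all($pattern, $text, $matches);

// $matches[4] holds the names: alice and bob.
print_r($matches[4]);
```

Note that bob is returned even though that link continues with /lists: stopping at "/" trims the match, it does not reject the link. In HTML the name is usually followed by a quote rather than a space, so the class would need to exclude quotes as well.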
Upvotes: 2
Reputation: 3181
Try
preg_match_all('|https?://(?:www\.)?twitter\.com/#!/([a-z0-9_]+)|im', $text, $matched);
I don't know exactly which characters can appear in a Twitter username, so I assumed [a-z0-9_]+. $matched[1] should hold the usernames.
Upvotes: 1