adrian Coye

Reputation: 173

The best regex to parse Twitter #hashtags and @users

Here is what I quickly came up with. It works with RegexKitLite on the iPhone:

#define kUserRegex @"((?:@){1}[0-9a-zA-Z_]{1,15})"

Twitter usernames may contain only letters, digits, and underscores, with a maximum of 15 characters (not counting the @). My regex seems fine, but it reports false positives on e-mail addresses.

#define kHashtagRegex @"((?:#){1}[0-9a-zA-Z_àáâãäåçèéêëìíîïðòóôõöùúûüýÿ]{1,140})"

kHashtagRegex works with accented words, but it does not cover words written in other (non-Latin) Unicode scripts. What is the 'tech spec' of a hashtag?

Is there a reference somewhere on what to use for parsing these? Or do you have advice on how to enhance this regex?

Upvotes: 2

Views: 2109

Answers (2)

Mob

Reputation: 11098

REGEX_HASHTAG = '/(^|[^0-9A-Z&\/\?]+)([#＃]+)([0-9A-Z_]*[A-Z_]+[a-z0-9_üÀ-ÖØ-öø-ÿ]*)/iu';

Upvotes: 0

Jacob Eggers

Reputation: 9322

I'm not sure if this is complete, but this is what I would do:

For the username, add a check for whitespace or start of string before the @ to eliminate e-mail addresses, (?:^|\s):

#define kUserRegex @"((?:^|\\s)(?:@){1}[0-9a-zA-Z_]{1,15})"
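
To sanity-check the whitespace/start-of-string idea outside of Xcode, here is a minimal Python sketch of the same pattern. The function name find_mentions and the sample text are made up for illustration, and the trailing lookahead is an extra assumption that is not in the regex above: it rejects names longer than 15 characters instead of silently truncating them.

```python
import re

# Same idea as kUserRegex: '@' must follow start-of-string or whitespace,
# so the '@' inside an e-mail address is skipped.
# Assumption (not in the original regex): the lookahead rejects names that
# run past 15 characters rather than matching their first 15 characters.
USER_RE = re.compile(r"(?:^|\s)@([0-9A-Za-z_]{1,15})(?![0-9A-Za-z_])")

def find_mentions(text):
    """Return the usernames (without the '@') mentioned in text."""
    return USER_RE.findall(text)

sample = "@alice mail bob@example.com or ping @carol_99"
print(find_mentions(sample))  # ['alice', 'carol_99']
```

Note that the leading whitespace is part of the overall match, so when highlighting in a UI you would probably want the range of the capture group rather than the whole match.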

For the hashtags, I would just use \w or \d:

#define kHashtagRegex @"((?:#){1}[\\w\\d]{1,140})"
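
As a rough illustration of why \w helps with non-ASCII hashtags, here is a small Python sketch. It relies on Python 3's \w being Unicode-aware for str patterns; whether your iOS regex engine treats \w the same way is something to verify rather than assume. The function name find_hashtags and the sample text are hypothetical.

```python
import re

# In Python 3, \w on str patterns matches Unicode letters, digits and '_',
# so a separate \d is redundant here.
HASHTAG_RE = re.compile(r"#(\w{1,140})")

def find_hashtags(text):
    """Return the hashtags (without the '#') found in text."""
    return HASHTAG_RE.findall(text)

sample = "Loving #caf\u00e9 and #\u65e5\u672c\u8a9e at #WWDC_2011!"
print(find_hashtags(sample))
```

This simple form also accepts all-digit tags such as #123 and a # in the middle of a word, so it is looser than the pattern in the other answer.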

Upvotes: 4
