GENE

Reputation: 193

Extract specific URLs using regular expressions

I want to extract URLs from a large amount of text using a regex. The URLs follow a specific pattern:

http://***/i/***/***
http://***/t/***/***

which means any link of this form:

( http://domaine.com/i/text/text ) 

or this form:

( http://domaine.com/t/text/text )

needs to be extracted.

What I have done so far is create this regular expression:

/https?:\/\/(.+?)\/[t|i]\/(.+?)\/(.+)/

It has worked well so far, but I feel it is too simple to be used in production, and I doubt it is robust enough for this particular situation.

So what I need is either a better regex or an improvement of this one, in case you see that it is not suitable for solving my issue.

Upvotes: 0

Views: 168

Answers (1)

Casimir et Hippolyte

Reputation: 89547

Your pattern isn't really bad. You can improve it depending on the context (in particular the amount of text, variations in the URL structure that you didn't fully describe in your question, and so on):

First thing: change the delimiters! That way you no longer have to escape every slash, and the pattern stays readable:

~https?://(.+?)/[t|i]/(.+?)/(.+)~

[t|i] means a t, a |, or an i; it doesn't mean "a t or an i". It's a character class, not a group:

~https?://(.+?)/[ti]/(.+?)/(.+)~
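
A quick way to see the difference, as a minimal Python sketch. Python's re module takes no delimiters, so the ~ is dropped here, and the URL with a literal | in it is made up purely for illustration:

```python
import re

# Pattern as first written: [t|i] is a character class containing t, | and i.
with_pipe = re.compile(r"https?://(.+?)/[t|i]/(.+?)/(.+)")
# Corrected class: only t or i.
fixed = re.compile(r"https?://(.+?)/[ti]/(.+?)/(.+)")

# Hypothetical URL whose first path segment is a literal "|".
odd_url = "http://example.com/|/foo/bar"
# Hypothetical URL of the wanted form.
good_url = "http://example.com/t/foo/bar"

pipe_matches_odd = with_pipe.search(odd_url) is not None
fixed_matches_odd = fixed.search(odd_url) is not None
fixed_matches_good = fixed.search(good_url) is not None

print(pipe_matches_odd, fixed_matches_odd, fixed_matches_good)
```

The first pattern accepts the odd URL because | is just another member of the class; the corrected one rejects it and still accepts the wanted form.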

If you don't need to capture or group anything, remove the groups:

~https?://.+?/[ti]/.+?/.+~

Non-greedy quantifiers with a dot are usually slower than a negated character class with a greedy quantifier. Also, a non-greedy quantifier with a dot doesn't stop the match from running across a slash, or across anything else: if the first URL on the line doesn't match /[ti]/[^/]+/.+ and another one later on the line does, the match stretches from the first URL to the second one:

~https?://[^/]+/[ti]/[^/]+/.+~

(If you are worried that [^/]+ might match a newline character, exclude it from the character class: [^/\n]+)
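
To illustrate the overrun, here is a small Python sketch (no delimiters in Python's re; the sample line and host names are invented). The first URL on the line does not have the /t/ or /i/ segment, the second one does:

```python
import re

# Hypothetical line: only the second URL has the wanted structure.
text = "see http://a.example/x/y and http://b.example/t/foo/bar"

lazy_dot = re.compile(r"https?://.+?/[ti]/.+?/.+")
negated_class = re.compile(r"https?://[^/]+/[ti]/[^/]+/.+")

# The lazy dot starts at the first "http" and keeps expanding,
# slashes and spaces included, until it finds "/t/" further along.
lazy_match = lazy_dot.search(text).group(0)

# [^/]+ cannot cross a slash, so the first URL fails and the
# search moves on to the second one.
negated_match = negated_class.search(text).group(0)

print(lazy_match)
print(negated_match)
```

With the lazy dot, the match begins at the first URL and swallows everything up to the end of the second; with the negated class, only the second URL is returned.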

Instead of the final .+, you should use \S+ (or something more restrictive, perhaps [^\s?/]+):

~https?://[^/]+/[ti]/[^/]+/\S+~
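
As a rough Python sketch of why the ending matters when there are several URLs on one line (the sample text is invented, and Python's re needs no delimiters):

```python
import re

# Hypothetical line with two wanted URLs separated by ordinary words.
text = "first http://example.com/i/foo/bar then http://example.org/t/baz/qux end"

# Final .+ runs to the end of the line.
greedy_tail = re.findall(r"https?://[^/]+/[ti]/[^/]+/.+", text)

# Final \S+ stops at the first whitespace character.
non_space_tail = re.findall(r"https?://[^/]+/[ti]/[^/]+/\S+", text)

print(greedy_tail)
print(non_space_tail)
```

The .+ version returns a single item running from the first URL to the end of the line, while the \S+ version returns the two URLs separately.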

To finish: it can sometimes be useful to start the pattern with a word boundary, both to ensure that http isn't the end of a larger word and because it quickly discards many impossible positions in the string. But when you do that, keep in mind that a large text contains more word boundaries than http substrings. You also need to know that, when the pattern starts with a literal substring like http, a fast search algorithm runs before the "normal" regex walk to select candidate positions in the string. If you put a word boundary before this literal substring, that fast algorithm isn't used. That's why, when the text is large, a good alternative to:

~\bhttps?://[^/]+/[ti]/[^/]+/\S+~

can be something like:

~http(?<=\bhttp)s?://[^/]+/[ti]/[^/]+/\S+~

which uses a lookbehind to check backward that the word boundary exists.
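
A small Python sketch to check that both forms select the same URLs (the sample text is made up; Python's re accepts this lookbehind because \bhttp has a fixed width). It only compares the results; whether the lookbehind form is actually faster depends on the regex engine and on your data, so measure it on your own text:

```python
import re

# Hypothetical sample: the first "http" is glued to a preceding letter.
text = "xhttp://example.com/t/a/b and http://example.com/i/c/d"

# No boundary check: also picks up the URL inside "xhttp...".
no_boundary = re.findall(r"https?://[^/]+/[ti]/[^/]+/\S+", text)

# Word boundary placed before the literal.
boundary_first = re.findall(r"\bhttps?://[^/]+/[ti]/[^/]+/\S+", text)

# Literal first, then a lookbehind verifying the boundary before "http".
lookbehind = re.findall(r"http(?<=\bhttp)s?://[^/]+/[ti]/[^/]+/\S+", text)

print(no_boundary)
print(boundary_first)
print(lookbehind)
```

Both boundary-aware patterns skip the URL embedded in xhttp and keep only the second one.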

Upvotes: 3
