thatguy

Reputation: 837

Extracting one or more URLs from a string in PHP

I'm trying to extract one or more URLs from a plain-text string in PHP. Here are some examples:

"mydomain.com has hit the headlines again"

extract "http://www.mydomain.com"

"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"

extract "http://www.domain.com", "http://www.anotherdomain.co.uk", "http://www.thirddomain.net"

There are two special cases I need to handle. I'm thinking regex, but I don't fully understand it:
1) All symbols like '(' or ')' and spaces (excluding hyphens) need to be removed.
2) The word "dot" needs to be replaced with the symbol ".", so "dot com" would become ".com".

P.S. I'm aware of PHP validation/regex for URLs, but I can't work out how to use it to achieve the end goal.

Thanks

Upvotes: 0

Views: 227

Answers (1)

Ernest

Reputation: 8839

In this case it will be hard to get 100% correct results. Depending on the input, you could restrict matching to the most popular top-level domains (add more as needed):

(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b

The trailing word boundary (\b) stops a match from ending in the middle of a longer word (for example "example.community"); remove it if you want looser matching.

You can test it here:

http://bit.ly/dlrgzQ
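
As a rough sketch of how the pattern could be used from PHP: the function name extract_urls() is made up for illustration, and the "http://www." normalisation and the "i" flag are assumptions based on the examples in the question, not something the pattern itself requires.

```php
<?php
// Hypothetical helper, not part of the original answer.
function extract_urls(string $text): array
{
    // Same pattern as above, wrapped in ~ delimiters so "/" needs no escaping.
    // The "i" flag is an addition so upper-case input is matched too.
    $pattern = '~(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b~i';

    preg_match_all($pattern, $text, $matches);

    $urls = [];
    foreach ($matches[0] as $match) {
        // Strip any existing scheme and "www." so every result is rebuilt the same way.
        $host = preg_replace('~^(?:https?://)?(?:www\.)?~i', '', $match);
        $urls[] = 'http://www.' . strtolower($host);
    }

    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($urls));
}

$input = 'this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net';
print_r(extract_urls($input));
// http://www.domain.com, http://www.anotherdomain.co.uk, http://www.thirddomain.net
```

Anything with a top-level domain outside the list is silently skipped, which is the trade-off of this approach.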

EDIT: About your cases: 1) Remove from what? 2) This could be done in PHP like this:

 // Extend the TLD list as needed; \b keeps "dot commonly" from matching.
 $result = preg_replace('/\s+dot\s+(?=(?:com|org|net|biz|edu)\b)/i', '.', $input);
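
To see what the replacement above does and does not touch, here is a small sketch. The function name replace_spelled_dots() is hypothetical, the TLD list is only illustrative, and the "i" flag and trailing \b are assumptions added to avoid rewriting ordinary words.

```php
<?php
// Hypothetical helper name; extend the TLD list to suit your input.
function replace_spelled_dots(string $input): string
{
    // Replace " dot " with "." only when a listed TLD follows as a whole word.
    return preg_replace('/\s+dot\s+(?=(?:com|org|net|biz|edu)\b)/i', '.', $input);
}

echo replace_spelled_dots('visit mydomain dot com today'), "\n"; // visit mydomain.com today
echo replace_spelled_dots('connect the dot commonly'), "\n";     // connect the dot commonly
```

Run this before the URL pattern, so that "mydomain dot com" is already "mydomain.com" when matching starts.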

But I have a few important notes:

  • These regexes are meant as guidance, not production code.
  • Working with such loose rules on text is wacky, to say the least, and adding more special cases will make it even loonier. Consider this: even Stack Overflow doesn't do that. It auto-links

http://example.org

but not

example.org

  • It would be easier if you said what you are trying to achieve. If you want to process text that will later be published somewhere on the web, this is a very bad idea! You should not do it on your own (as you said, you don't fully understand regex), because it would open a can of XSS worms. Better to consider a markup language such as Markdown or BBCode.

Also take a look at: http://htmlpurifier.org/
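
If you do end up printing extracted URLs into a page yourself, the bare minimum is to escape them and allow only http/https. The sketch below uses only built-in PHP functions. The name render_link() is hypothetical, and this is an illustration of the XSS concern, not a substitute for a proper sanitising library.

```php
<?php
// Hypothetical helper, for illustration only.
function render_link(string $url): string
{
    // Escape first, so the value is safe whether or not it becomes a link.
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');

    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $isHttp = in_array($scheme, ['http', 'https'], true);

    // Anything that is not a valid http(s) URL is returned as plain escaped text.
    if (!$isHttp || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return $safe;
    }

    return '<a href="' . $safe . '" rel="nofollow">' . $safe . '</a>';
}

echo render_link('http://www.mydomain.com'), "\n";
echo render_link('javascript:alert(1)'), "\n"; // printed as text, never as a link
```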

Upvotes: 4
