Registered User

Reputation: 9006

Regular expression for extracting certain URLs?

I've already tried my best but regular expressions aren't really my thing. :(

I need to extract certain URLs that end in a certain file extension. For example, I want to be able to parse a large paragraph and extract all URLs that end with *.txt. So for example,

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla hendrerit aliquet erat at ultrices. Donec eu nunc nec nibh http://www.somesite.com/somefolder/blahblah/etc/something.txt iaculis dictum. Quisque nisi neque, vulputate quis pellentesque blandit, faucibus eget nisl.

I need to be able to pull http://www.somesite.com/somefolder/blahblah/etc/something.txt out of the paragraph above, but the number of URLs to extract will vary, depending on what the user inputs. The text might contain 3 links that end with *.txt and 3 links that don't; I only need to extract the ones that do end in *.txt. Can anyone give me the code I need for this?

Upvotes: 0

Views: 87

Answers (3)

Toto

Reputation: 91385

How about:

$str = 'Lorem ipsum dolor sit amet. Donec eu nunc nec nibh http://www.somesite.com/somefolder/blahblah/etc/something.txt. Lorem ipsum dolor sit amet. Donec eu nunc nec nibh http://www.somesite.com/somefolder/blahblah/etc/something.doc.';
preg_match_all('#\b(http://\S+\.txt)\b#', $str, $m);

explanation:

#             : regex delimiter
\b            : word boundary
(             : begin capture group
http://       : literal http://
\S+           : one or more non-whitespace characters
\.            : a dot
txt           : literal txt
)             : end capture group
\b            : word boundary
#             : regex delimiter
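As a quick check of the pattern above, here is a minimal sketch that runs it on the same sample string and prints the captured URLs. It assumes only `http://` links matter (an `https://` link would not match) and relies on the trailing `\b` to drop the sentence-ending period after `.txt`.

```php
<?php
// Sample text from the answer above: one URL ends in .txt, the other in .doc.
$str = 'Lorem ipsum dolor sit amet. Donec eu nunc nec nibh http://www.somesite.com/somefolder/blahblah/etc/something.txt. Lorem ipsum dolor sit amet. Donec eu nunc nec nibh http://www.somesite.com/somefolder/blahblah/etc/something.doc.';

preg_match_all('#\b(http://\S+\.txt)\b#', $str, $m);

// $m[0] holds the full matches, $m[1] the contents of the first capture group.
foreach ($m[1] as $url) {
    echo $url, PHP_EOL;
}
```

Only the `.txt` URL should be printed, without the period that follows it in the sentence.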

Upvotes: 0

RickN

Reputation: 13500

Assuming these are all proper URLs, then they won't have any spaces in them. We can take advantage of that fact to make the regular expression really simple:

preg_match_all("/([^ ]+\.(txt|doc))/i", $text, $matches);
//   ([^ ]+     Match anything, except for a space.
//   \.         A normal period.
//   (txt|doc)  The word "txt" or "doc".
//   )/i        Case insensitive (so TXT and TxT also work)

If you don't need to match multiple file extensions, then you can change "(txt|doc)" to "txt".

$matches will contain several arrays; you'll want key 0 (the full matches) or key 1 (the first capture group), which hold the same strings here. To make the array easier to read, you can use a named group:

preg_match_all("/(?P<matched_urls>[^ ]+\.(txt|doc))/i", $text, $matches);

This will make $matches look something like this:

array([0] => array(), ["matched_urls"] => array(), [1] => array(), [2] => array());

It should then be obvious which key you need.
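To illustrate reading the named key, here is a small sketch. The sample text and the example.com URLs are made up for the illustration; the pattern is the one from this answer, written in single quotes.

```php
<?php
// Hypothetical sample text: the example.com URLs are assumptions, not from the question.
$text = 'See http://example.com/a.txt and http://example.com/b.DOC but not http://example.com/c.png here.';

preg_match_all('/(?P<matched_urls>[^ ]+\.(txt|doc))/i', $text, $matches);

// The named key holds the same strings as key 1 (and, here, key 0).
$urls = $matches['matched_urls'];
print_r($urls);
```

Note that `[^ ]` only excludes the space character, so a URL that directly follows a tab or a line break would be matched together with the text before it; `\S+` avoids that if the input can span several lines.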

Upvotes: 0

SteeveDroz

Reputation: 6136

You can find what you need with /(?<=\s)http:\/\/\S+\.txt(?=\s)/

Which means:

  • A space/tab/new line before.
  • http://
  • one or more non-space characters.
  • .txt
  • A space/tab/new line after.
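One consequence of requiring whitespace on both sides is that a URL at the very start or very end of the text is skipped. The sketch below compares the pattern above with a variant that uses negative lookarounds instead ("not next to a non-whitespace character"); the variant and the sample text are assumptions added for illustration, not part of the original answer.

```php
<?php
// Hypothetical sample: one URL at the very start and one at the very end.
$text = 'http://a.example/one.txt middle http://a.example/two.txt';

// Pattern from the answer: needs a whitespace character on both sides.
$strictCount = preg_match_all('/(?<=\s)http:\/\/\S+\.txt(?=\s)/', $text, $strict);

// Variant: also accepts the start and end of the string as boundaries.
$looseCount = preg_match_all('/(?<!\S)http:\/\/\S+\.txt(?!\S)/', $text, $loose);

echo "strict: $strictCount, loose: $looseCount", PHP_EOL;
print_r($loose[0]);
```

Neither form accepts a URL that is immediately followed by punctuation, such as a sentence-ending period, so the choice depends on how the input text is written.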

Upvotes: 1
