user311509

Reputation: 2866

Fetch All URLs from a Page using Regex

Original format:

<a href="http://www.example.com/t434234.html" ...>

I need to fetch all URLs of this format:

http://www.example.com/t[ANY CHARACTER].html

[ANY CHARACTER] is the part that changes from one URL to another. The rest is fixed.

Here is my attempt:

preg_match("#http:\/\/www\.aqarcity\.com\/t[a-zA-Z0-9_]\.html#", $page, $urls);

I get empty results. I don't know where I went wrong.

Upvotes: 0

Views: 222

Answers (1)

Becca Royal-Gordon

Reputation: 17861

The problem appears to be that [a-zA-Z0-9_] will only match exactly one character. If you want to match zero or more characters, use [a-zA-Z0-9_]*. For one or more, use [a-zA-Z0-9_]+. For exactly six characters, use [a-zA-Z0-9_]{6}. For a range, e.g. one to six characters, use [a-zA-Z0-9_]{1,6}.
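As a minimal sketch of the "one or more" variant, the snippet below runs the pattern against a small hypothetical sample page. The sample HTML and the example.com host are assumptions for illustration, not the asker's real page. It also uses preg_match_all() rather than preg_match(), since preg_match() stops after the first match and the goal is to collect every URL.

```php
<?php
// Hypothetical sample input; in the question, $page holds the fetched HTML.
$page = '<a href="http://www.example.com/t434234.html">one</a> '
      . '<a href="http://www.example.com/t99_b.html">two</a> '
      . '<a href="http://www.example.com/about.html">other</a>';

// "+" after the character class means "one or more" of those characters.
$pattern = '#http://www\.example\.com/t[a-zA-Z0-9_]+\.html#';

// preg_match() would stop at the first match; preg_match_all() collects all of them.
preg_match_all($pattern, $page, $matches);

// With the default flags, $matches[0] holds the full text of every match.
$urls = $matches[0];

print_r($urls);
```

With this sample input, $urls should contain the two t….html addresses and skip about.html, because the path must start with "t".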

Also note that, since you're using # as the delimiter, you don't need to escape the / characters. As far as I know this will not make your code misbehave, but it'll be easier to read if you remove the backslashes before the slashes.
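To illustrate the point about delimiters, here is a small sketch comparing the escaped and unescaped forms of the pattern. The test URL and the variable names are made up for the example, and the "+" quantifier is assumed so that the pattern can match a multi-character ID.

```php
<?php
// Hypothetical URL to test against.
$url = 'http://www.example.com/t434234.html';

// With "#" as the delimiter, "/" has no special meaning, so "\/" and "/" are equivalent.
$escaped   = '#http:\/\/www\.example\.com\/t[a-zA-Z0-9_]+\.html#';
$unescaped = '#http://www\.example\.com/t[a-zA-Z0-9_]+\.html#';

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$matchesEscaped   = preg_match($escaped, $url);
$matchesUnescaped = preg_match($unescaped, $url);

var_dump($matchesEscaped, $matchesUnescaped);
```

Both calls should report a match. The dots still need their backslashes, because an unescaped "." matches any character.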

Finally, please realize that regular expressions are a rather dangerous way to work with HTML. In this case, you may pick up matching URLs from comments, JavaScript code, and other things that aren't links. It is literally impossible to correctly parse HTML with unaugmented regular expressions, because they don't have the expressive power necessary to do so. I don't know what sorts of HTML parsers are available for PHP, but you may want to look into them.
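One parser that ships with PHP is the DOM extension (DOMDocument). The following is only a sketch of how it might be combined with the regex, assuming the extension is enabled. The sample HTML is hypothetical and includes a commented-out link to show the difference from a plain regex scan.

```php
<?php
// Hypothetical sample input; the first link sits inside an HTML comment.
$page = '<!-- <a href="http://www.example.com/t111.html">commented out</a> -->'
      . '<a href="http://www.example.com/t434234.html">real link</a>'
      . '<a href="http://www.example.com/about.html">other</a>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();

$urls = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Anchored with ^ and \z so the whole attribute value must match.
    if (preg_match('#^http://www\.example\.com/t[a-zA-Z0-9_]+\.html\z#', $href)) {
        $urls[] = $href;
    }
}

print_r($urls);
```

Here the regex only checks the href values of actual a elements, so the link inside the comment should be ignored and only the t434234.html URL kept.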

Upvotes: 1
