Reputation: 155

How to remove a part of a URL with regexes?

How can I turn this:

http://site.com/index.php?id=15

Into this?:

http://site.com/index.php?id=

Which RegEx(s) do I use?

I've been trying to do this for a good two hours now with no luck. I can't seem to take out the number(s) at the end, and sometimes there are letters at the end as well, which give me problems.

I am using Bing! instead of Google.

My RegEx so far is this when I search something:

$start = '<h3><a href="';
$end = '" onmousedown=';

while ($result =~ m/$start(.*?)$end/g)

What can I add in there to take out the letters and digits at the end and leave just the equals sign?

Thank you.

Upvotes: 0

Answers (4)

yb007

Reputation: 1377

How can I turn this:

http://site.com/index.php?id=15

Into this?:

http://site.com/index.php?id=

I think this is the solution you are looking for:

#!/usr/bin/perl
use strict;
use warnings;
my $url = "http://site.com/index.php?id=15";
# Lookbehind: delete everything that follows "id=".
$url =~ s/(?<=id=).*//;
print "$url\n";

Output :

http://site.com/index.php?id=

This removes everything after id= from the URL, including any further parameters that follow it.
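
To see how that substitution behaves on slightly different inputs, here is a small sketch. The helper name strip_after_id and the sample URLs are made up for illustration. Because the lookbehind only checks for the literal text "id=", a parameter such as "uid=" would also trigger it.

```perl
use strict;
use warnings;

# Hypothetical helper: remove everything that follows "id=" in a URL.
sub strip_after_id {
    my ($url) = @_;
    # (?<=id=) asserts that "id=" precedes the match without consuming it,
    # so only the text after it is deleted.
    $url =~ s/(?<=id=).*//;
    return $url;
}

# Sample URLs are assumptions, not taken from real search results.
for my $url (
    'http://site.com/index.php?id=15',
    'http://site.com/index.php?id=abc9',
    'http://site.com/index.php?id=15&page=2',
) {
    print strip_after_id($url), "\n";
}
```

All three inputs come out as http://site.com/index.php?id= since the third one loses its page parameter too.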

Upvotes: 0

David W.

Reputation: 107090

I'm not 100% sure what you are doing, but this is the problem:

while ($result =~ m/$start(.*?)$end/g)

What's the purpose of this loop? You're taking a scalar called $result and checking for a pattern match. How is $result changing?

Your original question was how to make this:

http://site.com/index.php?id=15

into this:

http://site.com/index.php?id=

That is, how do you remove the 15 (or another number) from the expression. The answer is pretty simple:

$url =~ s/=\d+$/=/;

That anchors the regular expression at the end of the URL and replaces the trailing digits with nothing, keeping the equals sign.

If you're removing any string, it's a bit more complex:

$url =~ s/=[^=]+/=/;

You can't simply use \S+ because regular expressions are normally greedy. Instead, match an equals sign followed by a run of characters that are not equals signs.
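
One caveat with =[^=]+ is that, on a URL with several parameters, it starts at the first equals sign it meets, which may not be the one you want. A possible variant, assuming the parameter is always named id, anchors on the name and stops at the next & or #. The helper name blank_id is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: blank the value of the "id" parameter only.
sub blank_id {
    my ($url) = @_;
    # Capture "?id=" or "&id=", then drop the value up to the next "&" or "#".
    $url =~ s/([?&]id=)[^&#]*/$1/;
    return $url;
}

print blank_id('http://site.com/index.php?id=15'), "\n";
print blank_id('http://site.com/index.php?id=ab12&page=2'), "\n";
```

The second call leaves the page parameter in place and only empties the id value.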

Now, as for the while loop, maybe you want an if statement instead...

if ($result =~ /$start(.*?)$end/) {
    print "Doing something if this matched\n";
}
else {
    print "Doing something if there's no match\n";
}
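
If the intent is instead to visit every link in the page, a while loop with /g does that: in scalar context each successful match resumes where the previous one ended. A minimal sketch, with a made-up $result string standing in for the fetched HTML, and \Q...\E added so the delimiters are matched literally:

```perl
use strict;
use warnings;

my $start = '<h3><a href="';
my $end   = '" onmousedown=';

# Made-up HTML fragment standing in for the fetched page.
my $result = '<h3><a href="http://site.com/index.php?id=15" onmousedown="x">A</a></h3>'
           . '<h3><a href="http://site.com/view.php?id=ab7" onmousedown="y">B</a></h3>';

my @links;
# In scalar context, /g continues from the end of the previous match,
# so the body runs once per link and the loop ends when none are left.
while ($result =~ m/\Q$start\E(.*?)\Q$end\E/g) {
    my $url = $1;           # copy $1 before running another regex
    $url =~ s/=[^=]+$/=/;   # drop the trailing value, keep the "="
    push @links, $url;
}

print "$_\n" for @links;
```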

And, I'm not sure what this means:

I am using Bing! instead of Google.

Are you trying to parse the input from Bing!? If so, please explain exactly what you're really trying to do. Maybe we know a better way of doing this. For example, if you're parsing the output of a search result, there might be an API that you can use.

Upvotes: 0

Ashley

Reputation: 4335

You asked for a regular expression solution, but your problem is a bit ill-defined. Regexes for HTML are only suitable for stop-gap or one-off work; beyond that you're probably just hurting yourself.

Since I'm not sure what your actual need and HTML source look like, this is a generic solution: it takes a URL and prints all the links found on that page without their query strings. For all reasonable purposes (and code), having id= is equivalent to having no id at all.

There are many ways, at least three or four of them good solutions, to do this in Perl. This is one that is often overlooked: libxml. Docs: XML::LibXML, URI, and URI::QueryParam (if you want better query manipulation).

use warnings;
use strict;
use URI;
use XML::LibXML;

my $source = shift || die "Give a URL!\n";

my $parser = XML::LibXML->new;
$parser->recover(1);

my $doc = $parser->load_html( location => $source );

for my $anchor ( $doc->findnodes('//a[@href]') )
{
    my $uri = URI->new_abs( $anchor->getAttribute("href"), $source );
    # Optional filters, left commented out:
    # next unless $uri->host eq "TARGET HOST NAME";
    # next unless $uri->path eq "TARGET PATH";
    # Clear the query completely; id= might as well be nothing.
    $uri->query(undef);
    print $uri, $/;
}

It sounds like maybe you’re using Bing! for scraping. This kind of thing is against pretty much every search engine’s ToS. Don’t do it. They have APIs (well, Google does at least) if you register and get a dev token.

Upvotes: 1

msw

Reputation: 43527

Since you cannot parse [X]HTML properly with regular expressions, you should look for the minimum possible context that will get you the href you want.

To the best of my knowledge, the one character that cannot appear inside a double-quoted href value is ". Therefore:

/href="([^"]+)"/

Should yield a URL in $1. I would sanity check it for URL-ishness before extracting the id string you want, and then:

s/\?id=\w+/?id=/

But this has hack written all over it, because you can't parse HTML with regular expressions. So it will probably break the first time you demonstrate it to a customer.
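
For what it's worth, the two steps can be put together roughly as follows. This is only a sketch: the helper name clean_href, the sample tag, and the http(s) prefix test used as the "URL-ishness" check are all assumptions, and the same fragility applies.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the first href out of a tag and blank its id value.
# Returns undef (in scalar context) when there is no href or it does not
# look like an http(s) URL.
sub clean_href {
    my ($html) = @_;
    my ($url) = $html =~ /href="([^"]+)"/ or return;
    return unless $url =~ m{^https?://}i;   # crude sanity check
    $url =~ s/\?id=\w+/?id=/;               # keep the "?" when blanking the value
    return $url;
}

# Made-up input for illustration.
my $tag     = '<a href="http://site.com/index.php?id=15" onmousedown="x">';
my $cleaned = clean_href($tag);
print defined $cleaned ? "$cleaned\n" : "no usable href\n";
```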

You should really check out proper Perl parsing: http://www.google.com/webhp?q=perl+html+parser

Upvotes: 3
