Removing fragments from URLs

Question

I want to strip fragments (like #foobar) from URLs, but only according to certain rules. Normally a brute-force regex would have solved the problem:

$url =~ s/#.+//;

but I want it to take several cases into account, most notably these transformations:

http://www.example.com/#/           => http://www.example.com/
http://www.example.com/#foo/bar#foo => http://www.example.com/#foo/bar
http://www.example.com/#foo?a=1     => http://www.example.com/#foo?a=1
http://www.example.com/#foo/?a=1    => http://www.example.com/#foo/?a=1

So the rules should be:

1) If /#/, just replace it with /.

2) If # is not followed, anywhere later in the URL, by a / or ?, remove it along with everything after it.

Any ideas on how to handle this properly? Is a single regex enough, or should I use a module?

Miller · Accepted Answer

The regex s{#(?:/|[^?/]*)$}{} will cover these rules as stated:

If /#/, just replace it with /.
If # is not followed upstream by a / or ?, remove it.
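
To make the pattern easier to read, here is a minimal sketch of the same substitution written with the /x modifier and wrapped in a subroutine. The name strip_fragment is made up for illustration and is not part of the regex itself. Note that under /x the # must be escaped, because a bare # would start a comment.

```perl
use strict;
use warnings;

# strip_fragment is a hypothetical helper name used only for this sketch.
sub strip_fragment {
    my ($url) = @_;
    $url =~ s{
        \#            # a literal hash sign, escaped because of the x modifier
        (?:
            /         # rule 1: the fragment is a lone slash
          |
            [^?/]*    # rule 2: a fragment containing no slash or question mark
        )
        $             # either way, the fragment must run to the end of the URL
    }{}x;
    return $url;
}

# prints http://www.example.com/#foo/bar
print strip_fragment('http://www.example.com/#foo/bar#foo'), "\n";
```

The anchor at the end does most of the work: the first # in that URL is followed by a / before the end of the string, so the match fails there and the engine moves on to the last #.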

And the test suite to demonstrate:

use strict;
use warnings;

use Test;

BEGIN { plan tests => 4 }

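# Each line under __DATA__ holds "source => expected result".
while (<DATA>) {
    chomp;
    my ($source, $goal) = split /\s*=>\s*/;

    # Strip a trailing "#/" or a trailing fragment containing no "/" or "?".
    $source =~ s{#(?:/|[^?/]*)$}{};

    ok($source, $goal);
}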

__DATA__
http://www.example.com/#/           => http://www.example.com/
http://www.example.com/#foo/bar#foo => http://www.example.com/#foo/bar
http://www.example.com/#foo?a=1     => http://www.example.com/#foo?a=1
http://www.example.com/#foo/?a=1    => http://www.example.com/#foo/?a=1

Output:

1..4
# Running under perl version 5.018002 for MSWin32
# Current time local: Fri May 30 15:01:04 2014
# Current time GMT:   Fri May 30 22:01:04 2014
# Using Test.pm version 1.26
ok 1
ok 2
ok 3
ok 4
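
If you prefer Test::More, which ships with core Perl, the same four checks could be sketched like this. The @cases array layout and the test descriptions are my own choices for illustration; the URLs and expected results are the ones from the question.

```perl
use strict;
use warnings;
use Test::More tests => 4;

# Each entry is [ source URL, expected result ], copied from the question.
my @cases = (
    [ 'http://www.example.com/#/',           'http://www.example.com/'          ],
    [ 'http://www.example.com/#foo/bar#foo', 'http://www.example.com/#foo/bar'  ],
    [ 'http://www.example.com/#foo?a=1',     'http://www.example.com/#foo?a=1'  ],
    [ 'http://www.example.com/#foo/?a=1',    'http://www.example.com/#foo/?a=1' ],
);

for my $case (@cases) {
    my ($source, $goal) = @$case;

    # Work on a copy so the original URL can be used in the test description.
    (my $got = $source) =~ s{#(?:/|[^?/]*)$}{};

    is($got, $goal, "fragment rules applied to $source");
}
```

Compared with Test.pm, is() reports both the value it got and the value it expected when a case fails, which makes it quicker to see which rule was violated.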
