Question-er XDD

Reputation: 767

How can I extract parts of a URL using regular expressions in Perl?

The general format of a URL is scheme://domain:port/path?query_string#fragment_id

While domain (and possibly other parts of the URL) may contain Unicode characters, in the following we assume that only ASCII characters are used. Furthermore, we assume that

  • scheme only consists of letters a–z and A–Z;
  • domain does not contain :, ?, # or /;
  • port is a natural number, :port is optional;
  • path does not contain ? or #, path is optional;
  • query_string does not contain #, ?query_string is optional;
  • fragment_id can contain arbitrary characters, #fragment_id is optional.

Here is my code:

@urls = (
    "http://www.example.com/",
    "http://www80.local.com:80/",
    "https://www.ex221.ac.uk:442/perl/rulez?all+q#all.time");

foreach (@urls) {
    print "URL: $_\n";
    ($scheme,$domain,$port,$path,$query,$fragment) = (/(.)(.)(.)(.)(.)(.)/);
    print "SCHEME: $scheme, DOMAIN: $domain, PORT: $port\n";
    print "PATH: $path\n"; print "QUERY: $query\n";
    print "FRAGMENT: $fragment\n\n";
}

How can I change the regular expression in the code above so that it correctly separates the six components of a URL (scheme, domain, port, path, query string and fragment)? I would like to use the sample URLs to test that it works as expected.

Upvotes: 2

Views: 1323

Answers (2)

ikegami

Reputation: 385655

Regular expressions are documented in perlre (reference manual) and perlretut (tutorial).

That said, the following is all the information you need to complete your assignment.

To match any one of a set of characters, you can use a character class.

[abcdef]      # Matches a, b, c, d, e or f

You can use ranges of letters.

[a-zA-Z]      # Matches any lowercase or uppercase letter

To match any characters except some, start the class with ^.

[^abcdef]     # Matches any character except a, b, c, d, e or f

If you follow something with *, it means zero or more of that something.

ab*c          # Matches ac, abc, abbc, abbbc, ...
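As a small illustration of the pieces so far, here is a sketch that applies a range class and a negated class, each followed by *, to one of the sample URLs from the question. The variable names are made up for this example, and the capturing parentheses are not covered above (see perlretut).

```perl
use strict;
use warnings;

# Sample URL taken from the question.
my $url = "https://www.ex221.ac.uk:442/perl/rulez?all+q#all.time";

# A range class followed by * grabs the leading run of letters.
my ($letters) = $url =~ /^([a-zA-Z]*)/;

# A negated class followed by * grabs everything before the first '#'.
my ($before_hash) = $url =~ /^([^#]*)/;

print "letters:  $letters\n";
print "before #: $before_hash\n";
```

Here the first match stops at the colon because : is not a letter, and the second stops at # because the class excludes it.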

Don't forget to escape special characters with \ if you don't want their special meaning.

ab\*c         # Matches ab*c
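Putting these building blocks together, one possible sketch for the question's URL format looks like the following. It relies on a few constructs not listed above that are documented in perlre: + (one or more), ? after a group (optional), (?:...) (grouping without capturing), \d (a digit) and the /x flag (allows whitespace and comments inside the pattern). The helper name split_url is hypothetical, and keeping the leading slash as part of the path is an assumption about what the question wants.

```perl
use strict;
use warnings;

# split_url is a hypothetical helper name, not from the question.
# Returns (scheme, domain, port, path, query, fragment), with '' for
# missing optional parts, or an empty list if the URL does not match.
sub split_url {
    my ($url) = @_;
    my @parts = $url =~ m{
        ^
        ([a-zA-Z]+) ://       # scheme: letters only
        ([^:/?\#]+)           # domain: no colon, slash, question mark or hash
        (?: : (\d+) )?        # optional :port
        ( / [^?\#]* )?        # optional path, kept with its leading slash
        (?: \? ([^\#]*) )?    # optional ?query_string
        (?: \# (.*) )?        # optional #fragment_id
        $
    }x or return;
    return map { $_ // '' } @parts;
}

# Sample URLs from the question.
for my $url (
    "http://www.example.com/",
    "http://www80.local.com:80/",
    "https://www.ex221.ac.uk:442/perl/rulez?all+q#all.time",
) {
    my ($scheme, $domain, $port, $path, $query, $fragment) = split_url($url)
        or next;
    print "URL: $url\n";
    print "SCHEME: $scheme, DOMAIN: $domain, PORT: $port\n";
    print "PATH: $path\n";
    print "QUERY: $query\n";
    print "FRAGMENT: $fragment\n\n";
}
```

Each negated class simply lists the characters that would start the next component, which is why the assumptions in the question translate almost one-to-one into the pattern. Note that # has to be escaped under /x, otherwise it would start a comment.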

Upvotes: 1

Miguel Prz

Reputation: 13792

I recommend that you use the URI module:

use strict;
use warnings;
use URI;

my @urls = (
    "http://www.example.com/",
    "http://www80.local.com:80/",
    "https://www.ex221.ac.uk:442/perl/rulez?all+q#all.time");

foreach (@urls) {
    my $uri = URI->new($_);
    print "URL: $_\n";
    print "SCHEME: ", $uri->scheme, "\n";
    print "DOMAIN: ", $uri->host, "\n";
    # port falls back to the scheme's default (80 for http) when none is given
    print "PORT: ", $uri->port, "\n";
    print "PATH: ", $uri->path, "\n";
    # query and fragment return undef when the URL has none
    print "QUERY: ", ($uri->query // ''), "\n";
    print "FRAGMENT: ", ($uri->fragment // ''), "\n\n";
}

Upvotes: 8
