Blnukem

Reputation: 183

Perl Strip Regex For URI

I'm trying to strip every https, http, www, /, : and . out of a domain name so I can use the result as a user account folder name on my system. In other words, I need to turn a URL like "https://www.My-Domain.com/" into "My-Domaincom". I'm close, but I just can't seem to get it to work.

our $DomainAccount = lc($ENV{HTTP_REFERER});
  $DomainAccount =~ s/^http:\/\/|^https:\/\///;
  $DomainAccount =~ s/^www\.|(/.)//;

Upvotes: 1

Views: 873

Answers (2)

Wiktor Stribiżew

Reputation: 627082

Match http:// or https://, optionally followed by www., then capture the two dot-separated parts of the host in separate groups and match the rest of the string. Replacing the whole match with $1$2 drops the scheme, the www., the dot between the two captured parts, and everything after the host:

$DomainAccount =~ s/^https?:\/\/(?:www\.)?([^\/.]+)\.([^\/.]+).*/$1$2/i;

Output for "https://www.My-Domain.com/": My-Domaincom


Note I added a case-insensitive flag /i just to make sure the pattern can handle HTTP:// casing, too.

The regex matches:

  • ^ - start of string
  • https?:\/\/ - a literal character sequence http:// or https://
  • (?:www\.)? - one or zero occurrences of a literal character sequence www.
  • ([^\/.]+) - Group 1: one or more characters other than / and .
  • \. - a literal dot
  • ([^\/.]+) - Group 2: one or more characters other than / and .
  • .* - rest of the line
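
The breakdown above can also be written directly into the pattern with the /x flag, which lets each part carry its own comment. This is only a sketch of the same one-step regex: the sub name strip_domain is made up for illustration, and {} delimiters are used so the slashes need no escaping.

```perl
use strict;
use warnings;

# Hypothetical helper; the pattern is the one-step regex from the answer.
sub strip_domain {
    my ($url) = @_;
    $url =~ s{
        ^https?://      # scheme: http:// or https://
        (?:www\.)?      # optional leading "www."
        ([^/.]+)        # group 1: first host label
        \.              # the dot between the two labels
        ([^/.]+)        # group 2: second host label
        .*              # rest of the line (path, query, ...)
    }{$1$2}xi;
    return $url;
}

print strip_domain('https://www.My-Domain.com/'), "\n";   # My-Domaincom
```

Because only two labels are captured, a host with more dots is cut off after the second label.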

To address choroba's comment, here is a two-step solution that also works when the host part contains more than one dot:

$DomainAccount =~ s/^https?:\/\/(?:www\.)?([^\/]+).*/$1/i;
$DomainAccount =~ s/\.//g;
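
As a quick check of the two-step version, the substitutions can be wrapped in a small sub and tried on a host with several dots. The sub name domain_to_account and the second sample URL are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions above.
sub domain_to_account {
    my ($url) = @_;
    # Step 1: keep only the host, dropping the scheme, an optional "www." and the path.
    $url =~ s{^https?://(?:www\.)?([^/]+).*}{$1}i;
    # Step 2: remove every remaining dot.
    $url =~ s/\.//g;
    return $url;
}

print domain_to_account('https://www.My-Domain.com/'), "\n";      # My-Domaincom
print domain_to_account('http://shop.example.co.uk/cart'), "\n";  # shopexamplecouk
```

Note that a host carrying a port, such as example.com:8080, would keep its colon with this pattern, so that case would need an extra substitution.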

Upvotes: 1

choroba

Reputation: 241988

The URI module can extract the host for you, but you still have to remove the www yourself:

#! /usr/bin/perl
use warnings;
use strict;

use URI;

my $url = 'URI'->new('https://www.My-Domain.com/');
my $account = $url->host;
# Drop leading labels while more than one dot remains (also terminates for a host without dots).
$account =~ s/^[^.]*\.// while $account =~ tr/.// > 1;
$account =~ s/\.//;
print $account, "\n";

This leaves only the top-level and second-level domains in the result (try it with e.g. http://some.very.long.domain.name.com).
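
To see the trimming step on its own, here is a sketch that applies it to bare host names, so it runs without the URI module. The sub name second_level_account is invented for illustration, and the loop condition is written as "> 1" so that a host with no dot at all cannot loop forever.

```perl
use strict;
use warnings;

# Hypothetical helper: the label-trimming logic applied to a plain host string.
sub second_level_account {
    my ($host) = @_;
    # tr/.// counts the dots without changing the string;
    # strip one leading label per pass while more than one dot remains.
    $host =~ s/^[^.]*\.// while $host =~ tr/.// > 1;
    $host =~ s/\.//;    # remove the single remaining dot
    return $host;
}

print second_level_account('some.very.long.domain.name.com'), "\n";  # namecom
print second_level_account('www.My-Domain.com'), "\n";               # My-Domaincom
```

The trade-off compared with stripping only www. is that distinct subdomains of the same site all map to the same account name.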

Upvotes: 1
