Samir Sadek

Reputation: 1690

Perl : Extract domain name

Extract the domain name from a URL

This is yet another request to parse a URL, but the examples I have found are incomplete or theoretical. I would like something that is sure to work in Perl.

I have the following URLs:

https://vimdoc.sourceforge.net/htmldoc/pattern.html
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html
http://www.catonmat.net/download/perl1line.txt
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM
http://www.gnu.org/software/coreutils/manual/coreutils.html
http://www.catonmat.net/download/perl1line.txt
https://feedly.com/i/my
http://vimhelp.appspot.com/
https://git-scm.com/doc
https://read.amazon.com/
https://github.com/netsamir/following
https://scotch.io/
https://servicios.dgi.gub.uy/
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/
https://training.github.com/
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/
https://www.ctan.org/
https://www.eff.org/
https://www.mybeluga.com/
https://www.solveforx.com/
https://www.symynd.com/
https://www.symynd.com/#
https://www.tizen.org/
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS

I am trying to extract only the domain name. For instance:

linksyssmartwifi.com
amazon.com
github.com

I have tried with Perl and Vim but could not accomplish the task. My best approximation is the following:

 perl -pe 's!(^https?\://.*[\.](.+\..+?)/.*$)!$1 -- [$2] !g' all_urls_sorted.txt

Some of them are parsed correctly (see the result in []), others are not:

   https://sites.google.com/site/steveyegge2/singleton-considered-stupid -- [google.com] 
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/ -- [heroku.com] 
https://training.github.com/ -- [github.com] 
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/ -- [whatsapp.com] 
https://wiki.haskell.org/GHC -- [haskell.org] 

As my tests showed, the URLs whose domain starts straight after the // (in https?://), that is, those without a subdomain, are being excluded.

If you know how to solve this problem, I would be very grateful.

Thanks

Upvotes: 1

Views: 3121

Answers (3)

Laurel

Reputation: 6173

A regex solution is:

//(?:[^./]+[.])*([^/.]+[.][^/.]+)/

If the trailing slash is optional, just add a ? after it:

//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?

This should be used with the global modifier and a delimiter other than /.

Essentially, it's looking between the // and the next /.

If there are any extra sub-domains, they will be caught by the (?:[^./]+[.])*. The main domain will fall into the capture group ([^/.]+[.][^/.]+).
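
As a minimal sketch of how this pattern might be used in Perl, the snippet below wraps it in a helper. The name extract_domain and the three sample URLs (taken from the question) are only illustrative. It uses m{...} as the delimiter so the slashes need no escaping, and the variant with the optional trailing slash. The global modifier is left out here because each string holds a single URL; it would matter when several URLs sit in one string.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns the last two dot-separated labels of the
# host part, or undef when the pattern does not match.
sub extract_domain {
    my ($url) = @_;
    # (?:[^./]+[.])*  skips any leading sub-domains
    # ([^/.]+[.][^/.]+) captures "name.tld"
    return $url =~ m{//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?} ? $1 : undef;
}

for my $url (qw(
    https://vimeo.com/54505525
    https://read.amazon.com/
    http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html
)) {
    my $domain = extract_domain($url);
    print "$url -- [", $domain // 'no match', "]\n";
}
```

Like any approach that keeps only the last two labels, this assumes the registrable domain is always two labels long.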

Upvotes: 2

Miller

Reputation: 35208

Use the URI module:

#!/usr/bin/env perl

use strict;
use warnings;
use v5.10;

use URI;

while (<DATA>) {
    chomp;
    my $uri = URI->new($_);
    my $host = $uri->host;
    my ($domain) = $host =~ m/([^.]+\.[^.]+$)/;
    say $domain;
}

__DATA__
https://vimdoc.sourceforge.net/htmldoc/pattern.html
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html
http://www.catonmat.net/download/perl1line.txt
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM
http://www.gnu.org/software/coreutils/manual/coreutils.html
http://www.catonmat.net/download/perl1line.txt
https://feedly.com/i/my
http://vimhelp.appspot.com/
https://git-scm.com/doc
https://read.amazon.com/
https://github.com/netsamir/following
https://scotch.io/
https://servicios.dgi.gub.uy/
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/
https://training.github.com/
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/
https://www.ctan.org/
https://www.eff.org/
https://www.mybeluga.com/
https://www.solveforx.com/
https://www.symynd.com/
https://www.symynd.com/#
https://www.tizen.org/
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS

Outputs:

sourceforge.net
linksyssmartwifi.com
catonmat.net
github.com
google.com
gnu.org
catonmat.net
feedly.com
appspot.com
git-scm.com
amazon.com
github.com
scotch.io
gub.uy
sourcemaking.com
stackedit.io
stripe.com
heroku.com
github.com
vimeo.com
vimeo.com
whatsapp.com
ctan.org
eff.org
mybeluga.com
solveforx.com
symynd.com
symynd.com
tizen.org
workforall.net

Upvotes: 5

hd1

Reputation: 34677

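My best approximation uses URI::URL:

use URI::URL;

# @filecontents is assumed to hold one URL per element (already chomped)
foreach my $uri (@filecontents) {
    my $uriobj = URI::URL->new($uri);
    my $host = $uriobj->host;
    my @parts = split /\./, $host;
    # Join the last two labels with a dot, e.g. "github.com"
    print "$uri -- $parts[-2].$parts[-1]\n";
}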

Hope that helps.

Upvotes: 3
