Reputation: 1126
I am given a file (usually the output of a grep) that contains one URL per line.
I am looking for a way to sort the URLs so that hosts belonging to the same domain end up next to each other.
Here is an example of a file containing what there is to sort:
www.example.com
www.my-website.com
www.example.org
my-website.com
www.my-website.org
And how it should be sorted:
www.example.com
www.example.org
my-website.com
www.my-website.com
www.my-website.org
For now, I use a solution that is quite suboptimal, because it effectively sorts by top-level domain first:
... | rev | sort -u | rev
# notice the -u flag in the sort; it is optional but appreciated
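On the sample above, this yields the following, grouped by TLD rather than by domain name (assuming the list is in dom.txt and a C collation locale):
$ rev dom.txt | sort -u | rev
www.example.org
www.my-website.org
www.example.com
my-website.com
www.my-website.com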
It should be said that this piece of software is to be used in (foreseeably) two cases:
When analysing the content of httpd conf files (especially grep-ing ServerName-s and ServerAlias-es and feeding that to DNS-querying operations; a sketch of that extraction follows below)
When analysing the result of some web crawling (mostly a recursive wget fed to a Flex scanner, to extract URLs)
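For the first case, the extraction step might look something like this (a sketch only; the conf directory path is an assumption):
$ grep -hriE '^[[:space:]]*(ServerName|ServerAlias)[[:space:]]' /etc/httpd/conf.d/ \
    | awk '{ for (i = 2; i <= NF; i++) print $i }'   # one hostname per line
This is what produces the one-URL-per-line file that then needs the smart sort.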
In both cases, most of the URLs are related to each other.
How can I "smart"-sort these URLs in bash?
Upvotes: 0
Views: 791
Reputation: 3141
Put a dot before www-less hostnames with sed, so that sort -t . -k2 compares every line starting at the domain name, then strip the dot again:
$ sed -e 's/^\([^.]*\.[^.]*\)$/.\1/' dom.txt | sort -t . -k2 | sed -e 's/^\.//'
www.example.com
www.example.org
my-website.com
www.my-website.com
www.my-website.org
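For reference, a more general variant of the same decorate-sort-undecorate idea (a sketch, assuming the only prefix to ignore is a literal leading www.) prepends the www-less name as an explicit sort key, sorts on it, then cuts it away:
$ awk '{ key = $0; sub(/^www\./, "", key); print key "\t" $0 }' dom.txt | sort | cut -f2-
This handles hostnames with any number of labels, since it never has to guess how many dots a www-less name contains.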
Upvotes: 2