Reputation: 37626
Original problem: I have a list of subdomains of one domain, e.g., a.domain.fr
, b.domain.fr
, and so one. The domain itself can be in the list.
I want to find domain.fr
from this list of domain, which is finding the longest common suffix that does not start with a dot .
.
The list of domains is a bash string, and the domains are separated by a single space.
I read Longest common prefix of two strings in bash but I did not manage to convert it to suffix:
echo $domains | tr ' ' '\n' | sed -e 'N;s/^.*\(.*\)\n.*\1$/\1/'
...prints a bunch of empty lines, and:
echo $domains | tr ' ' '\n' | sed -e 'N;s/^.*\.\(.*\)\n.*\.\1$/\1/'
...prints a bunch of fr
.
I am not looking for extreme portability, something that works on any Linux distribution without extra installation is ok for me.
I am looking for a solution that can find sub-domain as the common "domain", e.g., for the following list:
a.d.domain.fr b.d.domain.fr c.d.domain.fr
...the common domain should be d.domain.fr
, but if you have an efficient solution that only works for top-domain (e.g., that would return domain.fr
for the above list), I am also interested.
Sample strings (one sample per line):
a.domain.fr domain.fr b.a.domain.fr b.domain.fr u.domain.fr
domain.fr
a.domain.fr
a.domain.fr b.domain.fr domain.fr
a.d.domain.fr b.d.domain.fr c.d.domain.fr
Upvotes: 3
Views: 506
Reputation: 8134
This is a pure Bash program containing a possible solution:
#! /bin/bash -p
# A space-separated list of domains
domainlist=$1
longest=
longest_rx='\.([^ ]*) .*\.\1$'
for domain in $domainlist ; do
if [[ -z $longest ]] ; then
longest=$domain
elif [[ ".$longest .$domain" =~ $longest_rx ]] ; then
longest=${BASH_REMATCH[1]}
else
longest=
break
fi
done
printf "longest='%s'\n" "$longest"
./progname 'a.d.domain.fr b.d.domain.fr c.d.domain.fr'
(outputting longest='d.domain.fr'
).\1
). That should be OK for Linux (tested with Bash 3 on a 10+ year old system), but not some other systems (including some Unix systems).Upvotes: 1
Reputation: 4890
I enjoyed writing an awk program to solve this problem. This program finds the longest suffix string and shortest suffix string using the provided list of strings.
The longest common suffix can be longer, and the shortest common suffix is 1 char.
The matching algorithm find right side match with the provided strings.
{
# read the fields into a unique array
for(i = 1; i <= NF; i++){
if ($i in uniquenessArr == 0) { #accept a field into arr only if not in the uniquenessArr
uniquenessArr[$i] = 1;
arr[++arrLen] = $i;
}
}
# arrLen is count of fields to compute
minLen = 9999999; # initial length of minimal matched string
for(currStr in arr){ # for each string in arr
len = length(arr[currStr]);
# print currStr ") " arr[currStr] " (" len ")";
for(targetStr in arr) { # match each string against longer strings in arr
if ( (len < length(arr[targetStr])) && match(arr[targetStr], arr[currStr]"$") ) {
# currStr is matched into a longer string
if (maxLen <= RLENGTH ) {
maxLen = RLENGTH;
maxMatch = arr[currStr];
}
if (minLen >= RLENGTH ) {
minLen = RLENGTH;
minMatch = arr[currStr];
}
}
}
}
printf("maxMatch = %s \t minMatch = %s\n", maxMatch, minMatch);
}
a.domain.fr domain.fr b.a.domain.fr b.domain.fr u.domain.fr
domain.fr
a.domain.fr
a.domain.fr b.domain.fr domain.fr
a.d.domain.fr b.d.domain.fr c.d.domain.fr d.domain.fr c.b.d.domain.fr b.c.d.domain.fr
awk -f script-1.awk input
The first for loop read all the fields into a set (no duplicates) The logic is scanning each string against longer string. If a match is found mark the longest and shortest matching strings.
Upvotes: 0
Reputation: 26481
This single line awk does what you expect:
awk '{d=$1; for(i=2;i<=NF;++i) while(d && ! match($i,d"$")) sub(/[^.]*./,"",d); print d}'
The above solution will only print the matching domain part.
If you want to make it a bit more robust you have to add some corrections to it because:
* match
will match a regular expression and .
matches any character
* you have to ensure that the ere
starts with .
These things were not taken care of in the first example.
The correction is here:
awk '{d=$1; gsub(/[.]/,"\\.",d); for(i=2;i<=NF;++i) while(d && ! match($i,"(^|[.])"d"$")) { sub(/[^.]*([.]|$)/,"",d)}; gsub(/[\\][.]/,".",d);print d}'
Upvotes: 1
Reputation: 74665
You can use awk to compare each part of the domain one by one and keep track of the number of common parts:
# on the first line
NR == 1 {
# split first domain into "parts" for comparison with rest
n = split($1, parts, /\./)
# initialise result
c = n
}
# on every line
{
for (i = 1; i <= NF; ++i) {
# split current record into "s"
m = split($i, s, /\./)
# increment j as long as the last elements of "parts" match "s"
for (j = 0; j < c && parts[n-j] == s[m-j]; ++j);
# update count if lower
if (j < c) c = j
}
}
# print the result, joining the parts with a "." and ending with a newline
END { for (i = 1; i <= c; ++i) printf "%s%s", parts[n-c+i], (i < c ? "." : ORS) }
Save the script and run it like awk -f script.awk file
.
Upvotes: 3
Reputation: 141235
It's a little hard to match longest trailing string with sed, as the first .*
will eat all characters from input. But we can just be dummy and simply rev
erse the string. I also added the \.
to match inside sed, so that domain.fr
and not_in_domain.fr
don't result in domain.fr
but in fr
.
printf "%s\n" a.domain.fr b.domain.fr | rev | sed -e 'N;s/^\(.*\)\..*\n\1\..*$/\1/' | rev
will output:
domain.fr
As this sed
can handle only two strings at a time, for more expressions we have to "fold" it:
printf "%s\n" a.a.domain.fr b.a.domain.fr b.not_in_a.domain.fr |
rev |
{
# the function
f() {
printf "%s.\n" "$@" |
sed -e 'N;s/^\(.*\)\..*\n\1\..*$/\1/';
};
# load initial
IFS= read -r res;
# for each line
while IFS= read -r line; do
# right fold it
res=$(f "$res" "$line");
done;
printf "%s\n" "$res";
} | rev
@edit fixed matching leading dot by including it in sed
Upvotes: 2