Holt

Reputation: 37626

Longest common suffix for domain names in bash

Original problem: I have a list of subdomains of one domain, e.g., a.domain.fr, b.domain.fr, and so on. The domain itself can also be in the list.

I want to find domain.fr from this list of domains, i.e., find the longest common suffix that does not start with a dot (.).

The list of domains is a bash string, and the domains are separated by a single space.

I read Longest common prefix of two strings in bash but I did not manage to adapt it to suffixes:

echo $domains | tr ' ' '\n' | sed -e 'N;s/^.*\(.*\)\n.*\1$/\1/'

...prints a bunch of empty lines, and:

echo $domains | tr ' ' '\n' | sed -e 'N;s/^.*\.\(.*\)\n.*\.\1$/\1/'

...prints a bunch of fr.


I am not looking for extreme portability; something that works on any Linux distribution without extra installation is fine for me.

I am looking for a solution that can find a sub-domain as the common "domain", e.g., for the following list:

a.d.domain.fr b.d.domain.fr c.d.domain.fr

...the common domain should be d.domain.fr, but if you have an efficient solution that only works for the top domain (e.g., one that would return domain.fr for the above list), I am also interested.


Sample strings (one sample per line):

a.domain.fr domain.fr b.a.domain.fr b.domain.fr u.domain.fr
domain.fr
a.domain.fr
a.domain.fr b.domain.fr domain.fr
a.d.domain.fr b.d.domain.fr c.d.domain.fr

Upvotes: 3

Views: 506

Answers (5)

pjh

Reputation: 8134

This is a pure Bash program containing a possible solution:

#! /bin/bash -p

# A space-separated list of domains
domainlist=$1

longest=
# Matches the longest dot-delimited suffix shared by two space-separated,
# dot-prefixed domains, capturing it in BASH_REMATCH[1]
longest_rx='\.([^ ]*) .*\.\1$'
for domain in $domainlist ; do
    if [[ -z $longest ]] ; then
        # First domain: the longest possible common suffix is the domain itself
        longest=$domain
    elif [[ ".$longest .$domain" =~ $longest_rx ]] ; then
        # Shrink the candidate to the suffix it shares with the current domain
        longest=${BASH_REMATCH[1]}
    else
        # No common dot-delimited suffix; give up
        longest=
        break
    fi
done

printf "longest='%s'\n" "$longest"
  • Example usage is: ./progname 'a.d.domain.fr b.d.domain.fr c.d.domain.fr' (outputting longest='d.domain.fr').
  • It makes no attempt to check for bad input (domains beginning with dots, domains containing glob metacharacters, ...).
  • It depends on Bash regular expressions supporting backreferences (\1). That should be OK for Linux (tested with Bash 3 on a 10+ year old system), but may not be on some other systems (including some Unix systems).
  • I haven't done significant performance testing, but there are no obvious performance problems (the program completes in a few milliseconds) with the example inputs given in the question.
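
As a quick check of the core step, the backreference match can be tried directly at a Bash prompt; a minimal sketch, assuming a glibc-based Linux regex engine (which supports \1 in EREs, as noted above):

rx='\.([^ ]*) .*\.\1$'
# Both domains are prefixed with a dot, exactly as the script does,
# so the captured suffix always starts at a label boundary.
[[ ".a.d.domain.fr .b.d.domain.fr" =~ $rx ]] && echo "${BASH_REMATCH[1]}"
# should print: d.domain.fr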

Upvotes: 1

Dudi Boy

Reputation: 4890

I enjoyed writing an awk program to solve this problem. The program finds the longest and the shortest suffix string among the provided list of strings.

The longest common suffix can span several labels, while the shortest common suffix can be as short as one character.

The matching algorithm looks for right-side (suffix) matches among the provided strings.

script-1.awk:

{
    # read the fields into a unique array
    for(i = 1; i <= NF; i++){
        if (!($i in uniquenessArr)) { # accept a field into arr only if not already in uniquenessArr
            uniquenessArr[$i] = 1;
            arr[++arrLen] = $i;
        }
    }
    # arrLen is the count of unique fields to compare
    maxLen = 0;       # initial length of the longest matched string
    minLen = 9999999; # initial length of the shortest matched string
    for(currStr in arr){ # for each string in arr
        len = length(arr[currStr]);
        # print currStr ") " arr[currStr] "   (" len ")";
        for(targetStr in arr) { # match each string against longer strings in arr
            if ( (len < length(arr[targetStr])) && match(arr[targetStr], arr[currStr]"$") ) {
                # currStr is matched into a longer string
                if (maxLen <= RLENGTH ) {
                    maxLen = RLENGTH;
                    maxMatch = arr[currStr];
                }
                if (minLen >= RLENGTH ) {
                    minLen = RLENGTH;
                    minMatch = arr[currStr];
                }
            }
        }
    }
    printf("maxMatch = %s \t minMatch = %s\n", maxMatch, minMatch);
}

Input test file:

a.domain.fr domain.fr b.a.domain.fr b.domain.fr u.domain.fr
domain.fr
a.domain.fr
a.domain.fr b.domain.fr domain.fr
a.d.domain.fr b.d.domain.fr c.d.domain.fr d.domain.fr c.b.d.domain.fr b.c.d.domain.fr

Execution command:

awk -f script-1.awk input

Some notes:

The first for loop reads all the fields into a set (no duplicates). The logic then scans each string against the longer strings; whenever a match is found, the longest and the shortest matching strings are recorded.
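
Since the question keeps the list in a bash string rather than in a file, the script can also be fed via stdin; a small sketch, using the domains variable name from the question:

domains='a.domain.fr b.domain.fr domain.fr'
echo "$domains" | awk -f script-1.awk
# should report domain.fr as both maxMatch and minMatch for this list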

Upvotes: 0

kvantour

Reputation: 26481

This awk one-liner does what you expect:

awk '{d=$1; for(i=2;i<=NF;++i) while(d && ! match($i,d"$")) sub(/[^.]*./,"",d); print d}'
  • You know that the first domain in the list is the largest possible solution.
  • If that solution does not match the next field, deconstruct that solution by removing the leading entry.
  • Keep doing that until you find a match or the full domain is removed.

The above solution will only print the matching domain part.
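
Because the question stores the list in a bash string, the one-liner can be fed from the shell; a quick sketch, using the domains variable name from the question:

domains='a.d.domain.fr b.d.domain.fr c.d.domain.fr'
echo "$domains" | awk '{d=$1; for(i=2;i<=NF;++i) while(d && ! match($i,d"$")) sub(/[^.]*./,"",d); print d}'
# should print: d.domain.fr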

If you want to make it a bit more robust, you have to add some corrections, because:
  • match matches a regular expression, so an unescaped . matches any character
  • you have to ensure that the matched suffix starts with a . (i.e. at a label boundary)
These things were not taken care of in the first example.

The correction is here:

awk '{d=$1; gsub(/[.]/,"\\.",d); for(i=2;i<=NF;++i) while(d && ! match($i,"(^|[.])"d"$")) { sub(/[^.]*([.]|$)/,"",d)}; gsub(/[\\][.]/,".",d);print d}'
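
As a rough illustration of the difference, using the domain.fr / not_in_domain.fr pair that KamilCuk's answer below also mentions: the first one-liner accepts a trailing match that is not on a label boundary, while the corrected one trims down to fr:

echo 'domain.fr not_in_domain.fr' |
awk '{d=$1; for(i=2;i<=NF;++i) while(d && ! match($i,d"$")) sub(/[^.]*./,"",d); print d}'
# should print: domain.fr (the literal trailing text matches although not_in_domain is a different label)

echo 'domain.fr not_in_domain.fr' |
awk '{d=$1; gsub(/[.]/,"\\.",d); for(i=2;i<=NF;++i) while(d && ! match($i,"(^|[.])"d"$")) { sub(/[^.]*([.]|$)/,"",d)}; gsub(/[\\][.]/,".",d);print d}'
# should print: fr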

Upvotes: 1

Tom Fenech

Reputation: 74665

You can use awk to compare each part of the domain one by one and keep track of the number of common parts:

# on the first line
NR == 1 {
  # split first domain into "parts" for comparison with rest
  n = split($1, parts, /\./)
  # initialise result
  c = n
}

# on every line
{ 
  for (i = 1; i <= NF; ++i) {
    # split current record into "s"
    m = split($i, s, /\./)

    # increment j as long as the last elements of "parts" match "s"
    for (j = 0; j < c && parts[n-j] == s[m-j]; ++j);

    # update count if lower
    if (j < c) c = j
  } 
}

# print the result, joining the parts with a "." and ending with a newline
END { for (i = 1; i <= c; ++i) printf "%s%s", parts[n-c+i], (i < c ? "." : ORS) }

Save the script and run it like awk -f script.awk file.
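
The script also works when the list lives in a bash string, as in the question; a small sketch, using the domains variable name from the question:

domains='a.d.domain.fr b.d.domain.fr c.d.domain.fr'
echo "$domains" | awk -f script.awk
# should print: d.domain.fr

Note that for a multi-line file the script reports the parts common to all lines, since parts is only initialised from the first line.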

Upvotes: 3

KamilCuk

Reputation: 141235

It's a little hard to match the longest trailing string with sed, as the first .* will eat all the characters from the input. But we can keep it simple and just reverse the strings. I also added a \. to the sed match, so that domain.fr and not_in_domain.fr result in fr rather than domain.fr.

printf "%s\n" a.domain.fr b.domain.fr | rev | sed -e 'N;s/^\(.*\)\..*\n\1\..*$/\1/' | rev

will output:

domain.fr

As this sed can handle only two strings at a time, for longer lists we have to "fold" it:

printf "%s\n" a.a.domain.fr b.a.domain.fr b.not_in_a.domain.fr | 
rev | 
{ 
    # the function
    f() { 
       printf "%s.\n" "$@" | 
       sed -e 'N;s/^\(.*\)\..*\n\1\..*$/\1/'; 
    }; 
    # load initial
    IFS= read -r res; 
    # for each line
    while IFS= read -r line; do
         # right fold it
         res=$(f "$res" "$line");
    done; 
    printf "%s\n" "$res"; 
} | rev

@edit fixed matching leading dot by including it in sed
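
The question's list is a space-separated bash string; a minimal sketch of feeding it into the same pipeline, with the domains variable name taken from the question and deliberately left unquoted so the shell splits it into one word per domain:

domains='a.domain.fr b.domain.fr'
printf "%s\n" $domains | rev | sed -e 'N;s/^\(.*\)\..*\n\1\..*$/\1/' | rev
# should print: domain.fr (for more than two domains, use the fold construct above)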

Upvotes: 2
