cullan

Reputation: 280

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org

I'm trying to use awk regular expressions on Linux to find the line that contains the longest repeated1 word, so in this case it would return the line

5,letswelcomewelcomeyou.org

How do I do that?

1 Meaning "immediately repeated", i.e., abcabc, but not abcXabc.

Upvotes: 4

Views: 666

Answers (2)

Benjamin W.

Reputation: 52191

A pure awk implementation would be rather long-winded, since awk regexes don't support backreferences, and a backreference is exactly what simplifies this approach.
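For contrast, a quick check (assuming GNU grep, where `\1` in `-E` mode is an extension to POSIX ERE) shows what the backreference buys us:

```shell
# A backreference lets ERE say "the same sequence again, immediately";
# awk's regexes have no equivalent, so '(.+)\1' can't be ported directly.
printf 'abcabc\nabcXabc\n' | grep -E '(.+)\1'
# → abcabc   (abcXabc has no *immediately* repeated sequence)
```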

I've added one line to the example input file to cover the case of multiple longest words:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org

And this gets the lines with the longest repeated sequence:

cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile

Since this pipeline is fairly opaque, let's split up what it does and look at the output of each stage:

  1. Remove the first column with the line number to avoid matches on line numbers with repeated digits:

    $ cut -d ',' -f 2 infile
    helloguys.ca
    byegirls.com
    hellohelloboys.ca
    hellobyebyedad.com
    letswelcomewelcomeyou.org
    letscomewelcomewelyou.org
    
  2. Get all lines with a repeated sequence, extract just that repeated sequence:

    ... | grep -Eo '(.*)\1'
    ll
    hellohello
    ll
    byebye
    welcomewelcome
    comewelcomewel
    
  3. Get the length of each of those lines:

    ... | awk '{ print length(), $0 }'
    2 ll
    10 hellohello
    2 ll
    6 byebye
    14 welcomewelcome
    14 comewelcomewel
    
  4. Sort by the first column, numerically, descending:

    ...| sort -k 1,1 -nr
    14 welcomewelcome
    14 comewelcomewel
    10 hellohello
    6 byebye
    2 ll
    2 ll
    
  5. Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:

    ... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
    welcomewelcome
    comewelcomewel
    
  6. Pipe this into grep, using the -f - argument to read the patterns from stdin:

    ... | grep -f - infile
    5,letswelcomewelcomeyou.org
    6,letscomewelcomewelyou.org
    

Limitations

While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
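The limitation is easy to reproduce: grep -o reports the leftmost match, so once welwel is consumed, the longer repeat that overlaps it is never seen.

```shell
# Leftmost match wins: 'welwel' is found first, and the overlapping
# 'welcomewelcome' (starting at offset 3) is never considered.
printf 'welwelcomewelcome\n' | grep -Eo '(.*)\1'
# → welwel
```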

Alternative solution with more awk, less sort

As pointed out by tripleee in comments, this can be simplified by combining the two awk steps and the sort step into a single awk step, likely improving performance:

$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile

Let's look at that awk step in more detail, with expanded variable names for clarity:

{
    # New longest match: throw away stored longest matches, reset index
    if (length() > max_len) {
        max_len = length()
        delete arr_longest
        idx = 1
    }

    # Add line to longest matches
    if (length() >= max_len)
        arr_longest[idx++] = $0
}

# Print all the longest matches
END {
    for (idx in arr_longest)
        print arr_longest[idx]
}
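For reference, running the combined pipeline on the six-line sample (written to a scratch infile first) reproduces the earlier result:

```shell
# Run the single-awk pipeline against the six-line sample from above.
printf '%s\n' '1,helloguys.ca' '2,byegirls.com' '3,hellohelloboys.ca' \
    '4,hellobyebyedad.com' '5,letswelcomewelcomeyou.org' \
    '6,letscomewelcomewelyou.org' > infile
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
# → 5,letswelcomewelcomeyou.org
# → 6,letscomewelcomewelyou.org
```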

Benchmarking

I've timed the two solutions on the top one million domains file mentioned in the comments:

  • First solution (with sort and two awk steps):

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    1m55.742s
    user    1m57.873s
    sys     0m0.045s
    
  • Second solution (just one awk step, no sort):

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    1m55.603s
    user    1m56.514s
    sys     0m0.045s
    
  • And the Perl solution by Casimir et Hippolyte:

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    0m5.249s
    user    0m5.234s
    sys     0m0.000s
    

What we learn from this: ask for a Perl solution next time ;)

Interestingly, if we know there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command in the first solution, and no tracking of multiple longest matches with awk in the second), the time saved is only a few seconds.
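One way to sketch that simplification (assuming a unique longest match; here the second awk step just prints the top line's second field) on the original five-line sample:

```shell
# Simplified variant assuming exactly one longest repeat; on the five-line
# sample, 'welcomewelcome' is the unique winner.
printf '%s\n' '1,helloguys.ca' '2,byegirls.com' '3,hellohelloboys.ca' \
    '4,hellobyebyedad.com' '5,letswelcomewelcomeyou.org' > infile
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 { print $2 }' | grep -f - infile
# → 5,letswelcomewelcomeyou.org
```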

Portability remark

Apparently, BSD grep can't do grep -f - to read patterns from stdin. In that case, the output of the pipeline up to that point has to be redirected to a temp file, and that temp file passed to grep -f.
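A sketch of that workaround (using mktemp for the scratch pattern file, with the five-line sample):

```shell
# Portable variant: write the patterns to a temp file instead of 'grep -f -'.
printf '%s\n' '1,helloguys.ca' '2,byegirls.com' '3,hellohelloboys.ca' \
    '4,hellobyebyedad.com' '5,letswelcomewelcomeyou.org' > infile
patterns=$(mktemp)
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1; print $2; next} $1==prev {print $2; next} {exit}' > "$patterns"
grep -f "$patterns" infile
rm -f "$patterns"
# → 5,letswelcomewelcomeyou.org
```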

Upvotes: 6

Casimir et Hippolyte

Reputation: 89565

A way with perl:

perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
    @m=sort { length $b <=> length $a} @m;
    $cl=length $m[0];
    if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file

The idea is to find, for each position in the second field, the longest repeated string (including overlapping ones); the match array is then sorted so that the longest substring becomes its first item ($m[0]).

Once done, the length of the current repeated substring ($cl) is compared with the stored length of the previous longest substring ($l). When the current substring is longer, the result array is overwritten with the current line; when the lengths are equal, the current line is pushed onto the result array.

details:

command line option:

-F, set the field separator to ,
-ane (e execute the following code, n read a line at a time and put its content in $_, a autosplit the line on the defined FS and put the fields in the @F array)

The pattern:

/
(?=         # open a lookahead assertion
    (.+)\1  # capture group 1 and backreference to the group 1
)           # close the lookahead
/g # all occurrences 

This is a well-known pattern to find all overlapping results in a string. The idea is to exploit the fact that a lookahead doesn't consume characters (a lookahead only means "check whether this subpattern follows at the current position"; it doesn't match any characters). To obtain the characters matched inside the lookahead, all you need is a capture group. Since the lookahead matches nothing, the pattern is tested at each position (and doesn't care whether the characters have already been captured in group 1 before).

Upvotes: 6
