cullan

Reputation: 280

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org

I'm trying to use awk regular expressions on Linux to find the line that contains the longest repeated1 word, so in this case it would return the line

5,letswelcomewelcomeyou.org

How do I do that?

1 Meaning "immediately repeated", i.e., abcabc, but not abcXabc.

Upvotes: 4

Views: 666

Answers (2)

Benjamin W.

Reputation: 52191

A pure awk implementation would be rather long-winded, since awk regexes don't support backreferences, and a backreference is exactly what simplifies this approach.
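For contrast, a quick check (assuming GNU grep, where `\1` in `-E` mode is an extension to POSIX ERE) shows what the backreference buys us:

```shell
# A backreference lets ERE say "the same sequence again, immediately";
# awk's regexes have no equivalent, so '(.+)\1' can't be ported directly.
printf 'abcabc\nabcXabc\n' | grep -E '(.+)\1'
# → abcabc   (abcXabc has no *immediately* repeated sequence)
```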

I've added one line to the example input file to cover the case of multiple longest words:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org

And this gets the lines with the longest repeated sequence:

cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile

Since this pipeline is fairly opaque, let's split up what it does and look at the output of each stage:

  1. Remove the first column with the line number to avoid matches on line numbers with repeated digits:

    $ cut -d ',' -f 2 infile
    helloguys.ca
    byegirls.com
    hellohelloboys.ca
    hellobyebyedad.com
    letswelcomewelcomeyou.org
    letscomewelcomewelyou.org
    
  2. Get all lines with a repeated sequence, extract just that repeated sequence:

    ... | grep -Eo '(.*)\1'
    ll
    hellohello
    ll
    byebye
    welcomewelcome
    comewelcomewel
    
  3. Get the length of each of those lines:

    ... | awk '{ print length(), $0 }'
    2 ll
    10 hellohello
    2 ll
    6 byebye
    14 welcomewelcome
    14 comewelcomewel
    
  4. Sort by the first column, numerically, descending:

    ...| sort -k 1,1 -nr
    14 welcomewelcome
    14 comewelcomewel
    10 hellohello
    6 byebye
    2 ll
    2 ll
    
  5. Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:

    ... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
    welcomewelcome
    comewelcomewel
    
  6. Pipe this into grep, using the -f - argument to read the patterns from stdin:

    ... | grep -f - infile
    5,letswelcomewelcomeyou.org
    6,letscomewelcomewelyou.org
    

Limitations

While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
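The limitation is easy to reproduce: grep -o reports the leftmost match, so once welwel is consumed, the longer repeat that overlaps it is never seen.

```shell
# Leftmost match wins: 'welwel' is found first, and the overlapping
# 'welcomewelcome' (starting at offset 3) is never considered.
printf 'welwelcomewelcome\n' | grep -Eo '(.*)\1'
# → welwel
```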

Alternative solution with more awk, less sort

As pointed out by tripleee in comments, this can be simplified by combining the two awk steps and the sort step into a single awk step, likely improving performance:

$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile

Let's look at that awk step in more detail, with expanded variable names for clarity:

{
    # New longest match: throw away stored longest matches, reset index
    if (length() > max_len) {
        max_len = length()
        delete arr_longest
        idx = 1
    }

    # Add line to longest matches
    if (length() >= max_len)
        arr_longest[idx++] = $0
}

# Print all the longest matches
END {
    for (idx in arr_longest)
        print arr_longest[idx]
}
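For reference, running the combined pipeline on the six-line sample (written to a scratch infile first) reproduces the earlier result:

```shell
# Run the single-awk pipeline against the six-line sample from above.
printf '%s\n' '1,helloguys.ca' '2,byegirls.com' '3,hellohelloboys.ca' \
    '4,hellobyebyedad.com' '5,letswelcomewelcomeyou.org' \
    '6,letscomewelcomewelyou.org' > infile
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
# → 5,letswelcomewelcomeyou.org
# → 6,letscomewelcomewelyou.org
```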

Benchmarking

I've timed the two solutions on the top one million domains file mentioned in the comments:

  • First solution (with sort and two awk steps):

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    1m55.742s
    user    1m57.873s
    sys     0m0.045s
    
  • Second solution (just one awk step, no sort):

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    1m55.603s
    user    1m56.514s
    sys     0m0.045s
    
  • And the Perl solution by Casimir et Hippolyte:

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    0m5.249s
    user    0m5.234s
    sys     0m0.000s
    

What we learn from this: ask for a Perl solution next time ;)

Interestingly, if we know there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command in the first solution, and no tracking of multiple longest matches with awk in the second), the time saved is only a few seconds.
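One way to sketch that simplification (assuming a unique longest match; here the second awk step just prints the top line's second field) on the original five-line sample:

```shell
# Simplified variant assuming exactly one longest repeat; on the five-line
# sample, 'welcomewelcome' is the unique winner.
printf '%s\n' '1,helloguys.ca' '2,byegirls.com' '3,hellohelloboys.ca' \
    '4,hellobyebyedad.com' '5,letswelcomewelcomeyou.org' > infile
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 { print $2 }' | grep -f - infile
# → 5,letswelcomewelcomeyou.org
```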

Portability remark

Apparently, BSD grep can't do grep -f - to read patterns from stdin. In that case, the output of the pipeline up to that point has to be redirected to a temp file, and that temp file passed to grep -f.
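A sketch of that workaround (using mktemp for the scratch pattern file, with the five-line sample):

```shell
# Portable variant: write the patterns to a temp file instead of 'grep -f -'.
printf '%s\n' '1,helloguys.ca' '2,byegirls.com' '3,hellohelloboys.ca' \
    '4,hellobyebyedad.com' '5,letswelcomewelcomeyou.org' > infile
patterns=$(mktemp)
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1; print $2; next} $1==prev {print $2; next} {exit}' > "$patterns"
grep -f "$patterns" infile
rm -f "$patterns"
# → 5,letswelcomewelcomeyou.org
```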

Upvotes: 6

Casimir et Hippolyte

Reputation: 89565

A way with perl:

perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
    @m=sort { length $b <=> length $a} @m;
    $cl=length $m[0];
    if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file

The idea is to find, for each position in the second field, the longest repeated string (including overlapping ones); the match array is then sorted so that the longest substring becomes its first item ($m[0]).

Once done, the length of the current repeated substring ($cl) is compared with the stored length of the previous longest substring ($l). When the current substring is longer, the result array is overwritten with the current line; when the lengths are equal, the current line is pushed onto the result array.

details:

command line option:

-F, set the field separator to ,
-ane (e execute the following code, n read a line at a time and put its content in $_, a autosplit the line on the defined FS and put the fields in the @F array)

The pattern:

/
(?=         # open a lookahead assertion
    (.+)\1  # capture group 1 and backreference to the group 1
)           # close the lookahead
/g # all occurrences 

This is a well-known pattern to find all overlapping results in a string. The idea is to exploit the fact that a lookahead doesn't consume characters (a lookahead only means "check whether this subpattern follows at the current position"; it doesn't match any characters). To obtain the characters matched inside the lookahead, all you need is a capture group. Since the lookahead matches nothing, the pattern is tested at each position (and doesn't care whether the characters have already been captured in group 1 before).

Upvotes: 6
