Reputation: 280
For example, let's say there is a file called domains.csv
with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use linux awk regex expressions to find the line that contains the longest repeated1 word, so in this case, it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
1 Meaning "immediately repeated", i.e., abcabc
, but not abcXabc
.
Upvotes: 4
Views: 666
Reputation: 52191
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I'ved added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number to avoid matches for lines numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f -
argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome
case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome
, where it only finds welwel
, but not welcomewelcome
.
Alternative solution with more awk, less sort
As pointed out by tripleee in comments, this can be simplified to skip the sort
step and combine the two awk steps and the sort
step into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort
and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort
):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1
instead of the second awk command for the first solution, or no keeping track of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f -
to read from stdin. In this case, the output of the pipe until there has to be redirected to a temp file, and this temp file then used with grep -f
.
Upvotes: 6
Reputation: 89565
A way with perl:
perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
@m=sort { length $b <=> length $a} @m;
$cl=length @m[0];
if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field, then the match array is sorted and the longest substring becomes the first item in the array (@m[0]
).
Once done, the length of the current repeated substring ($cl
) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
details:
command line option:
-F,
set the field separator to ,
-ane
(e
execute the following code, n
read a line at a time and puts its content in $_
, a
autosplit, using the defined FS, and puts fields in the @F
array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-know pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any character). To obtain the characters matched in the lookahead, all that you need is a capture group. Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care if the characters have been already captured in group 1 before).
Upvotes: 6