Adrian May

Reputation: 2182

Grep or the like: overlapping matches

For:

echo "the quick brown fox" | grep -Po '[a-z]+ [a-z]+'

I get:

the quick
brown fox

but I wanted:

the quick
quick brown
brown fox

How?

Upvotes: 4

Views: 853

Answers (3)

waldir

Reputation: 11

Simply reuse the original command twice to get the word pairs for the Markov chain:

echo "the quick brown fox" | grep -Po '[a-z]+ [a-z]+'
echo "the quick brown fox" | sed 's/^[a-z]* //' | grep -Po '[a-z]+ [a-z]+'

The second line (namely the sed step) removes the first word of the input, so the rest of the command generates the pairs that the first line missed. Note that the pairs come out grouped by pass rather than in input order.
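To see why two passes are enough for longer input too, here is a rough Python sketch of the same idea. The function name pairs_two_passes is made up for illustration, and it assumes lowercase words separated by single spaces, as in the question.

```python
import re

PAIR = re.compile(r"[a-z]+ [a-z]+")

def pairs_two_passes(text):
    # Pass 1: non-overlapping pairs starting at the first word.
    first = PAIR.findall(text)
    # Pass 2: drop the first word (what the sed step does), then match again.
    shifted = re.sub(r"^[a-z]* ", "", text)
    second = PAIR.findall(shifted)
    return first, second

first, second = pairs_two_passes("the quick brown fox jumps")
print(first)   # ['the quick', 'brown fox']
print(second)  # ['quick brown', 'fox jumps']
```

The first pass yields the pairs starting at words 1, 3, 5, ... and the second those starting at words 2, 4, 6, ..., so together they cover every adjacent pair.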

The same approach could also be generalized using sed's ability to run loops:

 echo pattern1pattern2 | sed ':start;s/\(pattern1\)\(pattern2\)/<\1|\2>\2/;t start' | grep -o '<[^>]*>' | tr -d '<>|'

This solution works for partially overlapping patterns, where the text matched by pattern2 may be overlapped by the next match. It assumes that <, > and | are reserved auxiliary characters that do not occur in the input. Furthermore, it assumes that the pattern1pattern2 regex cannot match any string that is matched by pattern2 alone.

The sed command substitutes pattern1pattern2 with <pattern1|pattern2>pattern2 and repeats this substitution as long as any matches are found (the branching t command allows matching previously substituted strings, unlike the g option). That is, every iteration leaves behind one <pattern1|pattern2> group marking a match, while a copy of pattern2 remains available to be matched as part of the next match. Finally, we pick out the groups using the original grep approach and strip the auxiliary marks.
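The loop can be hard to follow from the one-liner alone, so here is a small Python imitation of it, applied to the word-pair case with pattern1 = "[a-z]+ " and pattern2 = "[a-z]+". This is only a sketch of the same steps, not a replacement for the sed command: the helper name overlapping is hypothetical, and it assumes that neither pattern contains capturing groups and that <, > and | do not occur in the input.

```python
import re

def overlapping(text, pattern1, pattern2):
    # Hypothetical helper mirroring the sed loop.
    # Assumes '<', '>' and '|' never occur in text and that
    # pattern1/pattern2 contain no capturing groups of their own.
    rule = re.compile("(" + pattern1 + ")(" + pattern2 + ")")
    while True:
        # One substitution per iteration, like s/// followed by "t start".
        text, n = rule.subn(r"<\1|\2>\2", text, count=1)
        if n == 0:
            break
    # Same job as: grep -o '<[^>]*>' | tr -d '<>|'
    return [m.replace("|", "") for m in re.findall(r"<([^>]*)>", text)]

print(overlapping("the quick brown fox", r"[a-z]+ ", r"[a-z]+"))
# ['the quick', 'quick brown', 'brown fox']
```

After the first iteration the text is "<the |quick>quick brown fox". The marked group can no longer match, because "the " is now followed by | and "quick" by >, so the next iteration picks up "quick brown" from the copy left behind.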

Upvotes: 1

tso

Reputation: 4924

With awk:

 awk '{for(i=1;i<NF;i++) print $i,$(i+1)}' <<<"the quick brown fox"

Update: with Python:

#!/usr/bin/python3.5
import re

s = "the quick brown fox"
# Capture each pair inside a zero-width lookahead
matches = re.finditer(r'(?=(\b[a-z]+\b \b[a-z]+\b))', s)
ans = [m.group(1) for m in matches]
print(ans)  # optional: show the whole list
for pair in ans:
    print(pair)

output:

['the quick', 'quick brown', 'brown fox']
the quick
quick brown
brown fox
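The lookahead is what makes the overlap possible: it matches without consuming any characters, so the scan moves on by one position after each match instead of jumping past the whole pair. A small comparison, as a sketch (the function word_pairs and its overlap parameter are made up for this example):

```python
import re

def word_pairs(text, overlap=True):
    pair = r"\b[a-z]+ [a-z]+\b"
    # With overlap, wrap the pair in a zero-width lookahead and capture it;
    # otherwise match it normally, which consumes both words.
    pattern = "(?=(" + pair + "))" if overlap else "(" + pair + ")"
    return re.findall(pattern, text)

s = "the quick brown fox"
print(word_pairs(s, overlap=False))  # ['the quick', 'brown fox']
print(word_pairs(s, overlap=True))   # ['the quick', 'quick brown', 'brown fox']
```

The leading \b matters here: without it the lookahead would also succeed in the middle of a word and produce fragments such as "he quick".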

Upvotes: 2

Claes Wikner

Reputation: 1517

Another awk, for input with exactly four words:

awk '{print $1,$2 RS $2,$3 RS $3,$4}' <<<"the quick brown fox"

    the quick
    quick brown
    brown fox

Upvotes: 0
