Reputation: 35
I want to extract every word that comes after a pattern. However, I can only extract the word when it is on the same line as the pattern; if the word comes right after a line break, I'm not able to get it. For example,
Gary is a college student.
Steve and John are college
teachers.
I want to extract "student" and "teachers", but I only got "student" back. My solution is
grep -oP '(?<=college )[\w+]*' | sort | uniq
Upvotes: 0
Views: 333
Reputation: 52334
Tools like grep are fundamentally line oriented. GNU grep has a -z option to use NUL (zero) bytes as delimiters instead of newlines, though, which will let you treat the input file as a single big 'line':
$ grep -Pzo 'college\s+\K\w+' input.txt | tr '\0' '\n'
student
teachers
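To also get the unique list the question asks for, the same pipeline can be wrapped in a small function and finished with sort -u. This is only a sketch: the function name unique_after_college is made up for illustration, and it assumes GNU grep built with PCRE support (-P and -z are not POSIX).

```shell
# Hypothetical helper: print each distinct word that follows "college",
# even when a line break separates the two.
# Assumes GNU grep with PCRE support (-P); -z is a GNU extension.
unique_after_college() {
  # -z: treat the whole input as one NUL-terminated record
  # \s+ can then match the newline; \K drops "college" from the match
  grep -Pzo 'college\s+\K\w+' "$@" | tr '\0' '\n' | sort -u
}
```

It reads the files named as arguments, or standard input when none are given, e.g. unique_after_college input.txt.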
Upvotes: 1
Reputation: 189317
grep (or really, generally, most Unix text processing tools) examines a single line at a time, and can't straddle a match across line boundaries. A simple Awk script might work instead:
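awk 'n { print $1; n=0 }      # previous line ended in "college"
{ for(i=1; i<NF; ++i)         # "college" followed by a word on this line
    if ($i=="college") print $(i+1) }
$NF=="college" { n=1 }        # remember to print the next first word
' file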
You can easily refactor this to count the number of hits in Awk, too, and avoid the pipe to sort | uniq (or, better, sort -u), but I left that as an exercise. Learning enough Awk to write simple scripts like this yourself is time well spent.
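One possible way to approach that exercise is sketched below, assuming the goal is to print each following word only once. The function names words_after_college and emit are hypothetical, and stripping trailing punctuation (so that "student." becomes "student") is an extra assumption that goes beyond the script above.

```shell
# Hypothetical helper: print each distinct word that follows "college",
# de-duplicating inside Awk instead of piping to sort -u.
words_after_college() {
  awk '
    # Print a word the first time it is seen, minus trailing punctuation.
    function emit(w) {
      gsub(/[^A-Za-z0-9_]+$/, "", w)
      if (w != "" && !seen[w]++) print w
    }
    n && NF { emit($1); n = 0 }   # previous line ended in "college"
    { for (i = 1; i < NF; ++i) if ($i == "college") emit($(i + 1)) }
    $NF == "college" { n = 1 }    # the word is on a later line
  ' "$@"
}
```

Unlike sort -u, the seen array keeps words in the order they first appear rather than sorting them.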
Upvotes: 0