Reputation: 21
I wish to match a pattern across multiple lines in a shell script. My input looks like this:
START <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
ID: n1 <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
END
START <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
ID: n2 <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
END
I am trying to use a regex to display the output for a specific ID only (e.g. n1 or n2). I tried the regex START(.|\n)*ID: n1(.|\n)*END but it fetches the data of ID: n2 as well. What changes should I make to the regex in order to get the data of only the specific ID?
I am using cat inputfile | grep 'pattern' > outputfile as the command.
The number of lines in each block, as well as the number of lines between START and ID: n1 and between ID: n1 and END, can vary, so using head/tail is not a viable option. Also, I would like to print the whole block from START to END when the ID is matched.
EDIT: I tried using an online regex creator, and it could successfully match the regex START[\s\S][^END]*ID: n1[\s\S][^END]*END on my input file.
Upvotes: 1
Views: 1053
Reputation: 437062
A GNU awk or Mawk solution that can handle any number of lines, including empty ones, between paired START and END occurrences:
awk -v id='n2' -v RS='(^|\n)START |\nEND' '
$0 ~ ("\nID: " id " ") { print "START " $0 "\nEND" }
' file
This solution uses a multi-character RS value (that is also a regex), which is not supported in the POSIX spec. However, both GNU awk and Mawk (the default awk on Ubuntu) support such values, whereas BSD/macOS awk does not.
-v id='n2' passes the ID value n2 to Awk as variable id.
RS='(^|\n)START |\nEND' breaks the input into records, each consisting of the (line-spanning) text between a START token at the start of the input or of a line and an END token that follows a newline.
$0 ~ ("\nID: " id " ") matches each input record ($0) against a regex (~) that matches the specified ID: a newline followed by ID: and a space, then the ID value of interest (stored in variable id), then another space.
Note how string concatenation in Awk works by simply placing strings / variable references next to each other.
In case of a match, print "START " $0 "\nEND" prints the input record at hand, bookended by the START and END tokens (which, being the input record separators, are not included in $0).
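To see the record splitting in action, here is a minimal sketch that runs the same command against a small sample. The temporary file and its contents are made up for illustration, and the example assumes that awk on your system is GNU awk or Mawk (it will not work with an awk that lacks regex support in RS):

```shell
#!/bin/sh
# Hypothetical sample data; note the empty line inside the first block.
sample=$(mktemp)
cat > "$sample" <<'EOF'
START first block
some data

more data after an empty line
ID: n1 payload
trailing data
END
START second block
ID: n2 payload
END
EOF

# Same command as above, this time asking for ID n1.
# Assumes an awk that accepts a regex as RS (GNU awk or Mawk).
result=$(awk -v id='n1' -v RS='(^|\n)START |\nEND' '
  $0 ~ ("\nID: " id " ") { print "START " $0 "\nEND" }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

This should print only the first block, from its START line through END, with the empty line inside the block preserved.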
If the lines between paired START and END occurrences are all nonempty (i.e., contain at least 1 character, even if that character is a space or tab) and the blocks themselves are separated from each other by at least one empty line, here's a POSIX-compliant awk solution:
awk -v id='n2' -v RS= '$0 ~ ("\nID: " id " ")' file
Note that -v RS=, i.e., setting the input record separator (RS) to the empty string, is an awk idiom that breaks the input into records by paragraphs (runs of nonempty lines separated by one or more empty lines).
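As a quick check of the paragraph-mode variant, here is a small sketch. The sample file is hypothetical and assumes that an empty line separates the blocks and that no block contains an empty line:

```shell
#!/bin/sh
# Hypothetical sample data: blocks separated by one empty line,
# with no empty lines inside a block.
sample=$(mktemp)
cat > "$sample" <<'EOF'
START first block
ID: n1 payload
END

START second block
data
ID: n2 payload
END
EOF

# RS set to the empty string puts awk into paragraph mode,
# so each START..END block arrives as one record.
para=$(awk -v id='n2' -v RS= '$0 ~ ("\nID: " id " ")' "$sample")

printf '%s\n' "$para"
rm -f "$sample"
```

Only the second block should be printed. If the blocks in your file are not separated by empty lines, the whole file becomes a single record and this approach prints everything.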
Upvotes: 1
Reputation: 103704
In awk you can accumulate the text between your starting pattern and ending pattern and then test that buffer for your match:
cat inputfile | awk '/^START/ { buf=$0 "\n"; flag=1; next }
flag { buf=buf $0 "\n" }
/^END/ && flag { flag=0; if (buf ~ /ID: n1 |ID: n2 /) print buf }'
In Perl you can do:
cat inputfile | perl -0777 -lne 'while (/(^START.*?^ID: (n\d+) .*?^END)/gms){
if ($2 eq "n1" || $2 eq "n2"){
print "$1\n\n";
}
}'
In either case, you may want to pass the file name as an argument, as in awk '{script}' inputfile or perl '{script}' inputfile, rather than piping it in with cat.
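For example, the awk buffering approach might look like this without cat and with the ID passed in as a variable. This is only a sketch: the variable names id and inblock and the sample file are my own choices, not anything your data requires.

```shell
#!/bin/sh
# Hypothetical sample data; blocks follow each other directly.
blocks=$(mktemp)
cat > "$blocks" <<'EOF'
START first block
data
ID: n1 payload
END
START second block
ID: n2 payload

data after an empty line
END
EOF

picked=$(awk -v id='n2' '
  /^START/ { buf = $0; inblock = 1; next }   # a START line opens a new buffer
  inblock  { buf = buf "\n" $0 }             # collect every later line, END included
  /^END/ && inblock {                        # block complete: test it, then reset
    inblock = 0
    if (buf ~ ("\nID: " id " ")) print buf
  }
' "$blocks")

printf '%s\n' "$picked"
rm -f "$blocks"
```

Because the buffer is built line by line with the default record separator, this version does not rely on a regex RS and should behave the same in any POSIX awk.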
Upvotes: 0
Reputation: 41987
awk in paragraph mode, using two successive newlines as the record separator:
awk -v RS='\n\n' '/ID: n1/' file.txt
Replace n1 with n2, n3, ... for other IDs.
Example:
$ cat file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n1 <some data including white spaces>
<some data including white spaces>
END

START <some data including white spaces>
<some data including white spaces>
ID: n2 <some data including white spaces>
<some data including white spaces>
END

START <some data including white spaces>
<some data including white spaces>
ID: n3 <some data including white spaces>
<some data including white spaces>
END
$ awk -v RS='\n\n' '/ID: n1/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n1 <some data including white spaces>
<some data including white spaces>
END
$ awk -v RS='\n\n' '/ID: n2/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n2 <some data including white spaces>
<some data including white spaces>
END
$ awk -v RS='\n\n' '/ID: n3/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n3 <some data including white spaces>
<some data including white spaces>
END
Upvotes: 1