Chintan Parikh

Reputation: 21

Match expression across multiple lines in shell script

I want to match a pattern across multiple lines in a shell script. My input looks like this:

START <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
ID: n1 <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
END

START <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
ID: n2 <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
END

I am trying to display the output for a specific ID only (e.g., n1 or n2) using a regex. I tried the regex START(.|\n)*ID: n1(.|\n)*END, but it fetches the data for ID: n2 as well. What changes should I make to the regex in order to get the data for only the specified ID?

I am using cat inputfile | grep 'pattern' > outputfile as the command.

The number of lines in each block, as well as the number of lines between START and ID: n1 and between ID: n1 and END, is variable, so using head/tail is not a viable option. Also, I would like to print the whole block from START to END when the ID is matched.

EDIT: I tried an online regex creator, and it successfully matched the regex

START[\s\S][^END]*ID: n1[\s\S][^END]*END

on my input file.

Upvotes: 1

Views: 1053

Answers (3)

mklement0

Reputation: 437062

A GNU awk or Mawk solution that can handle any number of lines, including empty ones, between paired START and END occurrences:

awk -v id='n2' -v RS='(^|\n)START |\nEND' '
  $0 ~ ("\nID: " id " ") { print "START " $0 "\nEND" }
' file

This solution uses a multi-character RS value (that is also a regex), which is not supported in the POSIX spec. Both GNU awk and Mawk (the default awk on Ubuntu) support such values, however, whereas BSD/macOS awk does not.

  • -v id='n2' passes ID value n2 as variable id to Awk.

  • RS='(^|\n)START |\nEND' breaks the input into records, each being the (line-spanning) text between the token START (followed by a space) at the start of the input or of a line, and the token END after a newline.

  • $0 ~ ("\nID: " id " ") matches each input record ($0) against a regex (~) that matches the specified ID: a newline followed by ID: , followed by the ID value of interest (stored in variable id) and a space.
    Note how string concatenation in Awk works by simply placing strings / variable references next to each other.

  • In case of a match, print "START " $0 "\nEND" prints the input record at hand, bookended by the START and END tokens (which, being the input record separators, are not included in $0).
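
To try the command without touching real data, here is a minimal sketch that runs it against a made-up sample file. The sample contents and the variable names sample and out are only assumptions for illustration, and it still needs GNU awk or Mawk because of the regex RS.

```shell
#!/bin/sh
# Hypothetical sample input; the n2 block deliberately contains an empty line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
START alpha
ID: n1 first
END

START beta

ID: n2 second
line b
END
EOF

# Same command as in the answer. The multi-character (regex) RS is not POSIX,
# so this assumes GNU awk or Mawk.
out=$(awk -v id='n2' -v RS='(^|\n)START |\nEND' '
  $0 ~ ("\nID: " id " ") { print "START " $0 "\nEND" }
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

With GNU awk this should print the complete n2 block from START beta to END, including the empty line inside it, and nothing from the n1 block.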


If the lines between paired START and END occurrences are all nonempty (i.e., contain at least 1 char., even if that char. is a space or tab), here's a POSIX-compliant awk solution:

awk -v id='n2' -v RS= '$0 ~ ("\nID: " id " ")' file

Note that -v RS=, i.e., setting the input record separator (RS) to the empty string, is an awk idiom that breaks the input into records by paragraphs (runs of nonempty lines).
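
As a rough illustration of both the idiom and its limitation, the sketch below runs the paragraph-mode command twice: once on a block with only nonempty lines, and once on a block that has an empty line before the ID line. The sample data and the names sample, want, out_ok and out_split are made up for this example.

```shell
#!/bin/sh
sample=$(mktemp)
want=n2   # hypothetical ID to select

# Case 1: every line inside the blocks is nonempty.
cat > "$sample" <<'EOF'
START alpha
ID: n1 first
END

START beta
more data
ID: n2 second
END
EOF
out_ok=$(awk -v id="$want" -v RS= '$0 ~ ("\nID: " id " ")' "$sample")

# Case 2: an empty line inside the block splits it into two records,
# so no single record contains a newline followed by the ID line.
cat > "$sample" <<'EOF'
START beta

ID: n2 second
END
EOF
out_split=$(awk -v id="$want" -v RS= '$0 ~ ("\nID: " id " ")' "$sample")

printf 'case 1:\n%s\n' "$out_ok"
printf 'case 2 (expected to be empty):\n%s\n' "$out_split"
rm -f "$sample"
```

In case 1 the whole n2 block is printed. In case 2 nothing is printed, which is why this variant only suits input without empty lines inside a block.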

Upvotes: 1

dawg

Reputation: 103704

In awk you can accumulate the text between your starting pattern and ending pattern and then test that buffer for your match:

cat inputfile | awk  '/^START/        { buf=$0 "\n"; flag=1; next } 
                      flag            { buf=buf $0 "\n" } 
                      /^END/ && flag  { flag=0; if (buf ~ /ID: n1 |ID: n2 /) print buf }'

In Perl you can do:

cat inputfile | perl -0777 -lne 'while (/(^START.*?^ID: (n\d+) .*?^END)/gms){
    if ($2 eq "n1" || $2 eq "n2"){
        print "$1\n\n";
    }
}'

In either case, you may want to pass the file as the last argument (awk 'script' inputfile or perl -0777 -lne 'script' inputfile) rather than piping it through cat.
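
For a single ID, the buffering approach could be adapted roughly as follows. This is a sketch rather than the exact command above: the ID is passed in with -v, the file is read directly instead of through cat, and the sample data and the names sample, inblock and out are assumptions for illustration. It uses only POSIX awk features.

```shell
#!/bin/sh
# Hypothetical sample input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
START alpha
ID: n1 first
END

START beta
ID: n2 second
END
EOF

out=$(awk -v id='n1' '
  # A line starting with START opens a new block and resets the buffer.
  /^START/ { buf = $0; inblock = 1; next }
  # Inside a block, append every line (including the END line) to the buffer.
  inblock  { buf = buf "\n" $0 }
  # At END, close the block and print it only if it holds the wanted ID line.
  /^END/ && inblock {
      inblock = 0
      if (buf ~ ("\nID: " id " ")) print buf
  }
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Because the test is anchored on a newline before ID: and a space after the ID value, n1 does not accidentally match n10. Note that /^END/ also matches any line that merely begins with END, so this assumes such lines do not occur inside a block.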

Upvotes: 0

heemayl

Reputation: 41987

awk with two successive newlines as the record separator, which is similar to paragraph mode (a multi-character RS needs GNU awk or mawk):

awk -v RS='\n\n' '/ID: n1/' file.txt

Replace n1 with n2, n3, ... for the other IDs. Note that /ID: n1/ also matches ID: n10; add a trailing space (/ID: n1 /) to avoid that.

Example:

$ cat file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n1 <some data including white spaces>
<some data including white spaces>
END

START <some data including white spaces>
<some data including white spaces>
ID: n2 <some data including white spaces>
<some data including white spaces>
END

START <some data including white spaces>
<some data including white spaces>
ID: n3 <some data including white spaces>
<some data including white spaces>
END


$ awk -v RS='\n\n' '/ID: n1/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n1 <some data including white spaces>
<some data including white spaces>
END


$ awk -v RS='\n\n' '/ID: n2/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n2 <some data including white spaces>
<some data including white spaces>
END


$ awk -v RS='\n\n' '/ID: n3/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n3 <some data including white spaces>
<some data including white spaces>
END

Upvotes: 1
