Gourav

Reputation: 1265

Retrieve matched regex record-separator using GNU AWK

Using AWK, I am processing a text file by splitting it into multiple records. As the record separator RS I use a regular expression. Is there a way to obtain the text that was actually matched as the record separator, given that RS only holds the regex string?

Example:

BEGIN { RS="a[0-9]*. "; ORS="\n-----\n"}
  /foo/ {print $0 RS;}
END {}

input file:

a1. Hello
this
is foo
a2. hello
this
is bar
a3. Hello
this
is foo

output:

Hello
this
is foo
a[0-9]*.
-----
Hello
this
is foo
a[0-9]*.
-----

As you see, the output is printing RS as a string representing the regular expression, but not printing the actual value. How can I retrieve the actual matched value of the record separator?

expected output:

Hello
this
is foo
a1
-----
Hello
this
is foo
a3
-----

Upvotes: 2

Views: 1211

Answers (3)

kvantour

Reputation: 26501

In POSIX-compliant AWK, the record separator RS is only a single character, so it is easy to print it back:

awk 'BEGIN{RS="a"}{print $0 RS}'

GNU AWK, on the other hand, does not limit RS to a one-character string but allows it to be any regular expression. In this case, the approach above no longer works, because RS holds the regular expression and not the text it matched.

To this end, GNU AWK introduced the variable RT which represents nothing more than the found record separator. When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.

So naively, one could update your AWK program as:

BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 RT}

Unfortunately, RT is set to the separator found after the current record, and it seems the OP wants the one found before the current record. Hence you can introduce a new variable pRT, which can be read as the previous record separator found.

BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT}

As Shaki Siegal pointed out in the comments, you still have to update pRT to remove the trailing dot and space:

BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT;sub(/[.] $/,"",pRT)}

Note: the original RS of the OP (RS="a[0-9]*. ") has been changed to RS="a[0-9]+[.] " for more precise matching. This ensures that at least one digit follows the a and that the digits are followed by a literal dot.

If, as the original example indicates, the record separator always appears at the beginning of the line, RS should be slightly modified into RS="(^|\n)a[0-9]+[.] ". Dito comment also made various excellent points. So if the string a[0-9]+. always appears at the beginning of a line, you need to process a bit more:

BEGIN {
   RS ="(^|\n)a[0-9]+[.] ";
   ORS="\n-----\n"
}
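/foo/ {
   # Last record: it already ends with a newline, so drop the one leading pRT.
   if (RT ~ /^$/ && NR != 2) pRT = substr(pRT,2)
   # The first separator matched ^ and has no leading newline; add one unless
   # this record is also the last one (which ends with a newline itself).
   if (RT !~ /^$/ && NR == 2) pRT = "\n" pRT
   print $0 pRT
}
{pRT=RT;sub(/[.] $/,"",pRT)}

Here, we added corrections for the first and the last record.

  • If there are more than two AWK records (the first record is always empty), you need to remove the leading newline from pRT when printing the last record; otherwise you get an extra newline, because the last record ends with a newline (in contrast to all others).
  • If there are only two AWK records (one effective record in the text), you should not apply this correction, as the first RT does not start with a newline.
  • If the first effective record is not the last one, neither the record nor pRT supplies a newline, so one has to be prepended to pRT.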

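The final improvement is done by realising that the newline between a record and its identifier can be printed explicitly. We then always remove the initial newline from pRT if it is there (merged with the other substitution into a single gsub), and we remove the trailing newline that only the last record carries:

BEGIN {
   RS ="(^|\n)a[0-9]+[.] ";
   ORS="\n-----\n"
}
# RS swallows the newline before each separator, so print it explicitly.
/foo/ { sub(/\n$/,""); print $0 "\n" pRT }
{pRT=RT;gsub(/^\n|[.] $/,"",pRT)}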

RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.

The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.

ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.

RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.

source: GNU AWK manual

Upvotes: 5

Ed Morton

Reputation: 204164

With GNU awk, which you're already using for multi-char RS, the builtin variable that contains the string that matched the RS regexp is RT.

We need to fix your RS setting though. You need a regexp for RS that matches a<integer><dot><blank> at the start of a line ((^|\n)a[0-9]+[.] ) or a newline on its own at the end of the file (\n$), so that the last record in the file is parsed the same as all the rest; below is how to write that. Note that RT will start with a newline for all except the very first match in the file, so we need to strip that leading newline from RT to get the actual identifier we want to print for each record:

$ cat tst.awk
BEGIN {
    RS  = "(^|\n)a[0-9]+[.] |\n$"
    ORS = "\n-----\n"
}
/foo/ { print $0 "\n" id }
{ id = gensub(/^\n|[.] /,"","g",RT) }

Here's what it does given this input which includes more rainy-day cases than are present in the question (you should test other proposed solutions against this):

input:

$ cat file
a1. Hello
this
is foo bat man

a2. hello
this
is bar
a3. Hello
this is a7. just fine
is foo

output:

$ awk -f tst.awk file
Hello
this
is foo bat man

a1
-----
Hello
this is a7. just fine
is foo
a3
-----

Upvotes: 2

potong

Reputation: 58478

This might work for you (GNU sed):

sed -rn '/^a[0-9]+\.\s/{:a;x;/foo/{s/^(a[0-9]+\.)\s*(.*)/\2\n\1\n-----/p;$d};x;h;b};H;$ba' file

Gather up lines that begin with an. where n is an integer. If the gathered line(s) contain the word foo, make the required substitution and print the result; otherwise do nothing.

Apology: When I began the solution the question was tagged sed.

When a line beginning with an. is encountered, this line replaces whatever was in the hold space. Before it does, however, the hold space is checked: if it contains the word foo, i.e. a collection already exists and meets the requirements, the lines are formatted as required and printed. Other lines are appended to the hold space. End-of-file is a special case that must be treated the same way as a line beginning with an.; this is handled by the goto label :a.
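
The same script, spread over several lines with comments, may be easier to follow. This is a sketch assuming GNU sed (for -r and \s); the sample input is made up for illustration. Note that the identifier is printed with its trailing dot:

```shell
# Requires GNU sed (-r and \s are GNU extensions). Sample input is illustrative.
out=$(printf 'a1. Hello\nis foo\na2. bar\na3. Hi\nis foo\n' | sed -rn '
/^a[0-9]+\.\s/{
:a
  # Swap: the pattern space now holds the previously gathered record.
  x
  /foo/{
    # Move the identifier behind the text, add the divider, then print.
    s/^(a[0-9]+\.)\s*(.*)/\2\n\1\n-----/p
    # On the last line nothing remains to gather, so stop here.
    $d
  }
  # Swap back and start a new collection with the current header line.
  x
  h
  b
}
# Any other line is appended to the collection in the hold space.
H
# At end of file, flush the last collection through the same code path.
$ba
')
printf '%s\n' "$out"
```

With this input, the records introduced by a1. and a3. are printed and the a2. record is skipped, since it does not contain foo.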

Upvotes: 2
