Reputation: 1265
Using AWK, I am processing a text file by splitting it into multiple records. I use a regular expression as the record separator RS. Is there a way to obtain the separator text that was actually matched, given that RS only holds the regex string?
Example:
BEGIN { RS="a[0-9]*. "; ORS="\n-----\n"}
/foo/ {print $0 RS;}
END {}
input file:
a1. Hello
this
is foo
a2. hello
this
is bar
a3. Hello
this
is foo
output:
Hello
this
is foo
a[0-9]*.
-----
Hello
this
is foo
a[0-9]*.
-----
As you can see, the output prints RS as the string representing the regular expression, not the text that actually matched.
How can I retrieve the actual matched value of the record separator?
expected output:
Hello
this
is foo
a1
-----
Hello
this
is foo
a3
-----
Upvotes: 2
Views: 1211
Reputation: 26501
In POSIX-compliant AWK, the record separator RS is only a single character, so it is easy to print it back, for example:
awk 'BEGIN{RS="a"}{print $0 RS}'
GNU AWK, on the other hand, does not limit RS to a one-character string but allows it to be any regular expression. In that case the approach above no longer works, because RS holds the regular expression and not the matched text.
To this end, GNU AWK introduced the variable RT, which holds nothing more than the record separator that was actually found. When RS is a single character, RT contains that same character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.
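A quick way to see this behaviour is the sketch below. It assumes gawk is installed and on the PATH (RT is a gawk extension); the sample input, the [ab] separator pattern and the helper name show_rt are made up for illustration.

```shell
# Illustration only: requires GNU awk (gawk), since RT is a gawk extension.
# The sample input and the separator pattern are invented for this demo.
show_rt() {
    printf 'xa1. yb22. z' |
    gawk 'BEGIN { RS = "[ab][0-9]+[.] " }
          { printf "[%s|%s]\n", $0, RT }'   # the record, then the text that matched RS
}

if command -v gawk >/dev/null 2>&1; then
    show_rt
else
    echo "gawk not found; skipping RT demo" >&2
fi
```

With gawk available this prints [x|a1. ], [y|b22. ] and [z|] on separate lines: each record is followed by the separator text that ended it, and the last record has an empty RT because the input ended without a match.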
So naively, one could update your AWK program as:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 RT}
Unfortunately, RT is set to the separator found after the current record, and it seems the OP wants the one before the current record. You can therefore introduce a new variable pRT, which can be read as "previous record separator found":
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT}
As Shaki Siegal pointed out in the comments, you still have to update pRT to remove the trailing dot and space:
BEGIN{RS="a[0-9]+[.] "; ORS="\n-----\n"}
/foo/{print $0 pRT}{pRT=RT;sub(/[.] $/,"",pRT)}
Note: the OP's original RS (RS="a[0-9]*. ") has been changed to RS="a[0-9]+[.] " for stricter matching. This ensures that at least one digit follows the a, and that the digits are followed by a literal dot.
If, as the original example indicates, the record separator always appears at the beginning of a line, RS should be modified slightly to RS="(^|\n)a[0-9]+[.] ".
The same comment also made various excellent points. So if the string a[0-9]+. always appears at the beginning of a line, you need to do a bit more processing:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ {
if (RT ~ /^$/ && NR != 2) pRT = substr(pRT,2)
print $0 pRT
}
{pRT=RT;sub(/[.] $/,"",pRT)}
Here, we added a correction for the last record: without it, pRT would contribute an extra newline, because the last record ends with a newline (in contrast to all others). The NR != 2 test covers the very first RT, which does not start with a newline.
The final improvement comes from realising that we always remove the leading newline in pRT if it is there, so we can merge everything into a single gsub:
BEGIN {
RS ="(^|\n)a[0-9]+[.] ";
ORS="\n-----\n"
}
/foo/ { print $0 pRT }
{pRT=RT;gsub(/^\n|[.] $/,"",pRT)}
RS
: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text. The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS
: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT
: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual
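As a small illustration of the ORS entry, the sketch below sets ORS to the same value used in the programs above and prints two lines. Any awk should do here, since ORS is standard; the sample input and the helper name show_ors are invented for the demo.

```shell
# Illustration only: ORS is standard awk, so no gawk-specific features are needed.
# The two-line sample input is invented for this demo.
show_ors() {
    printf 'one\ntwo\n' |
    awk 'BEGIN { ORS = "\n-----\n" } { print }'   # ORS is written after every print
}

show_ors
```

Each input line comes out followed by a newline and a line of five dashes, which is why the programs above do not need to print the separator line themselves.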
Upvotes: 5
Reputation: 204164
With GNU awk, which you're already using for multi-char RS, the builtin variable that contains the string that matched the RS regexp is RT.
We need to fix your RS setting though, because you need a regexp for RS that matches a<integer><dot><blank> at the start of a line ((^|\n)a[0-9]+[.] ) or a newline on its own at the end of the file (\n$), so that the last record in the file is parsed the same as all the rest. Below is how to write that. Note that RT will start with a newline for all except the very first match in the file, so we need to strip that leading newline from RT to get the actual identifier we want to print for each record:
$ cat tst.awk
BEGIN {
RS = "(^|\n)a[0-9]+[.] |\n$"
ORS = "\n-----\n"
}
/foo/ { print $0 "\n" id }
{ id = gensub(/^\n|[.] /,"","g",RT) }
Here's what it does given this input which includes more rainy-day cases than are present in the question (you should test other proposed solutions against this):
input:
$ cat file
a1. Hello
this
is foo bat man
a2. hello
this
is bar
a3. Hello
this is a7. just fine
is foo
output:
$ awk -f tst.awk file
Hello
this
is foo bat man
a1
-----
Hello
this is a7. just fine
is foo
a3
-----
Upvotes: 2
Reputation: 58478
This might work for you (GNU sed):
sed -rn '/^a[0-9]+\.\s/{:a;x;/foo/{s/^(a[0-9]+\.)\s*(.*)/\2\n\1\n-----/p;$d};x;h;b};H;$ba' file
Gather up lines, starting from a line that begins with an. (where n is an integer). If the gathered lines contain the word foo, make the required substitution and print the result; otherwise do nothing.
Apology: When I began the solution the question was tagged sed.
When a line beginning with an. is encountered, this line replaces whatever was in the hold space. Before it does, however, the hold space is checked first: if it contains the word foo, i.e. a collection already exists and meets the requirements, the lines are formatted as required and printed. Other lines are appended to the hold space. The end of the file is a special case that has to be treated the same way as a line beginning with an.; this is allowed for by the goto label :a.
Upvotes: 2