fedorqui

Reputation: 290105

How come the POSIX mode of GNU Awk does not consider a new line a field, when setting the RS to another thing?

I was going through the GNU Awk User's Guide and found this in the 4.1.1 Record Splitting with Standard awk section:

When using regular characters as the record separator, there is one unusual case that occurs when gawk is being fully POSIX-compliant (see section Command-Line Options). Then, the following (extreme) pipeline prints a surprising ‘1’:

$ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
-| 1

There is one field, consisting of a newline. The value of the built-in variable NF is the number of fields in the current record. (In the normal case, gawk treats the newline as whitespace, printing ‘0’ as the result. Most other versions of awk also act this way.)

I tried it, but it does not work for me on GNU Awk 5.0.0:

$ gawk --version
GNU Awk 5.0.0, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
$ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
0

That is, the behaviour is exactly the same as without the POSIX mode:

$ echo | gawk 'BEGIN { RS = "a" } ; { print NF }'
0

I understand the point being made: when the record separator is not the default (that is, not a newline), a record consisting of just a newline should count as one field. However, I cannot reproduce it.

How can I reproduce the example? I also tried gawk --traditional and gawk -P, but the result was always 0.

Since the GNU Awk User's guide I was checking is for the 5.1 version and I have the 5.0.0, I also checked an archived version for 5.0.0 and it shows the same lines, so it is not something that changed between 5.0 and 5.1.

Upvotes: 2

Views: 395

Answers (1)

kvantour

Reputation: 26501

The POSIX standard says:

The awk utility shall interpret each input record as a sequence of fields where, by default, a field is a string of non-<blank> non-<newline> characters. This default <blank> and <newline> field delimiter can be changed by using the FS built-in variable

If FS is <space>, skip leading and trailing <blank> and <newline> characters; fields shall be delimited by sets of one or more <blank> or <newline> characters.

source: POSIX awk standard: IEEE Std 1003.1-2017

That said, the proper behaviour should be the following:

$ echo | awk 'BEGIN{RS="a"}{print NR,NF,length}'
1 0 1
  • a single record: no <a>-character has been encountered
  • no fields: FS is the default space, so all leading and trailing <blank> and <newline> characters are skipped
  • length one: there is only a single character in the record.

When FS is set to anything other than a single space, the newline is no longer skipped and the story is completely different:

$ echo | awk 'BEGIN{FS="b";RS="a"}{print NR,NF,length}'
1 1 1
$ echo | awk 'BEGIN{FS="\n";RS="a"}{print NR,NF,length}'
1 2 1
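The three cases above can be collected into one small sketch. The helper name count_fields and its optional argument are hypothetical, introduced only for illustration; the awk program is the one used in this answer, with the field separator passed in through -v.

```shell
# count_fields [FS]: hypothetical helper, not part of awk.
# Feeds a single newline to awk with RS="a" and prints NR, NF and the
# record length. When no argument is given, FS stays a single space,
# which is awk's default.
count_fields() {
    printf '\n' | awk -v FS="${1- }" 'BEGIN { RS = "a" } { print NR, NF, length($0) }'
}

count_fields         # default FS: the newline is skipped as whitespace
count_fields b       # FS="b": the whole record is one field
count_fields '\n'    # FS="\n": the newline separates two empty fields
```

With a POSIX-conforming awk, the three calls should print the same lines as shown above: 1 0 1, 1 1 1 and 1 2 1.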

In conclusion: I believe the GNU awk documentation is wrong.

Upvotes: 2
