sdaau

Reputation: 38701

regex split string and keep delimiters in awk

As far as I can see, if I want to split a string with a regex and keep the delimiters in Perl, JavaScript or PHP, I should use a capturing group (parentheses) in the regex; e.g. in Perl (where I want to split at a single digit followed by a right parenthesis):

$ echo -e "123.123   1)  234.234\n345.345   0)  456.456" \
| perl -ne 'print join("--", split(/(\d\))/,$_));'
123.123   --1)--  234.234
345.345   --0)--  456.456

I'm trying the same trick in awk, but it doesn't look like it works (as in, the delimiters are still "eaten", even if a capturing group/parentheses are used):

$ echo -e "123.123   1)  234.234\n345.345   0)  456.456" \
| awk '{print; n=split($0,a,/([0-9]\))/);for(i=1;i<=n;i++){print i,a[i];}}'
123.123   1)  234.234
1 123.123   
2   234.234
345.345   0)  456.456
1 345.345   
2   456.456

Can awk be forced to keep the delimiter matches in the array which is the result of split?

Upvotes: 0

Views: 4554

Answers (2)

Ed Morton

Reputation: 204638

As @konsolebox mentioned, newer gawk versions let you pass a fourth array argument to split() to save the field separator values. You could also take a look at FPAT and patsplit(). Another alternative would be to set RS to the regex you are currently splitting on and then read each matched separator from RT.

Having said that, I don't understand why you're thinking of a solution involving field separators when you could solve the problem you posted with just a gensub() in gawk:

$ echo -e "123.123   1)  234.234\n345.345   0)  456.456" |
gawk '{print gensub(/[[:digit:]]\)/,"--&--",1)}'
123.123   --1)--  234.234
345.345   --0)--  456.456

If there's a different problem you're really trying to solve that'd require remembering the FS values, let us know and we can point you in the right direction.
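If gawk is not available, the same substitution can be sketched with the standard gsub() function, which also understands & in the replacement text. This is only a portable variant of the idea above, not part of the original answer; wrap_delims is a hypothetical helper name, and unlike the gensub() call it wraps every match on the line rather than just the first.

```shell
# Hypothetical helper: wrap every "digit)" match in "--" using gsub(),
# which modifies $0 in place; the trailing 1 prints the modified line.
wrap_delims() {
    awk '{ gsub(/[0-9]\)/, "--&--") } 1'
}

printf '123.123   1)  234.234\n345.345   0)  456.456\n' | wrap_delims
```

With the sample input, where each line contains exactly one delimiter, the output matches the gensub() result.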

Upvotes: 1

konsolebox

Reputation: 75618

You can use split() in gawk, for example:

echo -e "123.123   1)  234.234\n345.345   0)  456.456" |
gawk '{
    nf = split($0, a, /[0-9]\)/, seps)
    for (i = 1; i < nf; ++i) printf "%s--%s--", a[i], seps[i]
    print a[i]
}'

Output:

123.123   --1)--  234.234
345.345   --0)--  456.456

The GNU awk (gawk) version of split() accepts an optional fourth argument, an array name; if it is present, the matched separators are saved in that array.

As noted in Gawk's manual:

split(s, a [, r [, seps] ])

Split the string s into the array a and the separators array seps on the regular expression r, and return the number of
fields.  If r is omitted, FS is used instead.  The arrays a and seps are cleared first.  seps[i] is the field separator
matched by r between a[i] and a[i+1].  If r is a single space, then leading whitespace in s goes into the extra array element
seps[0] and trailing whitespace goes into the extra array element seps[n], where n is the return value of split(s, a, r,
seps).  Splitting behaves identically to field splitting, described above.

Upvotes: 4
