user31264

Reputation: 6737

In AWK (or GAWK), how can I determine where the n-th word starts?

Example: n=3, the input is

foo  bar    baz
a b c d e f g h
12  34 5 678

the output should be:

13
5
8

Upvotes: 2

Views: 63

Answers (3)

Ed Morton

Reputation: 204258

This will work for any field separator, including multi-char regexp, using GNU awk for the 4th arg to split():

$ cat tst.awk
{
    split($0,flds,FS,seps)
    indent = 1
    for (i=0; i<n; i++) {
        indent += length(flds[i] seps[i])
    }
    print indent
}

$ awk -v n=3 -f tst.awk file
13
5
8

or with multi-char strings of .+. or .-. between fields:

$ cat file2
foo.+.bar.-.baz
a.+.b.-.c.+.d.-.e.+.f.-.g.+.h
12.-.34.+.5.-.678

$ awk -F'[.][+-][.]' -v n=3 -f tst.awk file2
13
9
11

Note that since we're using FS as an argument to split() it will be treated as a dynamic regexp (i.e. one stored in a string) and so any backslashes in the FS would need to be doubled.

Also note that we start the counting loop at 0, not 1, because with the default FS any leading white space before flds[1] (i.e. before $1) is stored in seps[0]. flds[0] will always be empty, and for a non-default FS seps[0] will also be empty, so no harm is done by including their lengths in all cases.
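
For readers without GNU awk, the same bookkeeping (count any leading blanks first, then each word plus the blanks after it) can be sketched in POSIX awk. This is not the split()-based method above, only a rough equivalent for the default whitespace FS. The function name nth_word_start and the choice to print 0 for lines with fewer than n words are assumptions made for illustration.

```shell
# nth_word_start N: read lines on stdin and print the 1-based column where
# word N starts, or 0 if the line has fewer than N words (hypothetical helper).
# Handles only the default FS, i.e. words separated by spaces and tabs.
nth_word_start() {
    awk -v n="$1" '{
        rest = $0; pos = 1
        for (i = 1; i <= n; i++) {
            if (match(rest, /^[ \t]+/)) {      # blanks before word i
                pos += RLENGTH
                rest = substr(rest, RLENGTH + 1)
            }
            if (rest == "") { pos = 0; break } # ran out of words
            if (i == n) break                  # pos now points at word n
            match(rest, /^[^ \t]+/)            # word i itself
            pos += RLENGTH
            rest = substr(rest, RLENGTH + 1)
        }
        print pos
    }'
}

# Leading blanks are counted, like seps[0] in the gawk version.
printf '  foo bar baz\n' | nth_word_start 3
```

With the two leading spaces in that input the third word starts at column 11, which is what the seps[0] accounting in tst.awk is meant to capture as well.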

Upvotes: 1

Tom Fenech

Reputation: 74685

You can use match to do this:

$ awk 'match($0, /[[:blank:]]*([^[:blank:]]+[[:blank:]]+){2}/) {
    print RLENGTH + 1 
}' file
13
5
8

Or using a parameter with a dynamic regex:

$ awk -v n=3 'match($0, "[[:blank:]]*([^[:blank:]]+[[:blank:]]+){" (n - 1) "}") {
    print RLENGTH + 1 
}' file
13
5
8

This searches for optional leading blanks (spaces or tabs), followed by something non-blank, followed by something blank, n - 1 times, where n is the word number. match sets the variables RSTART and RLENGTH (in this case, RSTART == 1). RLENGTH gives the length of the match, so one character after that is where the nth word starts.
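
To see what match leaves in RSTART and RLENGTH, here is a small sketch. It builds the repeated group by plain string concatenation instead of {n-1}, on the assumption that some older awks do not support interval expressions, and it anchors the regex with ^ so RSTART is always 1. The function name prefix_len and the "no match" output are made up for this example.

```shell
# prefix_len N: for each input line print RSTART, RLENGTH and RLENGTH + 1
# for a regex covering optional leading blanks plus the first N-1 words
# and the blanks after each of them (hypothetical helper).
prefix_len() {
    awk -v n="$1" '
    BEGIN {
        re = "^[ \t]*"                       # optional leading blanks
        for (i = 1; i < n; i++)
            re = re "[^ \t]+[ \t]+"          # one word and its trailing blanks
    }
    {
        if (match($0, re))
            print RSTART, RLENGTH, RLENGTH + 1
        else
            print "no match"                 # fewer than n-1 separated words
    }'
}

printf 'foo  bar    baz\n' | prefix_len 3
```

For that line the match covers "foo  bar    " (12 characters), so it prints 1 12 13, and 13 is the column where baz starts.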

Since you mentioned GNU awk, you can shorten things by using \s (which is actually [[:space:]], but that works here too) and non-space \S:

$ awk -v n=3 'match($0, "\\s*(\\S+\\s+){" (n - 1) "}") { print RLENGTH + 1 }' file

In a dynamic regex (one stored in a string), the backslashes themselves need to be escaped.
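
A quick way to convince yourself of the escaping rule, using a literal dot rather than \s so that it does not depend on GNU awk. The function name dot_positions is only for illustration.

```shell
# dot_positions: print where the literal dot in "a.b" is found, once with a
# regex literal and once with a string used as a regex (hypothetical helper).
dot_positions() {
    awk 'BEGIN {
        s = "a.b"
        print match(s, /\./)     # regex literal: a single backslash
        print match(s, "\\.")    # string regex: the backslash is doubled
    }'
}

dot_positions
```

Both lines should print 2: the string "\\." is first reduced to \. by string processing and only then interpreted as a regex.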

Upvotes: 3

James Brown

Reputation: 37454

The simplest would probably be:

$ awk -v n=3 '{print index($0,$n)}' file
13
5
8

but it's error-prone and would require some checking. $n is the nth word (here the third), or more precisely the nth field as separated by FS, the field separator. index returns the character position where the first occurrence of that string begins. If FS is the default (runs of spaces and tabs), you'd probably want to search for a space followed by the word and add one to the position:

$ awk -v n=3 '{print 1 + index($0," " $n)}' file
13
5
8

... which, as pointed out in the comments, is also error-prone: it can fail for n=1, or when the nth word matches the beginning of a prior word.
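
To make those failure modes concrete, here is a small sketch that prints the result of both index variants side by side. The function name index_variants and the sample lines are made up for the demonstration.

```shell
# index_variants N: for each input line print the result of the plain
# index($0, $N) variant and of the space-prefixed variant (hypothetical helper).
index_variants() {
    awk -v n="$1" '{
        print index($0, $n), 1 + index($0, " " $n)
    }'
}

# In both lines the third word is "a": it really starts at column 5 in the
# first line and at column 6 in the second.
printf 'a b a\nx ab a\n' | index_variants 3
```

This prints 1 5 for the first line and 3 3 for the second. The plain variant is wrong both times because index finds the earliest "a", and the space-prefixed variant is wrong on the second line because " a" first occurs at the start of " ab".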

We could use the seps argument of GNU awk's split:

$ awk -v n=3 '{
    s=1                      # reset s to 1
    split($0,a,/ +/,b)       # split to a and separators to b
    for(i=1;i<n;i++)         # iterate to n
        s+=length(a[i] b[i]) # sum the lengths of a b
    print s                  # print the position
}' file
13
5
8

Upvotes: 3
