How to extract one/two/three adjacent words from a line?

Question

I want to extract every run of one, two, or three adjacent words (separated by spaces) from a line. That is, convert this line:

a bc de fghi j

to

a
bc
de
fghi
j
a bc
bc de
de fghi
fghi j
a bc de
bc de fghi
de fghi j

How can I do this with awk as fast as possible? I'm stuck and have no idea where to start. I thought about something like match($0, /^([a-z]+)$|([^\s]+\s[^\s]+)|([^\s]+\s[^\s]+\s[^\s]+)/, arr), but that can't work in this situation.

EDIT: The essential problem is how to combine this with split. For example,

{split($0, arr, ",");
for (i = 1; i <= length(arr); i++) {
    print arr[i]
}
for (i = 1; i <= length(arr) - 1; i++) {
    print arr[i] " " arr(i+1)
}
for (i = 1; i <= length(arr) - 2; i++) {
    print arr[i] " " arr[i+1] " " arr[i+2]
}
}

gives the error "Call to undefined function".

merlin2011 · Accepted Answer

Here is a somewhat verbose awk script that will generate the output based on the input you gave.

{
    for (i = 1; i <= NF; i++) {
        print $i
    }
    for (i = 1; i <= NF - 1; i++) {
        print $i " " $(i+1)
    }
    for (i = 1; i <= NF - 2; i++) {
        print $i " " $(i+1) " " $(i+2)
    }
}

Run it like this:

awk -f Extract.awk Input.txt
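
The split-based attempt in the question's EDIT can be made to work as well. The error most likely comes from arr(i+1): parentheses after a name make awk parse it as a function call, so it needs to be arr[i+1]. Below is a minimal sketch, not a definitive version. It assumes the words are separated by whitespace (the EDIT splits on commas, which does not match the sample line), and it uses the return value of split instead of length(arr), which older awks may not accept for arrays. The function name split_ngrams is made up for this example.

```shell
# split_ngrams: hypothetical wrapper name; reads lines on stdin.
split_ngrams() {
    awk '{
        # " " as separator splits on runs of whitespace, like default field splitting
        n = split($0, arr, " ")
        for (i = 1; i <= n; i++)     print arr[i]
        for (i = 1; i <= n - 1; i++) print arr[i] " " arr[i+1]               # brackets, not arr(i+1)
        for (i = 1; i <= n - 2; i++) print arr[i] " " arr[i+1] " " arr[i+2]
    }'
}

printf 'a bc de fghi j\n' | split_ngrams
```

Since awk already splits each line into $1..$NF, the split call is only needed when the separator differs from the field separator.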

Here is a more general version that extends to more than three adjacent words: raise the bound on k in the outer loop.

function join(array, start, end, sep, result, i)
{
    if (sep == "")
       sep = " "
    else if (sep == SUBSEP) # magic value
       sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}

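{
    # copy the fields into an array so they can be passed to join
    for (i = 1; i <= NF; i++) {
        a[i] = $i
    }
    # k + 1 is the number of words in each phrase
    for (k = 0; k < 3; k++) {
        for (i = 1; i <= NF - k; i++) {
            result = join(a, i, i + k, " ")
            print result
        }
    }
}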
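
If the maximum phrase length should be configurable, one possible variation is to pass it in with -v and build each phrase directly from the fields, without the join helper. This is only a sketch under that assumption. The names ngrams and max are invented for illustration and are not part of the answer above.

```shell
# ngrams MAX: hypothetical wrapper; prints every run of 1..MAX adjacent words
# for each line read on stdin, shortest runs first.
ngrams() {
    awk -v max="$1" '{
        for (k = 1; k <= max; k++) {              # k = words per phrase
            for (i = 1; i + k - 1 <= NF; i++) {   # i = index of the first word
                phrase = $i
                for (j = i + 1; j < i + k; j++)   # append the remaining k - 1 words
                    phrase = phrase " " $j
                print phrase
            }
        }
    }'
}

printf 'a bc de fghi j\n' | ngrams 4
```

With a maximum of 3 this prints the same lines, in the same order, as the scripts above. A line with fewer words than the maximum simply produces no phrases of the longer lengths.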
