Reputation: 1467
I want to extract every run of adjacent, space-separated words of length one, two, and three. That is, convert this line:
a bc de fghi j
to
a
bc
de
fghi
j
a bc
bc de
de fghi
fghi j
a bc de
bc de fghi
de fghi j
How can I do this with awk as fast as possible? I'm completely stuck. I thought about something like match($0, /^([a-z]+)$|([^\s]+\s[^\s]+)|([^\s]+\s[^\s]+\s[^\s]+)/, arr), but that can't work in this situation.
EDIT
The essential problem is how to combine this with split. For example,
{
    split($0, arr, ",");
    for (i = 1; i <= length(arr); i++) {
        print arr[i]
    }
    for (i = 1; i <= length(arr) - 1; i++) {
        print arr[i] " " arr(i+1)
    }
    for (i = 1; i <= length(arr) - 2; i++) {
        print arr[i] " " arr[i+1] " " arr[i+2]
    }
}
gives the error "Call to undefined function".
Upvotes: 1
Views: 101
Reputation: 785531
You can use this Perl command, which wraps each pattern in a lookahead so that overlapping matches are captured:
s='a bc de fghi j'
perl -ne 'print join "\n" =>$_ =~ /(?=\b(\w+)\b)/g; print "\n";
print join "\n" =>$_ =~ /(?=\b(\w+\s+\w+)\b)/g; print "\n";
print join "\n" =>$_ =~ /(?=\b(\w+\s+\w+\s+\w+)\b)/g; print "\n"' <<< "$s"
a
bc
de
fghi
j
a bc
bc de
de fghi
fghi j
a bc de
bc de fghi
de fghi j
Upvotes: 1
Reputation: 75585
Here is a somewhat verbose awk script that will generate the output based on the input you gave.
{
    for (i = 1; i <= NF; i++) {
        print $i
    }
    for (i = 1; i <= NF - 1; i++) {
        print $i " " $(i+1)
    }
    for (i = 1; i <= NF - 2; i++) {
        print $i " " $(i+1) " " $(i+2)
    }
}
Run it like this:
awk -f Extract.awk Input.txt
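As for the split-based attempt in the question: the "Call to undefined function" message most likely comes from arr(i+1), which awk parses as a call to a function named arr; array subscripts need square brackets. A minimal sketch of that approach follows. It assumes space-separated input (so split uses " " rather than ","), and it stores the count returned by split in a variable n instead of calling length(arr), since not every awk supports length on arrays. The wrapper name ngrams_split is made up for illustration.

```shell
# Hypothetical helper: print runs of 1, 2 and 3 adjacent words using split().
ngrams_split() {
    awk '{
        n = split($0, arr, " ")        # n is the number of words on this line
        for (i = 1; i <= n; i++)
            print arr[i]
        for (i = 1; i <= n - 1; i++)
            print arr[i] " " arr[i+1]  # square brackets, not arr(i+1)
        for (i = 1; i <= n - 2; i++)
            print arr[i] " " arr[i+1] " " arr[i+2]
    }'
}

printf '%s\n' 'a bc de fghi j' | ngrams_split
```

For whitespace-separated input this does the same work as the $i version above, because awk has already split the line into fields.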
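Here is a more general version that can be extended to more than three adjacent words by raising the bound on k (k < 3, as written, yields runs of one to three words).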
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP) # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}
{
    for (i = 1; i <= NF; i++) {
        a[i] = $i
    }
    for (k = 0; k < 3; k++) {
        for (i = 1; i <= NF - k; i++) {
            result = join(a, i, i + k, " ")
            print result
        }
    }
}
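To avoid editing the script for each run length, the upper bound can be passed in with -v. The following is only a sketch: the wrapper name ngrams and the variable name max are made up for illustration, and it builds each run directly from the fields rather than going through join.

```shell
# Hypothetical wrapper: print all runs of 1..max adjacent words read from stdin.
ngrams() {
    awk -v max="$1" '{
        for (k = 1; k <= max; k++) {             # k = number of words per run
            for (i = 1; i + k - 1 <= NF; i++) {  # i = index of the first word
                out = $i
                for (j = i + 1; j <= i + k - 1; j++)
                    out = out " " $j
                print out
            }
        }
    }'
}

printf '%s\n' 'a bc de fghi j' | ngrams 4
```

Called with 3, this reproduces the twelve lines from the question; with 4, it also prints the two four-word runs.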
Upvotes: 1