gen_im

Reputation: 11

Filter a multiline-record file based on whether one of its lines meets a condition (word count)

Hi everyone,

I am looking for a way to keep only those records from a txt file that meet a condition, described below.

Here is an example of the data:

aa bb cc 
11 22 33 
44 55 66
77 88 99 

aa bb cc 
11 22 33 44 55 66 77
44 55 66 66
77 88 99

aa bb cc 
11 22 33 44 55
44 55 66 
77 88 99 77

...

Basically, each record in the file is a block of 5 lines: 4 lines containing tab-delimited strings/numbers, followed by an empty line (\n).

The first line of a record always has 3 elements, while the number of elements in the 2nd, 3rd, and 4th lines can vary.

What I need to do is remove every record (5-line block) where the total number of elements in the second line is > 3 (I don't care about the number of elements in the other lines). The output for the example should look like this:

aa bb cc 
11 22 33 
44 55 66
77 88 99 

...

So only the records whose second line has 3 elements are kept and written to a new txt file.

I tried to do it with awk by modifying FS and RS values like this:

awk 'BEGIN {RS="\n\n"; FS="\n";}
{if(length($2)==3) print $2"\n\n"; }' test_filter.txt

but if(length($2)==3) is not correct: I should count the number of entries in the 2nd field instead of measuring its length, and I can't find how to do that. Any help would be much appreciated!

Thanks in advance.

Upvotes: 1

Views: 69

Answers (1)

markp-fuso

Reputation: 34244

You can use the split() function to break a line/field/string into components; in this case:

n=split($2,arr," ")

Where:

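  • we split field #2, using " " as the delimiter; in awk a single space is a special case that splits on runs of spaces and tabs, so tab-delimited lines are handled as well ...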
  • components are stored in array arr[] and ...
  • n is the number of elements in the array
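
As a quick check of what split() returns, here is a minimal sketch. The helper name count_fields and the sample strings are made up for illustration and are not part of OP's script. The second call passes tab-separated input to show that the " " separator counts tab-delimited elements too.

```shell
# count_fields is a hypothetical helper used only for this illustration:
# it prints how many whitespace-separated elements a string contains.
count_fields() {
    printf '%s\n' "$1" | awk '{ n = split($0, arr, " "); print n }'
}

count_fields "11 22 33"                     # prints 3
count_fields "$(printf '11\t22\t33\t44')"   # prints 4 (tab-separated input)
```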

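Pulling this into OP's current awk code, along with a couple of small changes (ORS is set so the blank line between records is kept in the output, records with 4 or more elements in the second line are skipped with next, and the trailing 1 prints everything else), we get: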

awk 'BEGIN {ORS=RS="\n\n"; FS="\n"} {n=split($2,arr," "); if (n>=4) next}1' test_filter.txt

With an additional block added to our sample:

$ cat test_filter.txt
aa bb cc
11 22 33
44 55 66
77 88 99

aa bb cc
11 22 33 44 55 66 77
44 55 66 66
77 88 99

aa bb cc
111 222 333
444 555 665
777 888 999

aa bb cc
11 22 33 44 55
44 55 66
77 88 99 77

This awk solution generates:

aa bb cc
11 22 33
44 55 66
77 88 99

aa bb cc
111 222 333
444 555 665
777 888 999
                   # blank line here
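
As a possible variant (not the command tested above), awk's paragraph mode, RS="", treats runs of blank lines as the record separator. POSIX defines this mode, whereas it leaves the behavior of a multi-character RS such as "\n\n" unspecified, so this sketch may be more portable across awk implementations. The wrapper name filter_records and the temporary two-record sample file are assumptions made for illustration.

```shell
# Recreate a small two-record sample so the sketch runs on its own.
sample=$(mktemp)
cat > "$sample" <<'EOF'
aa bb cc
11 22 33
44 55 66
77 88 99

aa bb cc
11 22 33 44 55
44 55 66
77 88 99 77
EOF

# filter_records is a hypothetical wrapper name.
# RS=""      : paragraph mode, records are separated by blank lines
# FS="\n"    : each line of a record becomes one field
# ORS="\n\n" : write a blank line after every record that is kept
# The pattern keeps a record only if its second line has at most 3 elements.
filter_records() {
    awk 'BEGIN { RS = ""; FS = "\n"; ORS = "\n\n" }
         split($2, arr, " ") <= 3' "$1"
}

filter_records "$sample"
```

On this sample only the first record is printed, followed by a blank line; the second record is dropped because its second line has 5 elements.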

Upvotes: 2
