mattbawn

Reputation: 1378

Bash exclude lines where proportion of columns contain matched value

I have a large text file that I would like to filter by excluding lines in which a certain number of columns match a given character. I had previously removed lines where all columns from 2 onwards contained a 0 or a . like so:

awk '{
    for (i=2; i<=NF; i++)
        if ($i!~/^(\.|0)/) {
            print
            break
        }
}'

but now I would like to print only the lines that have fewer than a specific number of columns with this value (".").

For example with data:

A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .

and a match value of 2 I would expect the bottom two lines to be excluded so that the output would be:

A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0

Any ideas?

Upvotes: 1

Views: 247

Answers (5)

Claes Wikner

Reputation: 1517

Perhaps this is alright.

    awk '$0 !~/\. \./' file
    A B C D E
    0 1 . 0 0
    1 ./. 0 1 1
    1 1 0 0 0

Upvotes: -1

Andreas Louv

Reputation: 47119

With awk:

$ awk '{c=0;for(i=1;i<=NF;i++) c += ($i == ".")}c<2' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0

Basically, it iterates over each column and adds one to the counter if the column equals a period (.).

The c<2 part prints the line only if fewer than two columns are periods.
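
The threshold does not have to be hard-coded. As a minimal sketch, the same counting loop can be wrapped in a shell function that takes the limit as an argument; the function name filter_lone_dots and the awk variable max are made up for this illustration and are not part of the original answer:

```shell
# Hypothetical helper: print the lines of FILE that have fewer than MAX
# fields consisting of a single period.
# Usage: filter_lone_dots MAX FILE
filter_lone_dots() {
    awk -v max="$1" '{
        c = 0
        for (i = 1; i <= NF; i++)   # visit every field, including the last
            if ($i == ".") c++      # "./." is not equal to "." and is not counted
    }
    c < max' "$2"
}
```

Calling filter_lone_dots 2 file should behave like the one-liner above, while a larger first argument tolerates more lone periods per line.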

With sed one can use:

$ sed -r 'h;s/[^. ]+//g;s/\.\. *//g;/\. \./d;x' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0

-r enables extended regular expressions (-E on *BSD).

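Basically, a copy of the pattern space is stored in the hold buffer, then everything except spaces and periods is removed. The second substitution drops the .. left behind by each ./. field, so that only lone periods remain.

Now, if the pattern space contains two separate periods, the line is deleted; otherwise the pattern space is exchanged with the hold buffer, which restores the original line for printing.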
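
One caveat: the final test, /\. \./, looks for two periods exactly one space apart. That holds for the sample data, but a line such as 0 . 0 . 0 reduces to two periods with two spaces between them and would slip through. A possible variant, offered only as a sketch, allows any number of spaces between the two periods; the function name sed_filter is made up here, and -E is assumed to be available (GNU sed and the BSDs accept it):

```shell
# Hypothetical variant of the sed filter above: " +" lets the two lone
# periods sit any number of columns apart.
# Usage: sed_filter FILE
sed_filter() {
    sed -E 'h;s/[^. ]+//g;s/\.\. *//g;/\. +\./d;x' "$1"
}
```

Like the original command, this assumes that periods appear only as . or ./. fields; a value such as 1.5 would be miscounted as a lone period.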

Upvotes: 3

James Brown

Reputation: 37424

$ awk '{delete a; for(i=1;i<=NF;i++) a[$i]++; if(a["."]>=2) next} 1' foo
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0

It iterates over all fields (for), counts how often each field value occurs and, if a record contains 2 or more . fields, skips it without printing (next). If you want to count the periods only from field 3 onward, change the start value of i in the for: for(i=3; ...).
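
To make the starting field adjustable without editing the program, something along these lines might work. It is a sketch rather than the answer's own code: the function name filter_from_field and the awk variables start and max are invented for the example, and a plain counter replaces the array because only the . count is needed:

```shell
# Hypothetical helper: skip records that have MAX or more lone periods
# in fields START..NF, and print everything else.
# Usage: filter_from_field START MAX FILE
filter_from_field() {
    awk -v start="$1" -v max="$2" '{
        n = 0
        for (i = start; i <= NF; i++)   # fields before START are ignored
            if ($i == ".") n++
        if (n >= max) next              # too many lone periods: drop the record
    } 1' "$3"
}
```

With a start of 3, a line such as . . 0 0 0 is kept because its two periods sit in fields 1 and 2, whereas a start of 1 would drop it.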

Upvotes: 2

Mark Setchell

Reputation: 207670

Similar to @spasic's answer, but easier (for me) to read!

perl -ane 'print if (grep { /^\.$/} @F) < 2' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0

The -a separates the space-separated fields into an array called @F for me. I then grep in the array @F looking for elements that consist of just a period - i.e. those that start with a period and end immediately after the period. That counts the lone periods in each line and I print the line if that number is less than 2.

Upvotes: 1

Sundeep

Reputation: 23677

$ cat ip.txt 
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .

$ perl -ne '(@c)=/\.\/\.|\./g; print if $#c < 1' ip.txt 
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0

  • (@c)=/\.\/\.|\./g builds an array of the ./. or . matches found in the current line
  • $#c is the index of the last element, i.e. (size of array - 1)
  • So, to exclude only lines containing 3 or more matches of ./. or . use $#c < 2

Upvotes: 1
