Polucho

Reputation: 23

AWK: Print lines matching a pattern

I have a tab-separated file in which the last fifteen fields consist of zeros and ones. What I need to do is print the lines that do not contain more than five consecutive zeros or more than five consecutive ones within those fifteen fields, which are split into groups of five fields.

File:

abadenguísimo   abadenguísimo   adjective   n/a n/a singular    n/a masculine   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
abalaustradísimo    abalaustradísimo    adjective   n/a n/a singular    n/a masculine   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
abiertísimas    abiertísimo adjective   n/a n/a plural  n/a feminine    1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
abellacadísimo  abellacadísimo  adjective   n/a n/a singular    n/a masculine   1   0   1   1   1   0   0   1   0   0   1   0   0   0   0
cansonísimos    cansonísimo adjective   n/a n/a plural  n/a masculine   0   1   1   1   0   0   0   0   1   0   0   0   0   0   1

Output:

abellacadísimo  abellacadísimo  adjective   n/a n/a singular    n/a masculine   1   0   1   1   1   0   0   1   0   0   1   0   0   0   0
cansonísimos    cansonísimo adjective   n/a n/a plural  n/a masculine   0   1   1   1   0   0   0   0   1   0   0   0   0   0   1

I tried this:

BEGIN {
    FS = "\t"
}

{
    a = 0;
    b = 0;
    c = 0;

    num[A] = "";
    num[B] = "";
    num[C] = "";

    for (i = 9; i <= 13; i++)
        num[A] = num[A] "" $i;
    for (j = 14; j <= 18; j++)
        num[B] = num[B] "" $j;
    for (k = 19; k <= 23; k++)
        num[C] = num[C] "" $k;

    if ((num[A] != "00000") && (num[A] != "11111")) {
        a = 1;
    }
    if (num[B] != "00000") {
        b = 1;
    }
    if (num[C] != "00000") {
        c = 1;
    }
    if ((a == 1) || (b == 1) || (c == 1)) {
        print;
    }
}

I think I've finally found a solution, though I don't know why the code above doesn't work for me.

BEGIN {
    FS = "\t"
    cont = 0;
}

{
    a = 0;
    b = 0;
    c = 0;

    sum1 = $9 + $10 + $11 + $12 + $13;
    sum2 = $14 + $15 + $16 + $17 + $18;
    sum3 = $19 + $20 + $21 + $22 + $23;

    if ((sum1 > 0) && (sum1 < 5)) {
        a = 1;
    }
    if (sum2 > 0) {
        b = 1;
    }
    if (sum3 > 0) {
        c = 1;
    }

    if ((a == 1) || (b == 1) || (c == 1)) {
        cont++;
        print;
    }
}

END {
    print "Total: " NR;
    print "OK: " cont;
}

Upvotes: 0

Views: 116

Answers (3)

ghoti

Reputation: 46896

The following ERE in grep works with your input data, where ALL THREE groups of five have matching content:

egrep -v '(\s+[01])\1\1\1\1(\s+[01])\2\2\2\2(\s+[01])\3\3\3\3' file

Since your question is tagged awk, though, let's express this in awk.

We can't do the same thing in awk, because awk traditionally does not support backreferences in regular expressions. So as your script suggests, doing this programmatically may be the answer. Your solution concatenates fields and compares strings. I think I would probably use arithmetic instead -- a sum of the five fields is a number from zero to five. A value of zero or five means "skip", anything else means "print".

#!/usr/bin/awk -f

{

  # Count back from the end in groups of five, until we hit a field
  # that is neither "0" nor "1"...
  start=NF;
  group=0;
  while (start >= 5 && $start ~ /^[01]$/) {
    group++;
    sum[group]=0;   # reset, otherwise sums accumulate across lines
    for(i=start;i>start-5;i--) { sum[group]+=$i; }
    start=i;
  }

  # Step through every group, adding a condition to a counter.
  # A sum of 1 to 4 means the group mixes zeros and ones.
  # At the end of the loop, if found > 0, then we've found a line
  # that does not have the pattern specified.
  found=0;
  for (; group > 0; group--) {
    found+=(sum[group] > 0 && sum[group] < 5);
  }

}

# If found > 0, print the line.
found
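
As for why the string-comparison attempt in the question prints every line: a likely cause is that A, B and C in num[A], num[B] and num[C] are unset awk variables rather than string keys, so all three subscripts evaluate to the empty string and share a single array element. The sketch below is only a demonstration of that behaviour (the function name demo_unset_subscripts is made up for illustration); it is not part of the script above.

```shell
# Demonstration: unset variables used as array subscripts all map to "".
demo_unset_subscripts() {
  awk 'BEGIN {
    num[A] = "11111"            # A is unset, so this writes num[""]
    num[B] = num[B] "00000"     # B is unset too: appends to the same element
    num[C] = num[C] "00000"     # and so does C
    n = 0
    for (k in num) n++          # count distinct elements in the array
    print n, num[A]
  }'
}

demo_unset_subscripts   # prints: 1 111110000000000
```

Because the single element ends up holding all fifteen digits, it can never equal "00000" or "11111", so the first test always succeeds. Quoting the subscripts (num["A"], num["B"], num["C"]) or using three separate variables should keep the groups apart.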

Upvotes: 0

bian

Reputation: 1456

awk 4

awk 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file

awk 3.1

awk --posix 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file

update

awk '{for(i=9;i<=NF;i++){a[$i];if(++c==5){l=length(a);delete a;c=0;if(l>1){print;break}}}}' file

Upvotes: 0

Kent

Reputation: 195269

If you translate your requirement from English into a regex and give it to grep, it will do what you want:

grep -vE '(1\s+){6,}|(0\s+){6,}' file

You can adjust the \s+ to suit your data, for example to match only tabs. Note that grep -E does not interpret \t as a tab; use grep -P '\t' or embed a literal tab character instead.
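
A minimal sketch of a tab-only variant, assuming a POSIX shell. The tab is produced with printf and spliced into the pattern, since grep -E has no \t escape. The two sample rows are shortened, hypothetical stand-ins for the real data, and filter_runs is just an illustrative name:

```shell
# A literal tab character, usable inside a double-quoted pattern.
tab=$(printf '\t')

# Hypothetical two-row sample: a label followed by fifteen 0/1 fields.
sample=$(printf 'uniform\t1\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\nmixed\t1\t0\t1\t1\t1\t0\t0\t1\t0\t0\t1\t0\t0\t0\t0\n')

# Drop lines containing six or more consecutive tab-terminated 1s or 0s.
filter_runs() {
  grep -vE "(1${tab}+){6,}|(0${tab}+){6,}"
}

printf '%s\n' "$sample" | filter_runs | cut -f1   # prints: mixed
```

Like the \s+ version, this looks at runs across the whole line and does not respect the boundaries between the groups of five.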

Update

awk -F'\t' '{s=NF-15+1
            c=i=0
            while(++c<=3){
                    x=i?i:s 
                    t=0
                    for(i=x;i<x+5;i++) t+=$i+0
                    if(t==0||t==5) next
            }
            print
    }' file

This gives you the expected output. It checks for "more than FOUR consecutive zeros/ones" instead of FIVE: each group has at most 5 elements/columns, so ">5" can never happen.
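
To see the numbers behind the t==0||t==5 test, the sketch below prints the three group sums for each row. It assumes the last fifteen fields are the 0/1 columns; the function name group_sums and the two sample rows are hypothetical.

```shell
# Print the first field followed by the sum of each group of five.
group_sums() {
  awk -F'\t' '{
    s = NF - 14                       # index of the first 0/1 field
    out = $1
    for (g = 0; g < 3; g++) {
      t = 0
      for (i = s + 5 * g; i < s + 5 * g + 5; i++) t += $i
      out = out " " t                 # 0 or 5 means the group is uniform
    }
    print out
  }'
}

printf 'uniform\t1\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\nmixed\t1\t0\t1\t1\t1\t0\t0\t1\t0\t0\t1\t0\t0\t0\t0\n' | group_sums
# prints:
# uniform 5 0 0
# mixed 4 1 1
```

A row is skipped by the script above as soon as any of its sums is 0 or 5, so only rows whose three sums all fall between 1 and 4 are printed.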

Upvotes: 1
