Reputation: 23
I have a tab-separated file in which the last fifteen fields are zeros and ones. What I need to do is print the lines that do not contain more than five consecutive zeros or more than five consecutive ones within those fifteen fields, which are divided into groups of five fields.
File:
abadenguísimo abadenguísimo adjective n/a n/a singular n/a masculine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abalaustradísimo abalaustradísimo adjective n/a n/a singular n/a masculine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abiertísimas abiertísimo adjective n/a n/a plural n/a feminine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abellacadísimo abellacadísimo adjective n/a n/a singular n/a masculine 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0
cansonísimos cansonísimo adjective n/a n/a plural n/a masculine 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1
Output:
abellacadísimo abellacadísimo adjective n/a n/a singular n/a masculine 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0
cansonísimos cansonísimo adjective n/a n/a plural n/a masculine 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1
I tried this:
BEGIN {
    FS = "\t"
}
{
    a=0;
    b=0;
    c=0;
    num[A]="";
    num[B]="";
    num[C]="";
    for ( i = 9; i <= 13; i++)
        num[A]=num[A]""$i;
    for (j = 14; j <= 18; j++)
        num[B]=num[B]""$j;
    for (k = 19; k <= 23; k++)
        num[C]=num[C]""$k;
    if ((num[A] != "00000") && (num[A] != "11111")) {
        a=1;
    }
    if (num[B] != "00000") {
        b=1;
    }
    if (num[C] != "00000") {
        c=1;
    }
    if ((a == 1) || (b == 1) || (c == 1)) {
        print;
    }
}
In the end I think I've found a solution, although I still don't know why the code above doesn't work for me:
BEGIN {
    FS = "\t"
    cont=0;
}
{
    a=0;
    b=0;
    c=0;
    sum1=$9+$10+$11+$12+$13;
    sum2=$14+$15+$16+$17+$18;
    sum3=$19+$20+$21+$22+$23;
    if (( sum1 > 0 ) && ( sum1 < 5 )) {
        a=1;
    }
    if ( sum2 > 0 ) {
        b=1;
    }
    if ( sum3 > 0 ) {
        c=1;
    }
    if ((a == 1) || (b == 1) || (c == 1)) {
        cont++;
        print;
    }
}
END {
    print "Total: "NR;
    print "OK: "cont;
}
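As for why the first attempt prints every line: in awk, num[A] uses the value of the variable A as the subscript. Since A, B and C are never assigned, all three are the empty string, so the three loops append to one and the same array element, and the resulting 15-character string never equals "00000". A minimal sketch of the same string comparison with quoted subscripts, assuming tab-separated input and the field positions 9 to 23 from the question (the two sample rows and the variable names rows and out are made up for the demonstration):

```shell
# Two made-up rows: 8 leading fields, then fifteen 0/1 fields.
rows=$(printf 'w\tw\tadj\tx\tx\tsg\tx\tm\t1\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\n'
       printf 'v\tv\tadj\tx\tx\tsg\tx\tm\t1\t0\t1\t1\t1\t0\t0\t1\t0\t0\t1\t0\t0\t0\t0\n')

out=$(printf '%s\n' "$rows" | awk -F'\t' '
{
    # Quoted subscripts: "A", "B" and "C" are three distinct elements.
    num["A"] = num["B"] = num["C"] = ""
    for (i = 9;  i <= 13; i++) num["A"] = num["A"] $i
    for (i = 14; i <= 18; i++) num["B"] = num["B"] $i
    for (i = 19; i <= 23; i++) num["C"] = num["C"] $i
    # Same conditions as the first attempt in the question.
    if ((num["A"] != "00000" && num["A"] != "11111") ||
        num["B"] != "00000" || num["C"] != "00000")
        print $1
}')
printf '%s\n' "$out"    # prints v
```

Only the second row is reported, because its first group is "10111"; the first row has the groups "11111", "00000" and "00000" and is skipped.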
Upvotes: 0
Views: 116
Reputation: 46896
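The following extended regular expression works in GNU grep with your input data. With -v, it drops the lines in which ALL THREE groups of five consist of one repeated digit (backreferences in an ERE and \s are GNU extensions):
grep -Ev '(\s+[01])\1\1\1\1(\s+[01])\2\2\2\2(\s+[01])\3\3\3\3' file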
Since your question is tagged awk, though, let's express this in awk.
We can't do the same thing in awk, because awk traditionally does not support backreferences in regular expressions. So as your script suggests, doing this programmatically may be the answer. Your solution concatenates fields and compares strings. I think I would probably use arithmetic instead -- a sum of the five fields is a number from zero to five. A value of zero or five means "skip", anything else means "print".
#!/usr/bin/awk -f
{
    # Count back from the end in groups of five, until we hit a field
    # that is neither "0" nor "1"...
    group = 0
    start = NF
    while (start >= 5 && $start ~ /^[01]$/) {
        sum[++group] = 0    # clear the sum left over from the previous line
        for (i = start; i > start - 5; i--) { sum[group] += $i }
        start = i
    }
    # Step through every group, counting the groups that mix zeros and
    # ones (a sum of 0 or 5 means the group is all zeros or all ones).
    # At the end of the loop, if found > 0, then we've found a line
    # that does not have the pattern specified.
    found = 0
    for (; group > 0; group--) {
        found += (sum[group] > 0 && sum[group] < 5)
    }
}
# If found > 0, print the line.
found
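Whether a line should be dropped only when all three groups are uniform, or as soon as any one group is uniform, is not spelled out in the question, and the sample data cannot tell the two readings apart. A minimal sketch that shows both verdicts side by side, assuming whitespace-separated input whose last fifteen fields are 0 or 1 (the function name classify and the three rows are hypothetical):

```shell
classify() {
    awk '{
        mixed = 0
        # Walk the last fifteen fields in three groups of five.
        for (g = 0; g < 3; g++) {
            sum = 0
            for (i = NF - 14 + 5 * g; i < NF - 9 + 5 * g; i++) sum += $i
            if (sum > 0 && sum < 5) mixed++      # group has both 0s and 1s
        }
        # Second column: keep the line if at least one group is mixed.
        # Third column: keep the line only if every group is mixed.
        print $1, (mixed > 0 ? "keep" : "drop"), (mixed == 3 ? "keep" : "drop")
    }'
}

printf '%s\n' 'a 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0' \
              'b 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0' \
              'c 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0' | classify
# prints:
# a drop drop
# b keep keep
# c keep drop
```

Row c is where the two readings diverge. The awk script in this answer, like the attempts in the question, appears to follow the first reading (keep the line when at least one group is mixed).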
Upvotes: 0
Reputation: 1456
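With gawk 4:
awk 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file
With gawk 3.1, where interval expressions such as {6,} have to be enabled explicitly:
awk --posix 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file
Update, checking each group of five fields separately, starting at field 9: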
awk '{for(i=9;i<=NF;i++){a[$i];if(++c==5){l=length(a);delete a;c=0;if(l>1){print;break}}}}' file
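The last one-liner is dense, so here is a more spread-out sketch of the same idea: print a line as soon as one group of five contains more than one distinct value. It assumes the 0/1 fields start at field 9, as in the question, and it compares each field with the first field of its group instead of calling length() on an array. The function name mixed_groups and its optional start-field argument are hypothetical.

```shell
mixed_groups() {
    # $1: number of the first 0/1 field (defaults to 9, as in the question)
    awk -v first="${1:-9}" '{
        for (i = first; i + 4 <= NF; i += 5) {
            # A group is mixed when any field differs from its first field.
            for (j = i + 1; j <= i + 4; j++)
                if ($j != $i) { print; next }
        }
    }'
}

printf '%s\n' 'p a b c d e f g 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0' \
              'q a b c d e f g 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0' | mixed_groups
# prints only the line that starts with q
```

Usage on the real data would be along the lines of: mixed_groups < file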
Upvotes: 0
Reputation: 195269
If you translate your requirement from English into a regex and give it to grep, it will do what you want:
grep -vE '(1\s+){6,}|(0\s+){6,}' file
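You can adjust the \s+ to suit your data, for example change it to a tab or something else. Note that grep -E does not interpret \t itself, so a literal tab character is needed there (in bash it can be written as $'\t').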
awk -F'\t' '{
    s = NF - 15 + 1
    c = i = 0
    while (++c <= 3) {
        x = i ? i : s
        t = 0
        for (i = x; i < x + 5; i++) t += $i + 0
        if (t == 0 || t == 5) next
    }
    print
}' file
This awk script gives you the expected output. It checks for "more than FOUR consecutive zeros/ones" instead of FIVE because each group has at most five columns, so ">5" can never happen within a group.
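The grep pattern and the awk script in this answer implement different readings of the requirement. The regex looks for a run of six or more equal digits anywhere in the line, while the awk script rejects a line when any group of five is all zeros or all ones. They agree on the sample data, but not in general. A small sketch that makes the difference visible, assuming space-separated rows made of a label followed by fifteen 0/1 fields, with a run-length loop standing in for the regex (the function names by_run and by_group and both rows are made up):

```shell
# Reading 1: drop a line containing a run of six or more equal 0/1 fields.
by_run() {
    awk '{
        run = 0; prev = ""
        for (i = 2; i <= NF; i++) {
            run = ($i == prev) ? run + 1 : 1
            prev = $i
            if (run >= 6) next
        }
        print $1
    }'
}

# Reading 2: drop a line when any group of five is all 0s or all 1s.
by_group() {
    awk '{
        for (g = 2; g + 4 <= NF; g += 5) {
            t = 0
            for (i = g; i < g + 5; i++) t += $i
            if (t == 0 || t == 5) next
        }
        print $1
    }'
}

rows='x 0 1 1 1 1 1 1 0 1 0 0 1 0 1 0
y 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0'
printf '%s\n' "$rows" | by_run      # prints y
printf '%s\n' "$rows" | by_group    # prints x
```

Row x has six consecutive ones that straddle the first two groups, so only the run-based reading drops it. Row y has a first group of five ones followed by a zero, so only the group-based reading drops it.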
Upvotes: 1