discipulus

Reputation: 2715

delete only lines that contain specific pattern

I have a tab-delimited file with a large number of features. I want to delete the least informative lines: specifically, lines that have a question mark (?) in every column except the last, which can be Yes or No. My file looks like this:

a   b   c   frequent
?   ?   ?   No
?   ?   1   Yes
1   ?   1   No
?   1   1   Yes
?   ?   ?   No
?   ?   ?   Yes

I want to delete lines that look like

?   ?   ?   No 

or

?   ?   ?   Yes

I can use

sed '/pattern/d' ./file

However, how do I use it when the ? is repeated many times? There can be hundreds of columns, so solutions such as

sed '/?  ?  ?  No/d' ./file

and

sed '/?  ?  ?  Yes/d' ./file

will not work. I want my output to look like

a   b   c   frequent
?   ?   1   Yes
1   ?   1   No
?   1   1   Yes

EDIT 1: A second case is a tab-delimited file whose first column is a serial number and whose last column is a class label that may contain spaces. Here I want to consider the second through second-to-last columns and remove the rows in which all of those columns are question marks.

No  a   b   c   itemname
1   ?   ?   ?   frying pan
2   ?   ?   1   t-shirt
3   1   ?   1   microwave oven
10  ?   1   1   forks and knives
11  ?   ?   ?   gold
12  ?   ?   ?   chain

The wanted output is

No  a   b   c   itemname
2   ?   ?   1   t-shirt
3   1   ?   1   microwave oven
10  ?   1   1   forks and knives

Upvotes: 1

Views: 1391

Answers (5)

Håkon Hægland

Reputation: 40748

Regarding your last update, you can modify the solution of @Jotne as follows:

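NR==1 {
    p=NF-2    # number of data columns between the first and the last
    print     # keep the header line, as in the wanted output
    next
}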
{
    for (i=1;i<=p;i++) {
        if (!( $(i+1)=="?")) f=1
    }
}
f {
    print
    f=x
}
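Because the last column in the updated example can contain spaces, it may be safer to split on tabs explicitly. The following is only a sketch under that assumption; the file name items.tsv and the function name keep_informative_rows are made up for the illustration, and the sample data is copied from the question.

```shell
# Hypothetical sample file, tab-delimited, mirroring the question's EDIT 1.
printf 'No\ta\tb\tc\titemname\n' > items.tsv
printf '1\t?\t?\t?\tfrying pan\n' >> items.tsv
printf '2\t?\t?\t1\tt-shirt\n' >> items.tsv
printf '3\t1\t?\t1\tmicrowave oven\n' >> items.tsv
printf '10\t?\t1\t1\tforks and knives\n' >> items.tsv
printf '11\t?\t?\t?\tgold\n' >> items.tsv
printf '12\t?\t?\t?\tchain\n' >> items.tsv

# Keep the header, then keep every row that has at least one value
# other than "?" in columns 2 .. NF-1 (first and last columns are ignored).
keep_informative_rows() {
    awk -F'\t' '
    NR == 1 { print; next }
    {
        for (i = 2; i < NF; i++)
            if ($i != "?") { print; next }   # first informative column found
    }
    ' "$1"
}

keep_informative_rows items.tsv
```

With -F'\t' a label such as "frying pan" stays a single field, so the loop bounds do not depend on how many words the label contains.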

Upvotes: 0

Mohammad AbuShady

Reputation: 42799

You can try this to handle both cases in one step

 sed -r '/(\?\s+){3}(Yes|No)/d' ./file

EDIT:

Regarding the number of ? per line, you can replace {3} with + if you want "one or more", use {3,} if you want "3 or more", or use {3,5}, for example, if you want "between 3 and 5".
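As a sketch of the "one or more" variant, the pattern can also be anchored to the whole line. Without ^ and $, a pattern using + would match any line that merely ends in a single ? followed by Yes or No, even if earlier columns hold real values. The file name sample.tsv and the function name drop_all_unknown are assumptions made for this example; -E is used here as the equivalent of -r.

```shell
# Hypothetical sample file, tab-delimited, with the data from the question.
printf 'a\tb\tc\tfrequent\n' > sample.tsv
printf '?\t?\t?\tNo\n' >> sample.tsv
printf '?\t?\t1\tYes\n' >> sample.tsv
printf '1\t?\t1\tNo\n' >> sample.tsv
printf '?\t1\t1\tYes\n' >> sample.tsv
printf '?\t?\t?\tNo\n' >> sample.tsv
printf '?\t?\t?\tYes\n' >> sample.tsv

# Delete lines that consist only of "?" columns followed by Yes or No.
# ^ and $ tie the match to the whole line; + allows any number of columns.
drop_all_unknown() {
    sed -E '/^(\?[[:space:]]+)+(Yes|No)[[:space:]]*$/d' "$1"
}

drop_all_unknown sample.tsv
```

The trailing [[:space:]]* tolerates stray whitespace after Yes or No, such as the trailing space in one of the question's examples.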

EDIT2:

This is a grep alternative

egrep -v '(\?\s+){3}(Yes|No)' ./fileToTest > outputFile

Note:

The reason sed wasn't working at first is that this pattern needs extended regular expressions. After checking sed's help, I found that the flag for that is -r.

Upvotes: 3

Idriss Neumann

Reputation: 3838

Using awk :

[ ~]$ cat test.txt 
a   b   c   frequent
?   ?   ?   No
?   ?   1   Yes
1   ?   1   No
?   1   1   Yes
?   ?   ?   No
?   ?   ?   Yes
[ ~]$ awk '!/\?[ \t]*\?[ \t]*\?[ \t]*(Yes|No)/' test.txt
a   b   c   frequent
?   ?   1   Yes
1   ?   1   No
?   1   1   Yes
[ ~]$ 

You could also use egrep like this :

[ ~]$ egrep  -v "\?\ *\?\ *\?\ *(Yes|No)" test.txt 
a   b   c   frequent
?   ?   1   Yes
1   ?   1   No
?   1   1   Yes

Upvotes: 0

Jotne

Reputation: 41456

Escape the ? and enable extended regex with -r, so that + means "one or more"

sed -r '/\? +\? +\? +Yes/d' file

Since your file seems to be separated by multiple spaces, you need +

Or if you have tabs

sed -r '/\?\t\?\t\?\tNo/d' file

An awk solution to delete lines that only have ?

awk '{for (i=1;i<NF;i++) if ($i!="?") f=1} f {print;f=x}' file

Or using aragaer's approach, print only lines with at least one 1

awk '/1/ || NR==1' file
a   b   c   frequent
?   ?   1   Yes
1   ?   1   No
?   1   1   Yes

Upvotes: 3

aragaer

Reputation: 17848

Is it guaranteed that every column contains either ? or 1? If so, simply print the first line plus every line that contains at least one 1, and drop everything else:

sed -n '1p; /1/p;' file

Upvotes: 2
