Reputation: 2715
I have a tab-delimited file with a large number of features. I want to delete the least informative lines. To be specific, I want to delete lines that have a question mark (?) in every column except the last, which can be Yes or No. My file looks like
a b c frequent
? ? ? No
? ? 1 Yes
1 ? 1 No
? 1 1 Yes
? ? ? No
? ? ? Yes
I want to delete lines that look like
? ? ? No
or
? ? ? Yes
I can use
sed '/pattern/d' ./file
However, how do I use it for an arbitrary number of ? columns? There can be hundreds of columns, so solutions such as
sed '/? ? ? No/d' ./file
and
sed '/? ? ? Yes/d' ./file
will not work. I want my output to look like
a b c frequent
? ? 1 Yes
1 ? 1 No
? 1 1 Yes
EDIT 1: The data is in a tab-delimited file whose first column is a serial number and whose last column is a space-delimited class label. I want to consider the second through second-to-last columns and remove the rows in which all of those columns are question marks.
No a b c itemname
1 ? ? ? frying pan
2 ? ? 1 t-shirt
3 1 ? 1 microwave oven
10 ? 1 1 forks and knives
11 ? ? ? gold
12 ? ? ? chain
The wanted output is
No a b c itemname
2 ? ? 1 t-shirt
3 1 ? 1 microwave oven
10 ? 1 1 forks and knives
Upvotes: 1
Views: 1391
Reputation: 40748
Regarding your last update, you can modify the solution of @Jotne as follows:
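NR==1 {
p=NF-2 # number of feature columns between the serial number and the label
print # keep the header line, as in the wanted output
next
}
{
f=0 # reset the flag for every row
for (i=1;i<=p;i++) {
if ($(i+1)!="?") f=1 # a feature column holds something other than ?
}
}
f {
print
}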
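As a way to try the idea on the EDIT 1 sample, here is a minimal self-contained sketch. It assumes the file really is tab-delimited (hence -F'\t'), writes the sample to a temporary file, and uses an early exit from the loop instead of a flag; the variable names are only illustrative.

```shell
# Write the EDIT 1 sample as a tab-delimited temporary file.
tmp=$(mktemp)
printf 'No\ta\tb\tc\titemname\n1\t?\t?\t?\tfrying pan\n2\t?\t?\t1\tt-shirt\n3\t1\t?\t1\tmicrowave oven\n10\t?\t1\t1\tforks and knives\n11\t?\t?\t?\tgold\n12\t?\t?\t?\tchain\n' > "$tmp"

# Print the header, then print a row as soon as any field
# from the 2nd to the second-to-last is something other than "?".
result=$(awk -F'\t' '
NR == 1 { print; next }
{
    for (i = 2; i < NF; i++)
        if ($i != "?") { print; next }
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because the field separator is a tab, a label such as "frying pan" stays a single field, so the loop bounds do not depend on how many words the label contains.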
Upvotes: 0
Reputation: 42799
You can try this to handle both cases in one step
sed -r '/(\?\s+){3}(Yes|No)/d' ./file
EDIT:
Regarding the number of ? per line, you can just replace {3} with + if you want "one or more", or use {3,} if you want something like "3 or more", or you can use {3,5} for example if you want to say "between 3 and 5".
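With + the pattern no longer fixes the column count, so it may help to anchor it to the whole line; otherwise a row such as "1 ? ? No" would also match, because the pattern can start in the middle of the line. The sketch below is an assumption-laden illustration: the row "1 ? ? No" is added to the sample purely to show the difference, and -E and [[:space:]] are used in place of -r and \s as more portable spellings.

```shell
# Sample from the question, plus one extra illustrative row ("1 ? ? No").
sample=$(printf 'a b c frequent\n? ? ? No\n? ? 1 Yes\n1 ? 1 No\n? 1 1 Yes\n1 ? ? No\n? ? ? Yes\n')

# ^ and $ force the run of "?" columns to span the whole line before the label.
result=$(printf '%s\n' "$sample" | sed -E '/^(\?[[:space:]]+)+(Yes|No)$/d')

printf '%s\n' "$result"
```

Only the rows made up entirely of ? columns followed by Yes or No are removed; the added "1 ? ? No" row is kept.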
EDIT2:
This is a grep alternative
egrep -v '(\?\s+){3}(Yes|No)' ./fileToTest > outputFile
Note:
The reason sed wasn't working is that the pattern needs extended regular expressions. After checking sed's help, I found that the flag for this is -r.
Upvotes: 3
Reputation: 3838
Using awk:
[ ~]$ cat test.txt
a b c frequent
? ? ? No
? ? 1 Yes
1 ? 1 No
? 1 1 Yes
? ? ? No
? ? ? Yes
[ ~]$ awk '!($0 ~ "\\? *\\? *\\? *(Yes|No)"){print}' test.txt
a b c frequent
? ? 1 Yes
1 ? 1 No
? 1 1 Yes
[ ~]$
You could also use egrep like this:
[ ~]$ egrep -v "\?\ *\?\ *\?\ *(Yes|No)" test.txt
a b c frequent
? ? 1 Yes
1 ? 1 No
? 1 1 Yes
Upvotes: 0
Reputation: 41456
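Escape the ? and use extended regular expressions (-r), so that \? is a literal question mark and + means "one or more":
sed -r '/\? +\? +\? +Yes/d' file
Since your file seems to be separated by multiple spaces, you need the +.
Or, if the file is tab-separated:
sed -r '/\?\t\?\t\?\tNo/d' file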
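For a file with hundreds of columns, the repetition count does not have to be typed by hand. A rough sketch, assuming a tab-delimited file whose header has one field per column: count the header fields, subtract the label column, and put that number into an interval expression. The variable names here are hypothetical, and a literal tab is passed in from the shell so the pattern does not depend on sed understanding \t.

```shell
# Tab-delimited version of the sample from the question, in a temporary file.
tmp=$(mktemp)
printf 'a\tb\tc\tfrequent\n?\t?\t?\tNo\n?\t?\t1\tYes\n1\t?\t1\tNo\n?\t1\t1\tYes\n?\t?\t?\tNo\n?\t?\t?\tYes\n' > "$tmp"

tab=$(printf '\t')
# Number of feature columns = header fields minus the label column.
n=$(head -n 1 "$tmp" | awk -F'\t' '{ print NF - 1 }')

# Delete rows that consist of exactly n "?" columns followed by Yes or No.
result=$(sed -E "/^(\\?${tab}){${n}}(Yes|No)\$/d" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

This handles both labels in one pass and keeps the header, since the header does not match the pattern.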
An awk solution to delete lines that only have ? in every column except the last:
awk '{for (i=1;i<NF;i++) if ($i!="?") f=1} f {print;f=0}' file
Or, using aragaer's approach, print only lines with at least one 1:
awk '/1/ || NR==1' file
a b c frequent
? ? 1 Yes
1 ? 1 No
? 1 1 Yes
Upvotes: 3
Reputation: 17848
Is it guaranteed that a column contains either ? or 1? If yes, simply delete every line unless it contains at least one 1 (or is the first line):
sed -n '1p; /1/p;' file
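This shortcut relies on the character 1 appearing only in the feature columns. A quick check against the EDIT 1 sample from the question, where the serial-number column also contains 1, suggests it would not carry over to that layout; the snippet below is only an illustration of that caveat.

```shell
# EDIT 1 sample: the first column is a serial number, so "1" also appears there.
sample=$(printf 'No\ta\tb\tc\titemname\n1\t?\t?\t?\tfrying pan\n2\t?\t?\t1\tt-shirt\n3\t1\t?\t1\tmicrowave oven\n10\t?\t1\t1\tforks and knives\n11\t?\t?\t?\tgold\n12\t?\t?\t?\tchain\n')

# Same command as above, reading from a pipe instead of a file.
result=$(printf '%s\n' "$sample" | sed -n '1p; /1/p;')

# Rows 1, 11 and 12 survive because of their serial numbers,
# so nothing is filtered out for this layout.
if [ "$result" = "$sample" ]; then
    echo "no rows were removed"
fi
```

For that layout, a field-based check that skips the first and last columns is the safer choice.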
Upvotes: 2