Reputation: 3334
I would like to find a specific string, and combinations of strings, in one column. Could you help me, please?
INPUT:
benign,likely_pathogenic
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
uncertain_significance,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
uncertain_significance,conflicting_interpretations_of_pathogenicity,likely_benign
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
pathogenic
OUTPUT:
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
benign,likely_pathogenic
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
pathogenic
I would like to extract every row whose column contains pathogenic or likely_pathogenic. The problem is that the string pathogenic is also part of conflicting_interpretations_of_pathogenicity. I tried
awk -F'\t' -v OFS="\t" '{if($14=="pathogenic") print FILENAME,$0; else if($14=="likely_pathogenic") print FILENAME,$0}'
but that only matches when the column is exactly that string.
If I try:
awk -F'\t' -v OFS="\t" '{if($14~"pathogenic") print FILENAME,$0}'
I get all rows with pathogenic, likely_pathogenic and conflicting_interpretations_of_pathogenicity. A single row can contain a combination of conflicting... and pathogenic or likely_pathogenic.
Upvotes: 2
Views: 91
Reputation: 3975
$ grep -E '(likely_)?pathogenic\b' file
$ sed -En '/(likely_)?pathogenic\b/p' file
$ awk '/(likely_)?pathogenic\y/' file
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
benign,likely_pathogenic
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
pathogenic
Upvotes: 1
Reputation: 2815
The best (incomplete) approach I could quickly get to without using a word-boundary regex:
echo "${input….}" | mawk '$!(NF=NF)~ /pathogenic/' FS='[^,]*pathogenic[[:alpha:]][^,]*' OFS=
1 benign,likely_pathogenic
2 benign,likely_pathogenic
3 risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
4 risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
5 risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,
6 pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
7 benign,likely_pathogenic
8 benign,likely_pathogenic
9 ,_other,benign,pathogenic,likely_benign,
10 ,_other,benign,pathogenic,likely_benign,
11 risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,
12 pathogenic,likely_pathogenic
13 pathogenic
It might be deleting too much around rows 9-10.
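A non-destructive variant of the same "no word-boundary regex" idea is sketched below: instead of rewriting the record, pad the list with commas and look for the exact comma-delimited items with `index()`. This is only a sketch, assuming the comma-separated list is the whole line, as in the posted sample. The names `filter_tokens` and `has_token` are hypothetical helpers introduced here, not part of the original answer.

```shell
# filter_tokens is a hypothetical helper name, not from the original answer.
# It prints lines whose comma-separated list contains the exact item
# "pathogenic" or "likely_pathogenic", leaving each line unmodified.
filter_tokens() {
  awk '
    # Pad the list with commas so the first and last items are delimited too.
    function has_token(list, tok) { return index("," list ",", "," tok ",") > 0 }
    has_token($0, "pathogenic") || has_token($0, "likely_pathogenic")
  ' "$@"
}

# A few lines taken from the sample input in the question.
printf '%s\n' \
  'benign,likely_pathogenic' \
  'benign,conflicting_interpretations_of_pathogenicity' \
  'conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity' \
  'pathogenic' | filter_tokens
```

Because the record is never reassigned, rows such as 9-10 are printed intact, and the approach uses only plain string functions, so it should not depend on a particular awk implementation.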
Upvotes: 1
Reputation: 36360
I would exploit GNU AWK's word boundary for this task as follows. Let file.txt content be
benign,likely_pathogenic
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
uncertain_significance,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
uncertain_significance,conflicting_interpretations_of_pathogenicity,likely_benign
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
pathogenic
then
awk '/pathogenic\y/{print}' file.txt
gives output
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
benign,likely_pathogenic
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
pathogenic
Explanation: the word boundary (\y) is a zero-length assertion. It can be placed before a pattern, after it, or both: the first matches words starting with the pattern, the second words ending with it, the third whole words. So pathogenic\y means words ending with pathogenic. GNU AWK defines a word as a sequence of one or more letters, digits or underscores. Note: the output differs slightly from the desired output shown, as it includes the 4th risk_factor line, but it is compliant with the description, as that line holds ,pathogenic,
(tested in gawk 4.2.1)
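Since `\y` is a gawk-specific operator, the trailing boundary can be approximated in other awks with an explicit alternation: "pathogenic" followed either by a non-word character or by the end of the line. The sketch below assumes ASCII data and uses the same definition of a word as above (letters, digits, underscores). `ends_word` is a hypothetical helper name used only for illustration.

```shell
# ends_word is a hypothetical helper name. It prints lines containing a word
# that ends in "pathogenic", where a word is a run of letters, digits or
# underscores; the trailing \y is emulated with ([^A-Za-z0-9_]|$).
ends_word() {
  awk '/pathogenic([^A-Za-z0-9_]|$)/' "$@"
}

# A few lines taken from the sample input in the question.
printf '%s\n' \
  'benign,likely_pathogenic' \
  'benign,conflicting_interpretations_of_pathogenicity' \
  'pathogenic,likely_pathogenic' \
  'pathogenic' | ends_word
```

The second sample line is skipped because in pathogenicity the match is followed by the letter i, which is a word character.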
Upvotes: 2
Reputation: 37394
Something like this, maybe:
awk '{
    split($0,a,/,/)                          # split NEEDED field on commas
    for(i in a)                              # check each part
        if(a[i]~/^(likely_)?pathogenic$/) {  # if matches this regex
            print                            # output
            break                            # no need for more matches
        }
}' file
Some output:
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
...
Obviously you need to add FS etc., as in your sample code you were processing field $14.
Edit:
I guess this would work too for the posted sample data:
$ awk '/(^|,)(likely_)?pathogenic(,|$)/' file
or for your assumed data:
$ awk '$14~/(^|,)(likely_)?pathogenic(,|$)/' file
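Putting this together with the options from the question (`-F'\t'`, `OFS`, and the `FILENAME` prefix) might look like the sketch below. It assumes the list sits in the 14th tab-separated field, as in the asker's command. The file names `variants.tsv` and `hits.tsv`, the `row` helper and the `c1`..`c13` filler columns are made up for the demo.

```shell
# Build a tiny tab-separated demo file with the list in field 14.
# variants.tsv, row() and the c1..c13 filler columns are illustrative only.
row() { printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tc12\tc13\t%s\n' "$1"; }
{
  row 'benign,likely_pathogenic'
  row 'benign,conflicting_interpretations_of_pathogenicity'
  row 'pathogenic'
} > variants.tsv

# Keep rows whose 14th field has pathogenic or likely_pathogenic as a whole
# comma-separated item, and prefix each kept row with the file name.
awk -F'\t' -v OFS='\t' \
  '$14 ~ /(^|,)(likely_)?pathogenic(,|$)/ { print FILENAME, $0 }' \
  variants.tsv > hits.tsv

cat hits.tsv
```

Anchoring on `(^|,)` and `(,|$)` works here because `-F'\t'` makes `$14` hold only the comma-separated list, so the start and end of the field act as item delimiters too.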
Upvotes: 3