Reputation: 561
I have a genetics data like this:
MUT1 G_->_A_(het) 44%_(96)___[45%_(49)_/_43%_(47)] rs1799967_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.4956G>A 1
MUT1 A_->_G_(homo) 99%_(297)___[99%_(151)_/_99%_(146)] rs206075_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.4563A>G 1
MUT1 G_->_C_(homo) 100%_(259)___[100%_(132)_/_100%_(127)] COSM4147689_(COSMIC),_COSM4147690_(COSMIC),_rs206076_(Gene_file;_1000Genomes;_ClinVar;_ClinVarVCF;_dbSNP) c.6513G>C 2
MUT1 A_->_C_(het) 41%_(103)___[42%_(53)_/_40%_(50)] COSM3753646_(COSMIC),_COSM147663_(COSMIC),_rs144848_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.1114A>C 5
What I need to parse this data and extract only some kind of fields.
Require output is:
MUT1 het 44% rs1799967 c.4956G>A 1
MUT1 homo 99% rs206075c.4563A>G 1
MUT1 homo 100% rs206076 c.6513G>C 2
MUT1 het 41% rs144848 c.1114A>C 5
So output shoould be - all first column, from second column only het or hom, third column is only %, fifth column should be extracted only rs_number - this have always different position and last column.
note: I know, that info about homo/het is always in last filed of second column. And % is always on first field in third column.
My solution is:
awk -v OFS="\t" '{print $1,$5,$6,$9,$10,$11}' zkouska.csv | awk -v OFS="\t" 'NR>1{split($2,arr2,"_"); split($3,arr3,"_"); print $1,arr2[4],arr3[1],$4,$5,$6}'
But output is :
BRCA1 (het) 44% rs1799967_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.4956G>A 1
BRCA1 (homo) 99% rs206075_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.4563A>G 1
BRCA1 (homo) 100% COSM4147689_(COSMIC),_COSM4147690_(COSMIC),_rs206076_(Gene_file;_1000Genomes;_ClinVar;_ClinVarVCF;_dbSNP) c.6513G>C 2
BRCA1 (het) 41% COSM3753646_(COSMIC),_COSM147663_(COSMIC),_rs144848_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.1114A>C 5
BRCA1 (homo) 100% COSM148277_(COSMIC),_COSM3755561_(COSMIC),_rs16942_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.3548A>G 5
Still have problem to extract rs from fifth column. Erase quotes in second field. Inpout and output should be TAB separate. Solution could no be only in awk.
Upvotes: 3
Views: 390
Reputation: 85690
Am posting a pure bash
logic for your requirement.
#!/bin/bash
while read col1 col2 col3 col4 col5 col6
do
subcol2="${col2#*(}";subcol2=${subcol2%)*} # Extracting string within the braces '()' using parameter-expansion
[[ $col4 =~ .*rs([[:digit:]]+).* ]] && subcol4="${BASH_REMATCH[1]}" # RegEx in bash to extract number following the 'rs' string
printf "%s %s %s %s %s %s\n" "$col1" "$subcol2" "${col3%%_*}" "rs$subcol4" "$col5" "$col6"
done <file
On running the script, produces the result as
$ bash script.sh
MUT1 het 44% rs1799967 c.4956G>A 1
MUT1 homo 99% rs206075 c.4563A>G 1
MUT1 homo 100% rs206076 c.6513G>C 2
MUT1 het 41% rs144848 c.1114A>C 5
NOTE: The solution may perform slower on bigger files. I have tested this only on your sample file.
Upvotes: 3
Reputation: 47119
Using a combination of gsub
and match
might be the way to go, here is a portable example:
parse.awk
{
gsub(/^[^(]+\(|\)/, "", $2)
gsub(/_.*/, "", $3)
match($4, /rs[0-9]+/)
print $1, $2, $3, substr($4, RSTART, RLENGTH), $5, $6
}
Run it like this:
awk -f parse.awk OFS='\t' < infile
Output:
MUT1 het 44% rs1799967 c.4956G>A 1
MUT1 homo 99% rs206075 c.4563A>G 1
MUT1 homo 100% rs206076 c.6513G>C 2
MUT1 het 41% rs144848 c.1114A>C 5
Upvotes: 3
Reputation: 23677
$ perl -lne 'print join "\t", /^(\S+)/,/^[^(]+\(\K([^)]+)/,/^[^)]+\)\s+\K(\d+%)/,/(rs\d+)/,/(\S+\s+\S+)\s*$/' file
MUT1 het 44% rs1799967 c.4956G>A 1
MUT1 homo 99% rs206075 c.4563A>G 1
MUT1 homo 100% rs206076 c.6513G>C 2
MUT1 het 41% rs144848 c.1114A>C 5
/^(\S+)/
extract non-whitespace characters from start of line/^[^(]+\(\K([^)]+)/
extract characters between first ()
/^[^)]+\)\s+\K(\d+%)/
extract first match of digits followed by % after first )
in the line/(rs\d+)/
extract rs
followed by digits/(\S+\s+\S+)\s*$/
extract last two columns
Another way is to process each field separately, similar to bash
and awk
solutions
$ perl -lane '
$F[1] =~ s/.*\(|\)//g;
$F[2] =~ s/_.*//;
($F[3]) = $F[3] =~ m/(rs\d+)/;
print join "\t", @F;
' file
MUT1 het 44% rs1799967 c.4956G>A 1
MUT1 homo 99% rs206075 c.4563A>G 1
MUT1 homo 100% rs206076 c.6513G>C 2
MUT1 het 41% rs144848 c.1114A>C 5
Upvotes: 4