Reputation: 607
I've seen variations of this question which helped me craft an initial guess, mostly involving doing two splits of a column in awk.
Here is an example line of my input:
chr1 Cufflinks transcript 470971 471355 1000 + . gene_id "ENSG00000236679.2"; transcript_id "ENST00000458203.2"; FPKM "0.0792422960"; frac "1.000000"; conf_lo "179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000"; conf_hi "0.000000"; cov "0.233090"; full_read_support "yes";
(Yeah, conf_lo is a weird number, but it's a bug in the program used to generate this.)
It is tab-delimited, but one field ($9) is itself split into key-value pairs by semicolons and spaces. I want to use awk to keep only lines whose FPKM value (the third pair in $9) is greater than 0, which involves two splits. If a line passes the filter, it should be printed with its fields rearranged. This is my best guess so far:
awk -F"\t" 'BEGIN {
OFS="\t";
split($9,t,";");
split(t[3],t3,"\"");
if (t3[2]>0.0) {
print $1,$4,$5,$9,$6,$7;}
}' transcripts.gtf > $input.bed
This is probably just a simple misunderstanding somewhere, but I'm not sure what I'm doing wrong.
Thanks for any help.
Upvotes: 1
Views: 34
Reputation: 77155
You've got most of it right, except that you have written the entire script inside the BEGIN block. BEGIN runs once, before any input line is read, so $9 is empty there and nothing is ever printed. The per-line logic belongs in the main block.
Try this:
awk '
BEGIN { FS = OFS = "\t" }
{
split ($9, t, ";");
split (t[3], t3, "\"");
if (t3[2]>0.0) {
print $1, $4, $5, $9, $6, $7
}
}' transcripts.gtf > $input.bed
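To see the difference between the blocks, here is a minimal sketch on made-up two-line input (not the real GTF file). It relies only on NR, which is zero inside BEGIN because no record has been read yet; the variable name `out` is just for illustration.

```shell
# BEGIN runs before any input is read (NR is 0 there),
# the main block runs once per record, END runs after the last one.
out=$(printf 'a\tb\nc\td\n' | awk -F'\t' '
BEGIN { print "BEGIN: records read so far = " NR }
      { print "main: record " NR " starts with " $1 }
END   { print "END: records read = " NR }')
printf '%s\n' "$out"
```

The first line of output reports 0 records, which is why a filter placed in BEGIN never sees any data.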
Having said that, you don't need the second split. Use the gsub function to remove everything except digits and dots:
awk '
BEGIN { FS = OFS = "\t" }
{
split ($9, t, ";");
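gsub (/[^.[:digit:]]+/, "", t[3]);   # strip everything but digits and dots
if (t[3] + 0 > 0) {                  # + 0 forces a numeric comparison: gsub leaves t[3] as a string, and "0.000000" > 0 would compare as strings and pass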
print $1, $4, $5, $9, $6, $7
}
}' transcripts.gtf > $input.bed
You can add - inside the character class ([^.[:digit:]-]) if your values can be negative.
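If the position of FPKM inside $9 is not guaranteed, a variant that finds the pair by its key may be more robust. This is only a sketch under assumptions: the two sample lines are made up (shortened from the line in the question, one with a zero FPKM), and the names `sample`, `kept`, `kv` and `q` are illustrative. The `+ 0` is there to make sure the value is compared as a number.

```shell
# Two made-up GTF-like lines: one with FPKM > 0, one with FPKM equal to 0.
sample=$(printf 'chr1\tCufflinks\ttranscript\t100\t200\t1000\t+\t.\tgene_id "g1"; transcript_id "t1"; FPKM "0.0792422960"; frac "1.000000";\nchr1\tCufflinks\ttranscript\t300\t400\t1000\t-\t.\tgene_id "g2"; transcript_id "t2"; FPKM "0.0000000000"; frac "1.000000";\n')

kept=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = OFS = "\t" }
{
    n = split($9, kv, ";")
    for (i = 1; i <= n; i++) {
        # look for the piece whose key is FPKM, wherever it sits in $9
        if (kv[i] ~ /^ *FPKM /) {
            split(kv[i], q, "\"")      # q[2] is the text between the quotes
            if (q[2] + 0 > 0)          # + 0 forces a numeric comparison
                print $1, $4, $5, $9, $6, $7
        }
    }
}')
printf '%s\n' "$kept"
```

On this sample only the first line is printed, since the second has an FPKM of zero.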
Upvotes: 2