Reputation: 69
I ran an hmmscan analysis on a FASTA file and requested tabular output with the --tblout option. That format is deliberately space-delimited (rather than tab-delimited) and padded into aligned columns.
The file looks like this (this is just an example of the format):
targetname  accession  queryname    accession  e-value  score  bias
x_x_x       PFyyyy.y   ContigXXX_0  -          x.xe-xx  yy.y   x.x
x           PFyyyy.yy  ContigXXX_1  -          xe-x     yy.y   x.x
x_x         PFyyyy.y   ContigXXX_2  -          xe-xx    y.y    x.x
x_x_x       PFyyyy.yy  ContigXXX_3  -          x.xe-x   yy.y   x.x
.
..
where the target names are, for example, Methyltransf, Dimer_tnp_hAT or Nucleotide_trans;
where the first accession column holds values such as PF13847.1, PF03407.11 or PF01958.13;
where the query names are, for example, Contig244_1, Contig44245_3 or Contig12345_6;
where the second accession column is always: -
where the e-values are, for example, 4.0e-10 or 3.5e-15;
and where the score and bias are numbers in the format xx.x.
What I'd like to do is extract the queryname column, which lists every ContigXXX_X with a significant hit to a protein domain.
After that I'll be able to sort the names, keep only the first occurrence of each Contig, and compare the result with my BlastP and BlastX results (where I was already able to get the list of my Contigs that have hits to the nr database).
So my question is: how can I extract the column that holds my Contigs? I've tried the grep, sed and cut commands, but I haven't found the right one yet.
I'm new to Unix and still learning, so any suggestions will be really appreciated.
And if my question is not clear, just tell me and I can modify it!
Upvotes: 1
Views: 3731
Reputation: 753595
Superficially, you can use cut, but not on its own: cut treats every single blank as a field delimiter, so the runs of padding spaces in this file produce empty fields. Squeeze each run of spaces down to one space with tr -s first:
tr -s ' ' < tblout-file | cut -d ' ' -f 3
Here tr -s ' ' collapses each run of spaces into a single space, and cut -d ' ' -f 3 then picks the third space-separated field. This assumes that the lines do not start with a space.
Does that not work for you? Obviously, you substitute the name of the file you created for tblout-file.
If there's a problem, then consider awk instead, which splits fields on runs of whitespace by default:
awk '{print $3}' tblout-file
Both of these include the first line in the output too; there are multiple possible ways of removing the first line:
tr -s ' ' < tblout-file | cut -d ' ' -f 3 | sed 1d
awk 'NR>1 { print $3 }' tblout-file
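The question also mentions sorting the names and keeping one copy of each Contig. A minimal sketch of that step follows, assuming the layout shown in the question (query name in column 3, one header line). The function name unique_queries is hypothetical, and the check for lines starting with '#' is an assumption added in case the file carries comment lines.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct query name (column 3) once.
# Assumes the first line is a header; lines starting with '#' are also
# skipped, on the assumption that they are comments.
unique_queries() {
    awk 'NR > 1 && !/^#/ { print $3 }' "$1" | sort -u
}
```

It could be used as: unique_queries tblout-file > hmmscan_contigs.txt (the output file name is only an example).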
Upvotes: 1
Reputation: 67211
awk 'NR!=1{print $3}' your_file
or
perl -lane 'print $F[2] if $. != 1' your_file
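Since the question asks for Contigs with significant hits and only the first occurrence of each, the same awk approach can be extended in one pass. This is only a sketch: it assumes the e-value sits in column 5 as in the question's example, the function name first_significant is hypothetical, and the cutoff is whatever threshold you consider significant (1e-5 below is an arbitrary illustration, not a recommendation).

```shell
#!/bin/sh
# Hypothetical helper: print query names (column 3) whose e-value
# (column 5, assumed) is at or below a cutoff, keeping only the first
# occurrence of each name in file order.
# $1 = tblout file, $2 = e-value cutoff (user-chosen assumption)
first_significant() {
    awk -v cutoff="$2" '
        NR > 1 && !/^#/ && ($5 + 0) <= (cutoff + 0) && !seen[$3]++ { print $3 }
    ' "$1"
}
```

For example: first_significant your_file 1e-5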
Upvotes: 1