user1819854

Reputation: 69

Cutting a specific column from a space delimited file

I ran an hmmscan analysis on a FASTA file and asked for tabular output with the --tblout option. That format is deliberately space-delimited (rather than tab-delimited) and padded so that the columns line up.

The file looks like this (this is just a format example):

targetname accession queryname    accession  e-value score bias
x_x_x      PFyyyy.y  ContigXXX_0  -          x.xe-xx yy.y  x.x
x          PFyyyy.yy ContigXXX_1  -          xe-x    yy.y  x.x
x_x        PFyyyy.y  ContigXXX_2  -          xe-xx    y.y  x.x
x_x_x      PFyyyy.yy ContigXXX_3  -          x.xe-x  yy.y  x.x
.
..

where the target names are, for example, Methyltransf, Dimer_tnp_hAT or Nucleotide_trans;

the first accession column holds values such as PF13847.1, PF03407.11 or PF01958.13;

the query names are, for example, Contig244_1, Contig44245_3 or Contig12345_6;

the second accession column is always: -

the e-values are, for example, 4.0e-10 or 3.5e-15;

and score and bias are numbers in this format: xx.x

What I'd like to do is extract the queryname column, which lists every ContigXXX_X that has a significant hit to a protein domain.

After that I'll be able to sort the names, keep only the first occurrence of each Contig, and compare the result with the output from BlastP and BlastX (where I have already managed to get the list of my Contigs that have hits to the nr database).

So my question is: how can I extract the column that holds my Contigs? I've been trying the grep, sed and cut commands, but I haven't found the right one yet.

I'm new to Unix and still learning, so any suggestions will be much appreciated.

And if my question is not clear, just tell me and I'll edit it!

Upvotes: 1

Views: 3731

Answers (2)

Jonathan Leffler

Reputation: 753595

On its own, cut is a poor fit for this file. It treats every single delimiter character as a field separator, so the runs of spaces used for alignment produce empty fields, and neither POSIX nor GNU cut has an option to treat a run of blanks as one delimiter. If you want to use cut, squeeze the repeated spaces with tr first:

tr -s ' ' < tblout-file | cut -d ' ' -f 3

Here tr -s ' ' collapses each run of spaces into a single space, and cut -d ' ' -f 3 then selects the third field. This assumes that no line starts with a space; a leading space would shift the field numbers by one. Obviously, you substitute the name of the file you created for tblout-file.

A simpler option is awk, which splits on runs of whitespace by default:

awk '{print $3}' tblout-file

Both of these include the first (header) line in the output too; there are multiple possible ways of removing it, for example:

tr -s ' ' < tblout-file | cut -d ' ' -f 3 | sed 1d
awk 'NR>1 { print $3 }' tblout-file
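
Since the stated goal is to keep only the first occurrence of each Contig, the extraction and the de-duplication can be combined in one awk call. What follows is a minimal sketch, assuming the layout shown in the question (one header line, query name in the third whitespace-separated field). The function name first_contigs and the sample rows are made up for illustration.

```shell
# Print the third field of every line after the header, but only the
# first time each value is seen (original order is preserved).
# Reads the files given as arguments, or standard input if there are none.
first_contigs() {
    awk 'NR > 1 && !seen[$3]++ { print $3 }' "$@"
}

# Made-up sample in the layout from the question.
sample='targetname       accession  queryname   accession e-value score bias
Methyltransf     PF13847.1  Contig244_1 -         4.0e-10 35.2  0.1
Dimer_tnp_hAT    PF03407.11 Contig244_1 -         3.5e-15 50.1  0.0
Nucleotide_trans PF01958.13 Contig12_6  -         1e-5    20.3  0.2'

contigs=$(printf '%s\n' "$sample" | first_contigs)
printf '%s\n' "$contigs"
# prints:
# Contig244_1
# Contig12_6
```

On a real file this would be run as first_contigs tblout-file. Note that with several input files, NR keeps counting across them, so only the header of the first file is skipped.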

Upvotes: 1

Vijay

Reputation: 67211

awk 'NR!=1{print $3}' your_file

or

perl -lane 'print $F[2] if $. != 1' your_file
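
Skipping only line 1 relies on the file having exactly one header line, as in the example in the question. Real hmmscan --tblout files normally mark their header and trailer lines with a leading #, so a somewhat more robust sketch filters on that instead. This assumes comment lines start with # and data lines have at least three fields; the function name query_names and the sample rows are hypothetical.

```shell
# Print the third field of every data line, ignoring comment lines
# (leading '#') and lines with fewer than three fields.
query_names() {
    awk '!/^#/ && NF >= 3 { print $3 }' "$@"
}

# Made-up sample with comment lines before and after the data.
sample='# target name     accession  query name  accession E-value score bias
Methyltransf     PF13847.1  Contig244_1 -         4.0e-10 35.2  0.1
Dimer_tnp_hAT    PF03407.11 Contig12_6  -         3.5e-15 50.1  0.0
Nucleotide_trans PF01958.13 Contig244_1 -         1e-5    20.3  0.2
# trailing comment'

# LC_ALL=C makes the sort order byte-wise and therefore predictable.
unique=$(printf '%s\n' "$sample" | query_names | LC_ALL=C sort -u)
printf '%s\n' "$unique"
# prints:
# Contig12_6
# Contig244_1
```

Unlike the seen[] idiom, sort -u does not keep the original order, which is fine when the list is only going to be compared against another sorted list.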

Upvotes: 1
