BioMan

Reputation: 704

take out specific columns from multiple files

I have multiple tab-separated files that look like the one below. From each file I would like to extract column 1 and the column that starts with XF:Z:, which should give me output 1. The file names are htseqoutput*.sam.sam, where * varies. I am not sure which awk command to use, or whether the for loop is correct.

for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done

input example

AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11  16  chr22   39715068    24  51M *   0   0   GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:18T31G0    YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16  chr19   4724687 40  33M *   0   0   AGGCGAATGTGATAACCGCTACACTAAGGAAAC   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII   AS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:26C6   YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16  chr13   45492060    23  21M *   0   0   GGTTCCCACACCGGGAGTCGA   IIIIIIIIIIIIIIIIIIIII   AS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0C20   YT:Z:UU XF:Z:tRNA

output 1:

AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11   SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA

Upvotes: 0

Views: 23

Answers (2)

Tom Fenech

Reputation: 74596

Seems like you could just use sed for this:

sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file

This captures the part at the start of the line and the alphanumeric part following XF:Z: and outputs them, separated by a tab character. One potential advantage of this approach is that it will work independently of the position of the XF:Z: string.

Your loop looks OK (you can use this sed command in place of the awk part), but be careful with your quotes: straight double quotes " should be used, not curly quotes “ ”.
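Put together, the loop could look something like the sketch below. The temporary directory and the two shortened sample files are made up for illustration, and extract_xf is just a hypothetical helper name wrapping the sed command. It uses sed -E with a literal tab stored in a variable instead of -r and \t, on the assumption that this is a bit more portable across sed implementations.

```shell
#!/bin/sh
# Sketch only: the demo directory and sample files are invented for illustration.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Two tiny tab-separated stand-ins for the real htseqoutput*.sam.sam files.
printf 'ACGT_0\t16\tchr19\tYT:Z:UU\tXF:Z:tRNA\n' > htseqoutputA.sam.sam
printf 'GGTT_11\t16\tchr22\tYT:Z:UU\tXF:Z:SNORD43\n' > htseqoutputB.sam.sam

# A literal tab, because not every sed understands \t in the replacement.
tab=$(printf '\t')

# Hypothetical helper: keep the read name and the value after XF:Z:.
extract_xf() {
    sed -E "s/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1${tab}\2/" "$@"
}

# Same loop as in the question, with straight double quotes.
for f in htseqoutput*.sam.sam
do
    extract_xf "$f" > "out${f#htseqoutput}"
done

cat outA.sam.sam outB.sam.sam
```

Each input file htseqoutputX.sam.sam ends up as outX.sam.sam, because ${f#htseqoutput} strips the leading htseqoutput from the name.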

Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:

awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file

This optionally adds the XF:Z: part to the field separator, so that it is removed from the start of the last field.
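To see what the separator does, here is a small sketch on one shortened, made-up line. It is a variation rather than the exact command above: it uses a literal tab (\t) in place of [[:space:]], on the assumption that the files really are tab-separated, since some older awk builds do not understand POSIX character classes. first_and_xf is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: first_and_xf is a made-up name, and the input line is shortened.
# The field separator is a tab, optionally followed by the literal text XF:Z:,
# so the prefix is swallowed by the separator and the last field is the bare value.
first_and_xf() {
    awk -F'\t(XF:Z:)?' -v OFS='\t' '{print $1, $NF}' "$@"
}

printf 'ACGT_0\t16\tchr19\tYT:Z:UU\tXF:Z:tRNA\n' | first_and_xf
```

If XF:Z: is not the last field, $NF is simply whatever comes last on the line, so this only works under the assumption stated above.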

Upvotes: 2

Jose Ricardo Bustos M.

Reputation: 8164

You can try this if the column with "XF:Z:" is always at the end:

awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam

which gives:

AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11  SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA

Or, if the position of this column varies from file to file (it is looked up once, on the first line of each file):

awk 'BEGIN{OFS="\t"}
     FNR==1{
       for(i=1;i<=NF;i++){
         if($i ~ /^XF:Z:/) break
       }
     }
     {n=split($i,a,":"); print $1, a[n]}' file.sam
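If the position might also change from line to line within one file, a possible variation is to search every line instead of only the first. This is a sketch under that assumption: xf_per_line is a hypothetical helper name, the sample lines are made up, and lines without an XF:Z: field are silently skipped, which may or may not be what you want.

```shell
#!/bin/sh
# Sketch only: xf_per_line is a made-up name and the input lines are invented.
# For each line, scan the fields after the first for one starting with XF:Z:
# and print column 1 plus the text after that 5-character prefix.
xf_per_line() {
    awk -F'\t' -v OFS='\t' '{
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^XF:Z:/) {
                print $1, substr($i, 6)   # value starts at character 6
                next                      # done with this line
            }
        }
        # no XF:Z: field found: nothing is printed for this line
    }' "$@"
}

printf 'r1_0\t16\tXF:Z:tRNA\tYT:Z:UU\nr2_0\t16\tYT:Z:UU\tXF:Z:SNORD43\nr3_0\t16\tYT:Z:UU\n' | xf_per_line
```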

Upvotes: 1
