Reputation: 704
I have multiple files that look like the one below. They are tab-separated. For all the files I would like to take out column 1 and the column that start with XF:Z:. This will give me output 1 The files names are htseqoutput*.sam.sam where * varies. I am not sure about the awk function use, and if the for-loop is correct.
for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done
input example
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 16 chr22 39715068 24 51M * 0 0 GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T31G0 YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16 chr19 4724687 40 33M * 0 0 AGGCGAATGTGATAACCGCTACACTAAGGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:26C6 YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16 chr13 45492060 23 21M * 0 0 GGTTCCCACACCGGGAGTCGA IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C20 YT:Z:UU XF:Z:tRNA
output 1:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Upvotes: 0
Views: 23
Reputation: 74596
Seems like you could just use sed for this:
sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file
This captures the part at the start of the line and the alphanumeric part following XF:Z:
and outputs them, separated by a tab character. One potential advantage of this approach is that it will work independently of the position of the XF:Z:
string.
Your loop looks OK (you can use this sed command in place of the awk part) but be careful with your quotes. "
should be used, not “
/”
.
Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:
awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file
This optionally adds the XF:Z:
part to the field separator, so that it is removed from the start of the last field.
Upvotes: 2
Reputation: 8164
You can try, if column with "XF:Z:" is always at the end
awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam
you get,
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43 GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA TCGACTCCCGGTGTGGGAACC_0 tRNA
or, if this column is a variable position for each file
awk 'BEGIN{OFS="\t"}
FNR==1{
for(i=1;i<=NF;i++){
if($i ~ /^XF:Z:/) break
}
}
{n=split($i,a,":"); print $1, a[n]}' file.sam
Upvotes: 1