matching and appending strings to headers

Question

I want to append strings to sequence headers in a FASTA file.

Input:

>uce-101_seqname
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA

Desired output:

>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA

Example code:

awk -F ">" '{if($2 ~ /^uce/){print $0 " |" substr($2,1,7)} else {print $0}}'

The example code only works when the ID is exactly 7 characters long (e.g., uce-101). I need it to work for shorter and longer IDs as well (e.g., uce-1, uce-10, uce-1001).

Jotne · Accepted Answer

This should do:

awk -F">|_" 'NF>2 {$0=$0" |"$2}1' file
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA

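Set the field separator to > or _, so that on a header line the ID is the second field.
If a line contains more than two fields (i.e., it is a header containing both > and _), append " |" and the second field to it.
The trailing 1 prints every line, modified or not.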
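
To check that this handles IDs of any length, here is a minimal sketch that runs the same command on a small made-up file. The file contents and the variable names tmp and result are assumptions for illustration only; they are not from the question.

```shell
# Build a small test file whose IDs have different lengths (hypothetical data).
tmp=$(mktemp)
printf '%s\n' \
  '>uce-1_seqname' 'ACGT' \
  '>uce-10_seqname' 'ACGT' \
  '>uce-1001_seqname' 'ACGT' > "$tmp"

# Same command as above: split on ">" or "_", then append " |" plus
# field 2 to every line that has more than two fields (the headers).
result=$(awk -F">|_" 'NF>2 {$0=$0" |"$2}1' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Each header should come out with its own full ID appended (|uce-1, |uce-10, |uce-1001), and the sequence lines should be unchanged, because the ID is taken up to the first underscore rather than by a fixed character count.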

If you only want to change headers whose ID starts with uce, then this should do:

awk -F">|_" '$2~/^uce/ {$0=$0" |"$2}1' file
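
As a quick check of the uce test, the sketch below runs that command on a made-up file with one uce header and one other header. The locus-3 header and the variable names tmp and filtered are hypothetical and only serve the illustration.

```shell
# Hypothetical input: one header that starts with uce, one that does not.
tmp=$(mktemp)
printf '%s\n' \
  '>uce-7_seqname' 'ACGT' \
  '>locus-3_seqname' 'TTGA' > "$tmp"

# Only lines whose second field starts with "uce" get the suffix;
# the trailing 1 still prints every line.
filtered=$(awk -F">|_" '$2~/^uce/ {$0=$0" |"$2}1' "$tmp")
printf '%s\n' "$filtered"

rm -f "$tmp"
```

Only the uce-7 header should gain the " |uce-7" suffix. The locus-3 header and both sequence lines should pass through unchanged, since their second field is either empty or does not start with uce.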
