user4400727

matching and appending strings to headers

I want to append strings to sequence headers in a FASTA file.

Input:

>uce-101_seqname
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA

Desired output:

>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA

Example code:

awk -F ">" '{if($2 ~ /^uce/){print $0 " |" substr($2,1,7)} else {print $0}}' inputfile

The example code only works when the identifier is exactly 7 characters long (e.g., uce-101). I need it to work for identifiers that are shorter or longer than 7 characters (e.g., uce-1, uce-10, uce-1001).

Upvotes: 1

Answers (2)

Jotne

Reputation: 41456

This should do:

awk -F">|_" 'NF>2 {$0=$0" |"$2}1' file
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA

Set the field separator to > or _
If a line contains more than two fields, append " |" and the second field to it
Print all lines.

If you need to test for uce, then this should do:

awk -F">|_" '$2~/^uce/ {$0=$0" |"$2}1' file
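
As a quick check that this handles identifiers of any length, here is a minimal sketch that runs the same command on a small made-up input. The headers (uce-1_a, uce-10_b, uce-1001_c) and the function name tag_headers_awk are illustrative assumptions, not part of the original file:

```shell
# Hypothetical wrapper around the awk one-liner above; reads FASTA on stdin.
tag_headers_awk() {
  # Split on '>' or '_', so $2 is the text between '>' and the first '_'.
  awk -F'>|_' '$2 ~ /^uce/ { $0 = $0 " |" $2 } 1'
}

# Made-up headers of different lengths, each followed by a sequence line.
printf '%s\n' '>uce-1_a' 'ACGT' '>uce-10_b' 'GGCC' '>uce-1001_c' 'TTAA' | tag_headers_awk
# >uce-1_a |uce-1
# ACGT
# >uce-10_b |uce-10
# GGCC
# >uce-1001_c |uce-1001
# TTAA
```

Because the identifier is taken from the field boundaries rather than a fixed character count, its length does not matter.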

Upvotes: 2

Steve

Reputation: 54402

I think shellter has hit the nail on the head with his comment above. With that, your line of code could be reduced to:

awk -F '>' '$2~/^uce/ { x=$2; sub(/_.*/,"",x); print $0, "|" x; next }1' file

Results:

>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA

However, if you'd prefer a sed solution, you could try:

sed '/^>uce/s/>\([^_]*\).*/& |\1/' file

Results:

>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA

Explanation:

/^>uce/      # This is an address that specifies which lines are to be
             # examined or modified. In this case, only lines beginning
             # with the string '>uce' are addressed.

s/../../     # Perform a substitution using the '/' delimiter

>\([^_]*\).* # This is the pattern to be matched. The '>' character is a
             # literal '>'. Escaped parentheses then capture a run of
             # zero or more characters that are not underscores. This
             # is followed by any character, any number of times, so
             # the pattern matches through the end of the line.

& |\1        # This is the replacement string. The '&' character stands
             # for the whole text that was matched. This is followed by
             # a literal space and a literal pipe character. '\1' is the
             # text captured by our escaped parentheses.
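
To see the capture group at work on identifiers of different lengths, here is a small sketch using the same sed command. The input headers and the function name tag_headers_sed are made up for illustration:

```shell
# Hypothetical wrapper around the sed command above; reads FASTA on stdin.
tag_headers_sed() {
  # On lines starting with '>uce', capture everything between '>' and the
  # first '_' and append it to the line after " |".
  sed '/^>uce/s/>\([^_]*\).*/& |\1/'
}

# Made-up headers of different lengths, each followed by a sequence line.
printf '%s\n' '>uce-1_a' 'ACGT' '>uce-1001_c' 'TTAA' | tag_headers_sed
# >uce-1_a |uce-1
# ACGT
# >uce-1001_c |uce-1001
# TTAA
```

Sequence lines do not match the address, so sed prints them unchanged.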

Upvotes: 3
