Reputation: 61
I have a file Pseudo.fasta that looks like this
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438
AATCGCAATTTGCCCAAA
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_03705
GATCCTTAACGGA
>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471
AGGCCTTAAACCTT
and another file, file.txt, with two columns: the first column partially matches the header in the .fasta file, and the second column holds the replacement name. It looks like this:
Pseudomonas_brassicacearum_51MFCVI2.1 JW5VryPcbM
Pseudomonas_brassicacearum_51MFCVI2.1 JW5VryPcbM
Pseudomonas_brassicacearum_PP1_210F nxUvzhi39L
Basically, I want to replace each header with the second column of the second file when the first column of that file matches the first part of the fasta header.
The desired output should look like this:
>JW5VryPcbM_1
AATCGCAATTTGCCCAAA
>JW5VryPcbM_2
GATCCTTAACGGA
>nxUvzhi39L_1
AGGCCTTAAACCTT
I was trying to do that with awk:
awk -F "\t" 'FNR==NR {f2[$1]=$2;next} $2 in f2 {$2=f2[$2]}1' file.txt FS='>' OFS='>' Pseudo.fasta
but this solution only works if the strings to match are exactly the same.
After this, I would apply this awk line to add a number in case of duplicate headers:
awk '{print $0 (/^>/ ? "_" (++c[$1]) : "")}' Pseudo.fasta
It would also be nice to pipe this last command directly after the previous one. Any suggestions? Thanks!
Upvotes: 2
Views: 983
Reputation: 34254
Other fasta-related questions occasionally show other text on the header line; modifying the 1st line of OP's sample input to demonstrate:
$ cat Pseudo.fasta
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438 keep the rest of this text
AATCGCAATTTGCCCAAA
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_03705
GATCCTTAACGGA
>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471
AGGCCTTAAACCTT
If the objective is to replace just the 1st (space delimited) field in the header record, while leaving other text in place, one awk idea:
awk '
FNR==NR { a[$1]=$2; next }
/^>/    { gene=substr($1,2)
          for (i in a)
              if (gene ~ i) { $1=">" a[i] "_" ++cnt[i]; break }
        }
1
' replacements.txt Pseudo.fasta
NOTE: if the header record has fields delimited by something other than white space (e.g., pipes, semicolons), a few small edits would let this code work with a different delimiter
This generates:
>JW5VryPcbM_1 keep the rest of this text
AATCGCAATTTGCCCAAA
>JW5VryPcbM_2
GATCCTTAACGGA
>nxUvzhi39L_1
AGGCCTTAAACCTT
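The NOTE above about other delimiters could be sketched as follows. This is only an illustration under assumptions: the pipe-delimited headers and the file name Pseudo_pipes.fasta are made up for the demo and are not part of OP's data. The only change to the awk program's invocation is that FS and OFS are set to a pipe just before the fasta file is read, so the replacement file is still split on white space. The increment is also parenthesized as a precaution for awks that parse the concatenation differently.

```shell
# Scratch copies of the inputs; the pipe-delimited headers are hypothetical.
tmpdir=$(mktemp -d)
cat > "$tmpdir/replacements.txt" <<'EOF'
Pseudomonas_brassicacearum_51MFCVI2.1 JW5VryPcbM
Pseudomonas_brassicacearum_51MFCVI2.1 JW5VryPcbM
Pseudomonas_brassicacearum_PP1_210F nxUvzhi39L
EOF
cat > "$tmpdir/Pseudo_pipes.fasta" <<'EOF'
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438|keep the rest of this text
AATCGCAATTTGCCCAAA
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_03705
GATCCTTAACGGA
>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471|strain=PP1
AGGCCTTAAACCTT
EOF

# FS/OFS are assigned between the two file arguments, so they only
# apply to the fasta file. "[|]" is a bracket expression, i.e. a literal pipe.
result=$(awk '
FNR==NR { a[$1]=$2; next }           # replacement file: default FS
/^>/    { gene=substr($1,2)          # 1st pipe-delimited field, minus ">"
          for (i in a)
              if (gene ~ i) { $1=">" a[i] "_" (++cnt[i]); break }
        }
1
' "$tmpdir/replacements.txt" FS='[|]' OFS='|' "$tmpdir/Pseudo_pipes.fasta")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With this input the first header should come out as >JW5VryPcbM_1|keep the rest of this text, because assigning to $1 rebuilds the record with OFS.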
Upvotes: 2
Reputation: 785108
You may use this awk:
awk '
NR == FNR {
    map[">" $1] = $2
    next
}
sub(/(_[^_]+){2}$/, "") && $0 in map {
    $0 = ">" map[$0] "_" ++freq[map[$0]]
} 1' file.txt Pseudo.fasta
Output:
>JW5VryPcbM_1
AATCGCAATTTGCCCAAA
>JW5VryPcbM_2
GATCCTTAACGGA
>nxUvzhi39L_1
AGGCCTTAAACCTT
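The approach above relies on every header ending in exactly two underscore-separated suffix fields, which sub() strips before the exact lookup in map. A minimal sketch of that stripping step in isolation, using one header from OP's sample; here the pattern is written out twice instead of using the {2} interval, on the assumption that some awk builds may not support interval expressions:

```shell
# Strip the last two "_field" groups from a header, as the sub() call does.
header='>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471'
stripped=$(printf '%s\n' "$header" | awk '{ sub(/_[^_]+_[^_]+$/, ""); print }')
printf '%s\n' "$stripped"
# prints: >Pseudomonas_brassicacearum_PP1_210F
```

If a header carries a different number of trailing fields, the stripped text will no longer equal a key in map and the header will be left without a replacement.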
Upvotes: 2
Reputation: 163287
In your code, $2 in f2 checks for an exact key, but you need a partial match instead.
To get a partial match, you can loop over the array f2 that you are already using and, as one variant, test each key with index() against the header, printing the replacement directly when a key is found.
Then use next to go to the next record.
If there is no match, the 1 after the closing } at the end prints the line unchanged.
awk '
FNR==NR {f2[$1]=$2;next}
/^>/ {
    for (i in f2) {
        if (index(substr($1,2), i)) {
            print ">"f2[i]; next
        }
    }
}1' file.txt Pseudo.fasta
Output
>JW5VryPcbM
AATCGCAATTTGCCCAAA
>JW5VryPcbM
GATCCTTAACGGA
>nxUvzhi39L
AGGCCTTAAACCTT
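The output above does not yet carry the _1, _2 suffixes from the question. One possible way to fold the numbering into the same loop is sketched below. It is a variant, not part of the answer's code: the counter array seen is a name introduced here, and the test is tightened to index(...) == 1 so that a key only counts when it is a prefix of the header.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file.txt" <<'EOF'
Pseudomonas_brassicacearum_51MFCVI2.1 JW5VryPcbM
Pseudomonas_brassicacearum_51MFCVI2.1 JW5VryPcbM
Pseudomonas_brassicacearum_PP1_210F nxUvzhi39L
EOF
cat > "$tmpdir/Pseudo.fasta" <<'EOF'
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438
AATCGCAATTTGCCCAAA
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_03705
GATCCTTAACGGA
>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471
AGGCCTTAAACCTT
EOF

result=$(awk '
FNR==NR { f2[$1]=$2; next }          # first file: prefix -> new name
/^>/ {
    id = substr($1, 2)               # header without the leading ">"
    for (i in f2) {
        # index() searches literally; == 1 means the key is a prefix
        if (index(id, i) == 1) {
            print ">" f2[i] "_" (++seen[f2[i]])
            next
        }
    }
}
1                                    # sequence lines and unmatched headers
' "$tmpdir/file.txt" "$tmpdir/Pseudo.fasta")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

On the sample data this should print the headers >JW5VryPcbM_1, >JW5VryPcbM_2 and >nxUvzhi39L_1, each followed by its sequence line. Counting per replacement name (rather than per key) keeps the numbering continuous if two different prefixes map to the same name.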
Upvotes: 1