Alberto Pascale

Reputation: 61

Replace headers in a FASTA file when they match entries in another file (file.txt)

I have a file Pseudo.fasta that looks like this:

>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438
AATCGCAATTTGCCCAAA
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_03705
GATCCTTAACGGA
>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471
AGGCCTTAAACCTT

and another file, file.txt, with two columns: the first column partially matches the headers of the .fasta file, and the second column holds the replacement name. It looks like this:

Pseudomonas_brassicacearum_51MFCVI2.1   JW5VryPcbM
Pseudomonas_brassicacearum_51MFCVI2.1   JW5VryPcbM
Pseudomonas_brassicacearum_PP1_210F     nxUvzhi39L

Basically, I want to replace each FASTA header with the second column of file.txt whenever the first column of file.txt matches the first part of that header.

The desired output should look like this:

>JW5VryPcbM_1
AATCGCAATTTGCCCAAA
>JW5VryPcbM_2
GATCCTTAACGGA
>nxUvzhi39L_1
AGGCCTTAAACCTT

I was trying to do that with awk:

awk -F "\t" 'FNR==NR {f2[$1]=$2;next} $2 in f2 {$2=f2[$2]}1' file.txt FS='>' OFS='>' Pseudo.fasta

but this solution only works if the strings to match are exactly the same.

After this, I would apply this awk line to add a number to the headers in case of duplicates:

awk '{print $0 (/^>/ ? "_" (++c[$1]) : "")}' Pseudo.fasta

It would also be nice to pipe the output of the first command directly into the second. Any suggestions? Thanks!

Upvotes: 2

Views: 983

Answers (3)

markp-fuso

Reputation: 34254

Other FASTA-related questions occasionally show additional text on the header line; modifying the 1st line of OP's sample input to demonstrate:

$ cat Pseudo.fasta
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438 keep the rest of this text
AATCGCAATTTGCCCAAA
>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_03705
GATCCTTAACGGA
>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471
AGGCCTTAAACCTT

If the objective is to replace just the 1st (space delimited) field in the header record, while leaving other text in place, one awk idea:

awk '
FNR==NR { a[$1]=$2; next }
/^>/    { gene=substr($1,2)
          for (i in a)
              if (gene ~ i) { $1=">" a[i] "_" ++cnt[i]; break }
        }
1
' replacements.txt Pseudo.fasta

NOTE: if the header record has fields delimited by something other than white space (e.g., pipes or semicolons), a few small edits would let this code work with that delimiter.

This generates:

>JW5VryPcbM_1 keep the rest of this text
AATCGCAATTTGCCCAAA
>JW5VryPcbM_2
GATCCTTAACGGA
>nxUvzhi39L_1
AGGCCTTAAACCTT
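
As a rough sketch of the delimiter note above, one option is to leave the default field separator in place for the replacements file and switch `FS`/`OFS` on the command line just before the FASTA file is read. The pipe-delimited sample data and the file name `piped.fasta` are made up for illustration. The sketch also swaps the regex test `gene ~ i` for `index(gene, i) == 1`, so that the `.` in a name like `51MFCVI2.1` is compared literally and only a match at the start of the header counts; that is a substitution, not part of the code above.

```shell
# Hypothetical sample data: header fields separated by "|" instead of spaces.
tmp=$(mktemp -d)
printf '%s\n' \
  '>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438|keep this' \
  'AATCGCAATTTGCCCAAA' \
  '>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471' \
  'AGGCCTTAAACCTT' > "$tmp/piped.fasta"

printf '%s\t%s\n' \
  'Pseudomonas_brassicacearum_51MFCVI2.1' 'JW5VryPcbM' \
  'Pseudomonas_brassicacearum_PP1_210F'   'nxUvzhi39L' > "$tmp/replacements.txt"

# The replacements file is split on default whitespace; FS and OFS are
# switched to "|" only for the FASTA file (command-line assignments are
# applied when awk reaches them in the argument list).
out=$(awk '
FNR==NR { a[$1]=$2; next }
/^>/    { gene=substr($1,2)
          for (i in a)
              if (index(gene, i) == 1) { $1=">" a[i] "_" (++cnt[i]); break }
        }
1
' "$tmp/replacements.txt" FS='|' OFS='|' "$tmp/piped.fasta")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Assigning to `$1` makes awk rebuild the record with `OFS`, which is why `OFS` is set to the same delimiter as `FS`.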

Upvotes: 2

anubhava

Reputation: 785108

You may use this awk:

awk '
NR == FNR {
   map[">" $1] = $2
   next
}
sub(/(_[^_]+){2}$/, "") && $0 in map {
   $0 = ">" map[$0] "_" ++freq[map[$0]]
} 1' file.txt Pseudo.fasta

Output:

>JW5VryPcbM_1
AATCGCAATTTGCCCAAA
>JW5VryPcbM_2
GATCCTTAACGGA
>nxUvzhi39L_1
AGGCCTTAAACCTT
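
The `sub()` call above strips the last two underscore-separated pieces from each header (for example `_ABFDHLDI_02438`) so that what remains can be looked up in `map`. Two caveats seem worth hedging: the `{2}` interval expression may not be supported by some awk implementations (older mawk releases, for example), and because `sub()` edits `$0` in place, a header with no entry in `map` would still be printed in its shortened form. The sketch below is one possible variation, not the answer's own code: it spells the repetition out and works on a copy of the line. The `>Unknown_species_XYZ_00001` record is hypothetical, added only to show an unmatched header passing through unchanged.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438' \
  'AATCGCAATTTGCCCAAA' \
  '>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_03705' \
  'GATCCTTAACGGA' \
  '>Unknown_species_XYZ_00001' \
  'ACGT' > "$tmp/Pseudo.fasta"

printf '%s\t%s\n' \
  'Pseudomonas_brassicacearum_51MFCVI2.1' 'JW5VryPcbM' > "$tmp/file.txt"

out=$(awk '
NR == FNR { map[">" $1] = $2; next }
/^>/ {
   key = $0                              # work on a copy, leave $0 intact
   # "_x_y" at the end of the line, written without the {2} interval
   if (sub(/_[^_]+_[^_]+$/, "", key) && (key in map))
      $0 = ">" map[key] "_" (++freq[map[key]])
}
1' "$tmp/file.txt" "$tmp/Pseudo.fasta")

printf '%s\n' "$out"
rm -rf "$tmp"
```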

Upvotes: 2

The fourth bird

Reputation: 163287

In your code you use $2 in f2, which checks for an exact key, but you need a partial match instead.

To get a partial match, you can loop over the keys of the f2 array and test each one against the header, for example with index(), printing the replacement directly as soon as a key matches.

Then use next to go to the next record.

If there is no match, the 1 in }1 at the end will print the line by default.

awk '
FNR==NR {f2[$1]=$2;next}
/^>/ {
  for (i in f2) {
    if (index(substr($1,2), i)) {
      print ">"f2[i]; next
    }
  }
}1' file.txt Pseudo.fasta 

Output

>JW5VryPcbM
AATCGCAATTTGCCCAAA
>JW5VryPcbM
GATCCTTAACGGA
>nxUvzhi39L
AGGCCTTAAACCTT
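
The output above does not yet carry the `_1`, `_2` suffixes from the question. A minimal sketch of how the same loop could add them in a single pass, so that no second awk command or pipe is needed, is shown below. The counter array name `seen` is an assumption introduced here, and the test is tightened to `index(...) == 1` so that a key only matches at the start of the header.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  '>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_02438' \
  'AATCGCAATTTGCCCAAA' \
  '>Pseudomonas_brassicacearum_51MFCVI2.1_ABFDHLDI_03705' \
  'GATCCTTAACGGA' \
  '>Pseudomonas_brassicacearum_PP1_210F_EGEGDKLG_01471' \
  'AGGCCTTAAACCTT' > "$tmp/Pseudo.fasta"

printf '%s\t%s\n' \
  'Pseudomonas_brassicacearum_51MFCVI2.1' 'JW5VryPcbM' \
  'Pseudomonas_brassicacearum_51MFCVI2.1' 'JW5VryPcbM' \
  'Pseudomonas_brassicacearum_PP1_210F'   'nxUvzhi39L' > "$tmp/file.txt"

out=$(awk '
FNR==NR { f2[$1]=$2; next }
/^>/ {
  for (i in f2) {
    # key must be a prefix of the header text after the ">"
    if (index(substr($1,2), i) == 1) {
      # seen[] counts how often each replacement name has been printed
      print ">" f2[i] "_" (++seen[f2[i]]); next
    }
  }
}1' "$tmp/file.txt" "$tmp/Pseudo.fasta")

printf '%s\n' "$out"
rm -rf "$tmp"
```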

Upvotes: 1
