jelfman
jelfman

Reputation: 23

Replacing numerical values in a FASTA file with their index in a different file. (Bash preferred)

I have a folder full of fasta files with the following format (and so on), where the line beginning with > is the read name of DNA sequence, and the following line is the sequence itself. This pattern repeats for the entire file:

> 887_ENCFF899MTI.fastq.gz_seq1
GGCCCGCCTCCCGTCGGCCGGTGCGAGCGGCTCCGCGA
> 55_ENCFF899MTI.fastq.gz_seq2
GGGGGGGGCGTCTCGCGCAAACGTCCATAAC
> ...
...

In the read names, [887] corresponds to the index of a query sequence I used to find this read, stored in a different file (e.g. SequenceNames.txt). The other file can be assumed to have this format:

SequenceA
SequenceB
...

I want to replace only the number between > and _ (avoiding incidental matches with the filename) with the Sequence matching the index of that number from the SequenceNames file. For example, I would want

> 1_ENCFF899MTI.fastq.gz_seq1
ACTATC
> 2_ENCFF899MTI.fastq.gz_seq1

to become

> SequenceA_ENCFF899MTI.fastq.gz_seq1
> SequenceB_ENCFF899MTI.fastq.gz_seq1

I am able to make these replacements generally, but I'm really unsure of how to direct the index replacement specifically to the location/regex match between > and _ without performing a file-wide dictionary replacement of these numbers, and I'm struggling with awk array indexing to get something like

gawk '{print gensub(/^> ([0-9]*)_/,array[pattern],"\\1")}'

to produce what I'm looking for.

Upvotes: 2

Views: 106

Answers (3)

RavinderSingh13
RavinderSingh13

Reputation: 133770

In GNU awk with your shown samples please try following awk code.

awk '
FNR==NR{
  seqVal[FNR]=$0
  next
}
/^>/ && match($0,/(^> )[^_]*(_.*$)/,arr){
  print arr[1] seqVal[++count] arr[2]
}
' sequencenames.txt input.fasta

Upvotes: 2

jhnc
jhnc

Reputation: 16868

Assuming the desire is to replace the number in a line of the fasta file that starts > 887_ with content of the 887th line of SequenceNames.txt, then:

awk '
    # create lookup table
    NR==FNR { i2p[FNR] = $0; next }

    # try to change relevant lines
    $1==">" {

        # extract index
        idx = $2
        sub( /_.+/, "", idx )

        # try to replace (no change if no match)
        if (idx in i2p)
            sub( /^[^_]+/, "> "i2p[idx] )
    }

    # print all lines
    1
' SequenceNames.txt input.fasta >output.fasta

Upvotes: 1

Dave Pritlove
Dave Pritlove

Reputation: 2687

Using gawk:

awk 'NR==FNR{ar["> "NR"_"]=$0} 
NR>FNR{match($0,/^> [0-9]+_/,m); gsub(/^> [0-9]+_/, "> " ar[m[0]]"_", $0);print} ' SequenceNames.txt matches.fasta

The NR==FNR block collects line data from the sequence name file in an array indexed with a string built from ">", the line number, and the trailing "_" character.

The NR>FNR block stores the string matched to a regex requiring the line-start ">" followed by a space, a number, and the underscore in array m. Gsub is then used to replace the match with the corresponding value held in the sequence name array.

Tested using GNU Awk 5.1.0

Upvotes: 2

Related Questions