sian
sian

Reputation: 77

Extract positions 2-7 from a fasta sequence for each gene using Bash

I've got a file with a subset of geneIDs in it, and a fasta file with all geneIDs and their sequences. For each gene in the subset file, I want to get positions 2-7 from the start of each fasta sequence. Ideally the output file would be 'pos 2-7' '\t' 'geneID'.

Example subset:

mmu-let-7g-5p MIMAT0000121  
mmu-let-7i-5p MIMAT0000122 

Fasta file:

>mmu-let-7g-5p MIMAT0000121 
UGAGGUAGUAGUUUGUACAGUU
>mmu-let-7i-5p MIMAT0000122 
UGAGGUAGUAGUUUGUGCUGUU
>mmu-let-7f-5p MIMAT0000525 
UGAGGUAGUAGAUUGUAUAGUU

wanted output:

GAGGUA   mmu-let-7g-5p MIMAT0000121
GAGGUA   mmu-let-7i-5p MIMAT0000122

The first part (pulling out fasta sequences for the subset of genes) I've done using grep -w -A 1 -f. Not sure how to get pos 2-7 and make the output look like that now using Bash.

Upvotes: 2

Views: 1028

Answers (3)

Sundeep
Sundeep

Reputation: 23667

Tested with GNU awk, but I think it will work with any awk:

$ awk 'NR==FNR{a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='\n' OFS='\t' fasta.txt
GAGGUA  mmu-let-7g-5p MIMAT0000121
GAGGUA  mmu-let-7i-5p MIMAT0000122
  • NR==FNR{a[$0]; next} build array with each line content as key for the first file passed to awk
  • RS='>' FS='\n' OFS='\t' these will set input record separator to >, input field separator to newline and output field separator to tab only for the second file (since these variables are assigned after the first filename)
  • $1 in a{print substr($2, 2, 7), $1} if first field exists as a key in array a, print the required details

If there can be trailing whitespace characters at the end of lines, use:

$ awk 'NR==FNR{sub(/[[:space:]]+$/, ""); a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='[[:space:]]*\n' OFS='\t' fasta.txt

Upvotes: 6

James Brown
James Brown

Reputation: 37404

Another awk:

$ awk '
{
    gsub(/ +$/,"")                 # clean trailing space from sample data 
}
NR==FNR {                          # process subset file as it is smaller
    a[$0]                          # hash keys
    next                        
}                                  # process fasta file
/^>/ && ((p=substr($0,2)) in a) {  # if string found in hash
    if(getline>0)                  # read next record
        print substr($0,2,6),p     # and print
}' subset fasta

Output:

GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122

Upvotes: 2

RavinderSingh13
RavinderSingh13

Reputation: 133458

Could you please try following, written and tested with shown samples only in GNU awk.

awk '
FNR==NR{
  a[$1]=$2
  next
}
/^>/{
  ind=substr($1,2)
}
/^>/ && (ind in a){
  found=1
  val=ind OFS a[ind]
  next
}
found{
  print substr($0,2,6) OFS val
  val=found=""
}
' gene fastafile

Explanation: Adding detailed explanation for above.

awk '                               ##Starting awk program from here.
FNR==NR{                            ##Checking condition FNR==NR which will be TRUE when gene Input_file is being read.
  a[$1]=$2                          ##Creating array a with index of $1 and value of $2 here.
  next                              ##next will skip all further statements from here.
}
/^>/{                               ##Checking condition if line starts from > then do following.
  ind=substr($1,2)                  ##Creating ind which has substring from 2nd charcters to all values of first field.
}
/^>/ && (ind in a){                 ##Checking if line starts with > and ind is present in array a then do following.
  found=1                           ##Setting found to 1 here.
  val=ind OFS a[ind]                ##Creating val which has ind OFS and value of a with index of ind.
  next                              ##next will skip all further statements from here.
}
found{                              ##Checking condition if found is NOT NULL then do following.
  print substr($0,2,6) OFS val      ##Printing sub string from 2nd to 7th character OFS and val here.
  val=found=""                      ##Nullifying val and found here.
}
' gene fastafile                    ##Mentioning Input_file names here.

Upvotes: 5

Related Questions