Reputation: 77
I've got a file with a subset of geneIDs in it, and a fasta file with all geneIDs and their sequences. For each gene in the subset file, I want to get positions 2-7 from the start of each fasta sequence. Ideally the output file would be 'pos 2-7' '\t' 'geneID'.
Example subset:
mmu-let-7g-5p MIMAT0000121
mmu-let-7i-5p MIMAT0000122
Fasta file:
>mmu-let-7g-5p MIMAT0000121
UGAGGUAGUAGUUUGUACAGUU
>mmu-let-7i-5p MIMAT0000122
UGAGGUAGUAGUUUGUGCUGUU
>mmu-let-7f-5p MIMAT0000525
UGAGGUAGUAGAUUGUAUAGUU
wanted output:
GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122
The first part (pulling out fasta sequences for the subset of genes) I've done using grep -w -A 1 -f
. Not sure how to get pos 2-7 and make the output look like that now using Bash.
Upvotes: 2
Views: 1028
Reputation: 23667
Tested with GNU awk
, but I think it will work with any awk
:
$ awk 'NR==FNR{a[$0]; next}
$1 in a{print substr($2, 2, 6), $1}
' gene.txt RS='>' FS='\n' OFS='\t' fasta.txt
GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122
NR==FNR{a[$0]; next}
build array with each line content as key for the first file passed to awk
RS='>' FS='\n' OFS='\t'
these will set input record separator to >
, input field separator to newline and output field separator to tab only for the second file (since these variables are assigned after the first filename)$1 in a{print substr($2, 2, 7), $1}
if first field exists as a key in array a
, print the required detailsIf there can be trailing whitespace characters at the end of lines, use:
$ awk 'NR==FNR{sub(/[[:space:]]+$/, ""); a[$0]; next}
$1 in a{print substr($2, 2, 6), $1}
' gene.txt RS='>' FS='[[:space:]]*\n' OFS='\t' fasta.txt
Upvotes: 6
Reputation: 37404
Another awk:
$ awk '
{
gsub(/ +$/,"") # clean trailing space from sample data
}
NR==FNR { # process subset file as it is smaller
a[$0] # hash keys
next
} # process fasta file
/^>/ && ((p=substr($0,2)) in a) { # if string found in hash
if(getline>0) # read next record
print substr($0,2,6),p # and print
}' subset fasta
Output:
GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122
Upvotes: 2
Reputation: 133458
Could you please try following, written and tested with shown samples only in GNU awk
.
awk '
FNR==NR{
a[$1]=$2
next
}
/^>/{
ind=substr($1,2)
}
/^>/ && (ind in a){
found=1
val=ind OFS a[ind]
next
}
found{
print substr($0,2,6) OFS val
val=found=""
}
' gene fastafile
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when gene Input_file is being read.
a[$1]=$2 ##Creating array a with index of $1 and value of $2 here.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
ind=substr($1,2) ##Creating ind which has substring from 2nd charcters to all values of first field.
}
/^>/ && (ind in a){ ##Checking if line starts with > and ind is present in array a then do following.
found=1 ##Setting found to 1 here.
val=ind OFS a[ind] ##Creating val which has ind OFS and value of a with index of ind.
next ##next will skip all further statements from here.
}
found{ ##Checking condition if found is NOT NULL then do following.
print substr($0,2,6) OFS val ##Printing sub string from 2nd to 7th character OFS and val here.
val=found="" ##Nullifying val and found here.
}
' gene fastafile ##Mentioning Input_file names here.
Upvotes: 5