Reputation: 101
i have a fastq file with more than 100 million reads in it and a genome sequence of 10000 in length
i want to take out the sequences from the fastq file and search in the genome sequence with allowing 3 mismatches
I tried in this way using awk i got the sequences from fastq file:
1.fq(few lines)
@DH1DQQN1:269:C1UKCACXX:1:1101:1207:2171 1:N:0:TTAGGC NATCCCCATCCTCTGCTTGCTTTTCGGGATATGTTGTAGGATTCTCAGC
+
1=ADBDDHD;F>GF@FFEFGGGIAEEI?D9DDHHIGAAF:BG39?BB
@DH1DQQN1:269:C1UKCACXX:1:1101:1095:2217 1:N:0:TTAGGC TAGGATTTCAAATGGGTCGAGGTGGTCCGTTAGGTATAGGGGCAACAGG
+
??AABDD4C:DDDI+C:C3@:C):1?*):?)?################
$ awk 'NR%4==2' 1.fq
NATCCCCATCCTCTGCTTGCTTTTCGGGATATGTTGTAGGATTCTCAGC TAGGATTTCAAATGGGTCGAGGTGGTCCGTTAGGTATAGGGGCAACAGG
i have all the sequences in file,now i want to take each line of sequence and search in genome sequence with allowing 3 mismatches and if it finds print the sequences
example:
genome sequence file:
GGGGAGGAATATGATTTACAGTTTATTTTTCAACTGTGCAAAATAACCTTAACTGCAGACGTTATGACATACATACATTCTATGAATTCCACTATTTTGGAGGACTGGAATTTTGGTCTACAACCTCCCCCAGGAGGCACACTAGAAGATACTTATAGGTTTGTAACCCAGGCAATTGCTTGTCAAAAACATACA
search sequence file:
GGGGAGGAATATGAT
GGGGAGGAATATGAA
GGGGAGGAATATGCC
TCAAAAACATAGG
TCAAAAACATGGG
OUTPUT FILE:
GGGGAGGAATATGAT 0 # 0 mismatch exact sequence
GGGGAGGAATATGAA 1 # 1 mismatch
GGGGAGGAATATGCC 2 # 2 mismatch
TCAAAAACATAGG 2 # 2 mismatch
TCAAAAACATGGG 3 # 3 mismatch
Upvotes: 1
Views: 1189
Reputation: 870
I think you should consider using an alignment tool designed for this data for a couple of reasons:
For those reasons, and others, any script you come up with will likely not be near as fast and complete as the tools that already exist. If you want to specify the number of mismatches to keep, instead of aligning all your reads and then parsing the output, you could use Vmatch if you have access to it (this tool is very fast and good for many matching tasks).
Upvotes: 0
Reputation: 63902
something like?
use 5.012;
use strict;
use warnings;
use String::Approx qw(aslice);
use File::Slurp;
use Data::Dumper;
my $genseq = "gseq.txt"; #the long sequence
$_ = read_file($genseq);
#read small patterns from stdin
while(my $patt = <>) {
chomp $patt;
my $len = length($patt);
my($index, $size, $distance) = aslice($patt, ["3D0S3", "minimal_distance"]);
say "$patt matched approx. at $index with mismatch $distance" if $distance <= 3;
}
for you input produces:
GGGGAGGAATATGAT matched approx. at 0 with mismatch 0
GGGGAGGAATATGAA matched approx. at 0 with mismatch 1
GGGGAGGAATATGCC matched approx. at 0 with mismatch 2
TCAAAAACATAGG matched approx. at 179 with mismatch 2
TCAAAAACATGGG matched approx. at 179 with mismatch 3
Honestly, haven't idea how will work with an 10000 chars long genseq...
Upvotes: 2