abh
abh

Reputation: 101

search sequence in genome with mismatches

i have a fastq file with more than 100 million reads in it and a genome sequence of 10000 in length

i want to take out the sequences from the fastq file and search in the genome sequence with allowing 3 mismatches

I tried in this way using awk i got the sequences from fastq file:

1.fq(few lines)

@DH1DQQN1:269:C1UKCACXX:1:1101:1207:2171 1:N:0:TTAGGC NATCCCCATCCTCTGCTTGCTTTTCGGGATATGTTGTAGGATTCTCAGC

+

1=ADBDDHD;F>GF@FFEFGGGIAEEI?D9DDHHIGAAF:BG39?BB

@DH1DQQN1:269:C1UKCACXX:1:1101:1095:2217 1:N:0:TTAGGC TAGGATTTCAAATGGGTCGAGGTGGTCCGTTAGGTATAGGGGCAACAGG

+

??AABDD4C:DDDI+C:C3@:C):1?*):?)?################

$ awk 'NR%4==2' 1.fq

NATCCCCATCCTCTGCTTGCTTTTCGGGATATGTTGTAGGATTCTCAGC TAGGATTTCAAATGGGTCGAGGTGGTCCGTTAGGTATAGGGGCAACAGG

i have all the sequences in file,now i want to take each line of sequence and search in genome sequence with allowing 3 mismatches and if it finds print the sequences

example:

genome sequence file:

GGGGAGGAATATGATTTACAGTTTATTTTTCAACTGTGCAAAATAACCTTAACTGCAGACGTTATGACATACATACATTCTATGAATTCCACTATTTTGGAGGACTGGAATTTTGGTCTACAACCTCCCCCAGGAGGCACACTAGAAGATACTTATAGGTTTGTAACCCAGGCAATTGCTTGTCAAAAACATACA

search sequence file:

GGGGAGGAATATGAT

GGGGAGGAATATGAA

GGGGAGGAATATGCC

TCAAAAACATAGG

TCAAAAACATGGG

OUTPUT FILE:

GGGGAGGAATATGAT 0 # 0 mismatch exact sequence

GGGGAGGAATATGAA 1 # 1 mismatch

GGGGAGGAATATGCC 2 # 2 mismatch

TCAAAAACATAGG 2 # 2 mismatch

TCAAAAACATGGG 3 # 3 mismatch

Upvotes: 1

Views: 1189

Answers (2)

SES
SES

Reputation: 870

I think you should consider using an alignment tool designed for this data for a couple of reasons:

  • Those tools will also find reverse complemented matches (though, you could also implement this).
  • Aligners will properly handle paired-end reads and multiple matches.
  • Most aligners are written in C and use data structures and algorithms designed for this amount of data.

For those reasons, and others, any script you come up with will likely not be near as fast and complete as the tools that already exist. If you want to specify the number of mismatches to keep, instead of aligning all your reads and then parsing the output, you could use Vmatch if you have access to it (this tool is very fast and good for many matching tasks).

Upvotes: 0

clt60
clt60

Reputation: 63902

something like?

use 5.012;
use strict;
use warnings;
use String::Approx qw(aslice);
use File::Slurp;
use Data::Dumper;

my $genseq = "gseq.txt"; #the long sequence

$_ = read_file($genseq);

#read small patterns from stdin
while(my $patt = <>) {
    chomp $patt;
    my $len = length($patt);
    my($index, $size, $distance) = aslice($patt, ["3D0S3", "minimal_distance"]);
    say "$patt matched approx. at $index with mismatch $distance" if $distance <= 3;
}

for you input produces:

GGGGAGGAATATGAT matched approx. at 0 with mismatch 0
GGGGAGGAATATGAA matched approx. at 0 with mismatch 1
GGGGAGGAATATGCC matched approx. at 0 with mismatch 2
TCAAAAACATAGG matched approx. at 179 with mismatch 2
TCAAAAACATGGG matched approx. at 179 with mismatch 3

Honestly, haven't idea how will work with an 10000 chars long genseq...

Upvotes: 2

Related Questions