Reputation: 101
I have 36-nt reads in a FASTQ file that look like this: atcttgttcaatggccgatcXXXXgtcgacaatcaa
where XXXX is one of several different barcodes. I want to search the file for a given barcode at an exact position (21 to 24) and print the reads that have up to 3 mismatches in the surrounding sequence, with no mismatches allowed in the barcode itself.
For example, given the barcode aacg, I want to find it at positions 21 to 24 of each read while allowing up to 3 mismatches in the rest of the sequence, like this:
atcttgttcaatggccgatcaacggtcgacaatcac # it has 1 mismatch
ttcttgttcaatggccgatcaacggtcgacaatcac # it has 2 mismatches
tccttgttcaatggccgatcaacggtcgacaatcac # it has 3 mismatches
I tried listing the unique reads first with awk and then looking for mismatches, but checking them by eye is very tedious.
awk 'NR%4==2' 1.fq |sort|uniq -c|awk '{print $1"\t"$2}' > out1.txt
Is there a quicker way to find them?
Thank you.
Upvotes: 0
Views: 1825
Reputation: 1217
The third-party Python regex module supports fuzzy matching, which lets you specify the number of mismatches allowed:
import collections

import regex  # third-party module, intended as a replacement for re
from Bio import SeqIO

infile = "1.fq"  # path to the FASTQ file (file name taken from the question)
d = collections.defaultdict(list)
# {s<=3} = at most 3 substitutions; {e<4} would also count insertions and deletions
motif = r'((atcttgttcaatggccgatc)(....)(gtcgacaatcaa)){s<=3}'
for record in SeqIO.parse(infile, "fastq"):
    seq = str(record.seq)
    match = regex.search(motif, seq, regex.BESTMATCH)
    if match is None:  # no match within the allowed number of mismatches
        continue
    barcode = match.group(3)
    sequence = match.group(0)
    d[barcode].append(sequence)  # dictionary: key = barcode, value = list of sequences
for k, v in d.items():
    print("barcode = %s" % k)
    for i in v:
        print("sequence = %s" % i)
With these capture groups, group 3 holds the barcode; counting the whole match as group 0, it is the fourth group.
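If installing the third-party regex module is not an option, a similar grouping can be sketched with the standard library alone. This assumes that mismatches are substitutions only (no insertions or deletions), so the barcode always sits at positions 21 to 24. The names group_by_barcode, LEFT and RIGHT are hypothetical, and the two sample reads are made up for illustration:

```python
import collections

LEFT = "atcttgttcaatggccgatc"  # 20-nt flank before the barcode
RIGHT = "gtcgacaatcaa"         # 12-nt flank after the barcode


def group_by_barcode(reads, max_mismatches=3):
    """Map each 4-nt barcode to the reads whose flanks have few mismatches."""
    groups = collections.defaultdict(list)
    for read in reads:
        if len(read) != len(LEFT) + 4 + len(RIGHT):
            continue  # wrong length; substitutions alone cannot explain it
        left, barcode, right = read[:20], read[20:24], read[24:]
        # Hamming distance over the flanks only; the barcode is free to vary.
        mismatches = sum(a != b for a, b in zip(left + right, LEFT + RIGHT))
        if mismatches <= max_mismatches:
            groups[barcode].append(read)
    return groups


reads = [
    "atcttgttcaatggccgatcaacggtcgacaatcac",  # barcode aacg, 1 mismatch
    "ttcttgttcaatggccgatcttgagtcgacaatcaa",  # barcode ttga, 1 mismatch
]
for barcode, members in group_by_barcode(reads).items():
    print("barcode = %s" % barcode)
    for read in members:
        print("sequence = %s" % read)
```

The reads passed in are plain sequence strings, for example str(record.seq) for each record returned by SeqIO.parse.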
Upvotes: 0
Reputation: 250931
Using Python:
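strs = "atcttgttcaatggccgatcaacggtcgacaatcaa"  # expected read with barcode aacg
with open("1.fq") as f:
    for line in f:
        if line[20:24] == "aacg":  # barcode at positions 21 to 24
            line = line.strip()
            # count the positions that differ from the expected read
            mismatches = sum(x != y for x, y in zip(strs, line))
            if mismatches <= 3:
                print(line, mismatches)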
Output:
atcttgttcaatggccgatcaacggtcgacaatcac 1
ttcttgttcaatggccgatcaacggtcgacaatcac 2
tccttgttcaatggccgatcaacggtcgacaatcac 3
Upvotes: 1
Reputation: 97948
Using Python:
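import re

seq = "atcttgttcaatggccgatcaacggtcgacaatcaa"  # expected read with barcode aacg
with open("input") as f:
    for line in f:
        line = line.rstrip('\n')
        # the barcode must follow exactly 20 characters (positions 21 to 24)
        if re.match(".{20}aacg", line):
            # count the positions that differ from the expected read
            cnt = sum(1 for c, d in zip(line, seq) if c != d)
            if cnt < 4:
                print(cnt, line)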
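The loop above tests every line of the file, including the header and quality lines. Here is a minimal sketch, assuming standard four-line FASTQ records (the same assumption as the NR%4==2 filter in the question), that looks only at the sequence lines and counts mismatches outside the barcode. The function name matching_reads, its parameters and the demo records are hypothetical:

```python
from itertools import islice

REFERENCE = "atcttgttcaatggccgatcaacggtcgacaatcaa"  # read layout from the question


def matching_reads(lines, barcode="aacg", start=20, max_mismatches=3):
    """Yield (read, mismatches) for reads carrying `barcode` at `start`.

    `lines` is any iterable of FASTQ lines; only line 2 of each
    four-line record (the sequence) is examined.
    """
    end = start + len(barcode)
    for line in islice(lines, 1, None, 4):
        read = line.strip()
        if len(read) != len(REFERENCE) or read[start:end] != barcode:
            continue
        # Compare the flanks only; the barcode itself must match exactly.
        flanks = zip(read[:start] + read[end:],
                     REFERENCE[:start] + REFERENCE[end:])
        mismatches = sum(a != b for a, b in flanks)
        if mismatches <= max_mismatches:
            yield read, mismatches


demo = [
    "@read1", "atcttgttcaatggccgatcaacggtcgacaatcac", "+", "I" * 36,
    "@read2", "atcttgttcaatggccgatcttttgtcgacaatcaa", "+", "I" * 36,
]
for read, n in matching_reads(demo):
    print(read, n)  # prints: atcttgttcaatggccgatcaacggtcgacaatcac 1
```

For a real file, pass the open file handle instead of the demo list, since a file object is also an iterable of lines.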
Upvotes: 0