I have a large set of genomic strings; how can I efficiently find the matching records?

Question

Supplemental Data Link:

Candidates: https://pan.baidu.com/s/1nvGWbrV
Bg_db: https://pan.baidu.com/s/1sllFLAd

Each string in these files is 23 characters long; only the first 20 characters of each line are needed. Because the data is too large, I uploaded only one-fifth of it. If you have a fast solution, please paste the code so I can study it. Thank you!

Here's the formal question:

I have two arrays of strings, tentatively named Candidates and Bg_db. Every element is a short string of length 20 and contains only the four characters A, T, C, and G (yes, these are genome sequences):

Candidates = [
    'GGGAGCAGGCAAGGACTCTG',
    'GCTCGGGCTTGTCCACAGGA',
    '...',
    # As you can see, these are actually fragments of human genes
]

Bg_db = [
    'CTGCTGACGGGTGACACCCA',
    'AGGAACTGGTGCTTGATGGC',
    '...',
    # This one is much larger: about one billion records
]

My task is, for each candidate in Candidates, to find all records in Bg_db that differ from it at 4 or fewer positions. For example:

# The top line is the candidate, i.e. one record from Candidates
# In the middle line, | means the characters match and * means they differ
# The bottom line is a record from Bg_db

A T C G A T C G A T C G A T C G A T C G
| | | | | | | | | | | | | | | | | | | |    The difference is 0
A T C G A T C G A T C G A T C G A T C G

A T C G A T C G A T C G A T C G A T C G
* | | | | | | | | | | | | | | | | | | |    The difference is 1
T T C G A T C G A T C G A T C G A T C G

A T C G A T C G A T C G A T C G A T C G
* | | | * | | | | | | | | | | | | | | |    The difference is 2
T T C G T T C G A T C G A T C G A T C G

A T C G A T C G A T C G A T C G A T C G
* | | | * | | | | | | * | | | | | | | |    The difference is 3
T T C G T T C G A T C C A T C G A T C G

A T C G A T C G A T C G A T C G A T C G
* | | | * | | | | | | * | | | * | | | |    The difference is 4
T T C G T T C G A T C C A T C A A T C G
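
The difference counted in these examples is the number of positions at which two equal-length strings disagree (the Hamming distance). A minimal sketch that recomputes the counts, assuming the candidate is ATCG repeated five times and taking the Bg_db records from the examples; the helper name `hamming` is only illustrative:

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


candidate = "ATCG" * 5  # the 20-character candidate used in the examples
records = [
    "ATCGATCGATCGATCGATCG",
    "TTCGATCGATCGATCGATCG",
    "TTCGTTCGATCGATCGATCG",
    "TTCGTTCGATCCATCGATCG",
    "TTCGTTCGATCCATCAATCG",
]

distances = [hamming(candidate, record) for record in records]
print(distances)  # [0, 1, 2, 3, 4]
```

All five records would therefore count as hits, since each differs from the candidate at 4 or fewer positions.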

My question is how to quickly find, for every candidate, all records in Bg_db that differ from it at 4 or fewer positions. With a brute-force traversal, taking Python as an example:

def align(candidate, record_from_bg_db):
    """Return True if the two 20-character strings differ at 4 or fewer positions."""
    mismatches = 0
    for i in range(20):
        if candidate[i] != record_from_bg_db[i]:
            mismatches += 1
            # Stop early once the limit of 4 differences is exceeded
            if mismatches > 4:
                return False
    return True

candidate = 'GGGAGCAGGCAAGGACTCTG'
record_from_bg_db = 'CTGCTGACGGGTGACACCCA'

align(candidate, record_from_bg_db)  # about 1.24 microseconds

# Total time:

10000000 * 1000000000 * 1.24 / 1000 / 1000 / 60 / 60 / 24 / 365
# = 393
# 10 million candidates, 1 billion bg_db records
# Takes about 393 years
# Completely unacceptable
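
The estimate can be reproduced step by step. This is only a sketch of the arithmetic, assuming the 10,000,000 candidates that appear in the expression and 1.24 microseconds per comparison; the variable names are illustrative:

```python
n_candidates = 10_000_000      # as written in the expression above
n_records = 1_000_000_000      # about one billion Bg_db records
seconds_per_compare = 1.24e-6  # 1.24 microseconds per pair

comparisons = n_candidates * n_records            # 10**16 pairs
total_seconds = comparisons * seconds_per_compare
years = total_seconds / (60 * 60 * 24 * 365)

print(comparisons)   # 10000000000000000
print(round(years))  # 393
```

The cost is dominated by the 10^16 pairwise comparisons, so speeding up a single comparison alone is unlikely to make the brute-force approach practical.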

My thinking is that Bg_db consists of highly regular strings (each character can take only four possible values). Is there an algorithm that lets the candidates be compared quickly against all of Bg_db? Any advice is appreciated.
