Reputation: 195
Supplemental Data Link:
Each string length is 23, as long as the front 20 characters on the line, because the data is too large, I only pass one-fifth of 1the , there are fast you can paste the code, let the brother read it, Thank you!
Here's the formal question:
I now have two string arrays, tentatively called Candidates
and Bg_db
, all of them are short strings of length 20, and each string only contains character among the following four: A, T, C, G (right! Is the genome sequence!):
Candidates = [
'GGGAGCAGGCAAGGACTCTG',
'GCTCGGGCTTGTCCACAGGA',
'...',
# Be you see, these fragments of human genes in fact
]
Bg_db = [
'CTGCTGACGGGTGACACCCA',
'AGGAACTGGTGCTTGATGGC',
'...',
# This more, there are about one billion
]
My task is to candidates for each candidate, to find all less than or equal Bg_db
4 differences in the record, for example:
# The above one for the candidate, that is, a record candidates
# Intermediate | represent the same, * represent not the same
# The following represents a record of Bg_db
A T C G A T C G A T C G A T C G A T C G
| | | | | | | |
A T C G A T C G A T C G A T C G A T C G
A T C G A T C G A T C G A T C G A T C G
* The difference is 1
T T C G A T C G A T C G A T C G A T C G
A T C G A T C G A T C G A T C G A T C G
* The difference is 2
T T C G T T C G A T C G A T C G A T C G
A T C G A T C G A T C G A T C G A T C G
* | | * | | | * | | | The difference is 3
T T C G T T C G A T C C A T C G A T C G
A T C G A T C G A T G G A T C G A T C G
* | | * | | | * | |
T T C G T T C G A T C C A T C A A T C G
My problem is if you quickly find: every candidate in Bg_db
with a difference of less than or equal to 4 of all records, if the use of violent traversal, then Python as an example:
def align (candidate, record_from_bg_db):
Mismatches = 0
For i in range (20):
If candidate [i]! = Record_from_bg_db [i]:
Mismatches + = 1
If mismatches> = 4:
Return False
Return True
Candidate = 'GGGAGCAGGCAAGGACTCTG'
Record_from_bg_db = 'CTGCTGACGGGTGACACCCA'
Align(candidate, record_from_bg_db) # 1.24 microseconds or so
# total time:
10000000 * 1000000000 * 1.24 / 1000/1000/60/60/24/365
# = 393
# 1 million candidates, 1 billion bg_db records
# Takes about 393 years
# Completely unbearable ah
My idea is that Bg_db
is a highly ordered string (the length of each character may be only four), there is no algorithm that allows candidates to quickly compare all the Bg_db
, seeking advice.
Upvotes: 0
Views: 138
Reputation: 3395
What you are describing above is formally called the Hamming distance. You can read more here:
https://en.wikipedia.org/wiki/Hamming_distance
To paraphrase opening statement in the above link:
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
A simple way to improve on this would be to exit the string comparison when you observe > 4 mismatches, but that would only reduce run time when there are many string comparisons that contains a large number of mismatches.
Your other option is to use more recent algorithms, such as those that implement Burrow-Wheeler transform (BWT). See complete list here:
https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-read_sequence_alignment
Using a BWT software tool, my suggestion would be to concatenate one of the sequence data sets into a single target sequence, then query it using the other set of short sequences. Reason being that BWT algorithms typically works well when the target consists of a few large sequences, such as chromosomes from the human genome. Once you obtain the alignments, you can filter out those alignments that span the positions where you joined the target sequences ie., alignments that span positions that are exact multiples of 20.
BWT-based tools are typically orders of magnitude faster than past DNA sequence alignment algorithms, and the usage case you present ie., ungapped alignment of short sequences, is what these programs were designed to optimize.
So, unless this is a homework assignment, I would recommend you research which software tool suits you needs best. Developing your own sequence alignment algorithm is difficult (from personal experience), and time spent in properly assessing existing tools will help you decide whether this time spent developing software is worthwhile.
Upvotes: 2