kdba
kdba

Reputation: 433

scoring a string based on a dictionary set of values

I have a string

strA='ABCCBAAABBCCABABC'

and a dictionary(list of lists) of length less than that of strA (say length 4) which has the probability of occurrence of of each character at a given position:

{"A":[0.4, 0.5, 0.1, 0.2], "B": [0.3, 0.5, 0.3, 0.6], "C":[0.3, 0.0, 0.6, 0.2]}

I would now like to have a sliding window of length 4 and score my entire strA based on this dictionary. So the first 4 characters from srtA are ABCC and based on the dictionary, the score would be:

(p(A) at pos0)*(p(B) at pos1)*(p(C)at pos2)*(p(C) at pos(3))

=0.4*0.5*0.6*0.2 =0.024

Now I would like to slide this window in increments of 1 and find the highest score possible along the entire string for this window length.

right now the way I am doing is:

    score_new=0
    window_len=4
    for a in range(0,(len(strA)-window_len):
            slide=strA[a:a+window_len]

            score=1
            for b in range(0,len(silde)):
                score=score*(dict[slide[b]][b])

            if score>score_new:
                score_new=score

        return score_new

which works but is time consuming. I have to do this scoring for 10000 strings with 1000 different windows of variable lengths each having a different probability dictionary.

Is there a quicker way to score the string based on a dictionary of probabilities and only return the highest score?

Upvotes: 2

Views: 122

Answers (2)

csunday95
csunday95

Reputation: 1329

Here's what I came up with:

def compute_best_score(test_string, prob_dict, window_size):
    # assertion to check for uniform lengths
    assert(all([len(prob_list) == window_size for prob_list in prob_dict.values()]))
    best_window_index = -1
    best_window_score = 0
    for index, letter in enumerate(test_string):
        # ignore last region of test_string as a window cannot be computed
        if index > len(test_string) - window_size:
            break
        current_window_score = 1.0
        for score_index in range(window_size):
            current_char = test_string[index + score_index]
            letter_scores = pdict[current_char]
            current_window_score *= letter_scores[score_index]
        if current_window_score > best_window_score:
            best_window_score = current_window_score
            best_window_index = index
    return best_window_score, best_window_index

This one seems to perform better in profiling under the python I'm using than the reduce based answer (which surprised me). Here's a link to a simple profiling run I did:

http://tpcg.io/S05WeL

Upvotes: 1

Sunitha
Sunitha

Reputation: 12015

>>> from functools import reduce
>>> strA='ABCCBAAABBCCABABC'
>>> d = {"A":[0.4, 0.5, 0.1, 0.2], "B": [0.3, 0.5, 0.3, 0.6], "C":[0.3, 0.0, 0.6, 0.2]}
>>> n = 4
>>>
>>> f = lambda t: reduce(lambda a,b: a*b, [d[c][i] for i,c in enumerate(t)])
>>> plist = map(f, zip(*[strA[i:] for i in range(n)]))
>>> plist
[0.024, 0.0, 0.0, 0.003, 0.003, 0.012000000000000002, 0.036, 0.012, 0.018, 0.0, 0.0, 0.009, 0.012000000000000002, 0.009]
>>> 
>>> max(plist)
0.036
>>> 

Upvotes: 0

Related Questions