Reputation: 408
I'm trying to make an int (and tuple) list from a string list.
Let me explain what I'm planning to do and what makes it difficult for me to do it.
>>> STRINGS = ['GAT','GAC','ATCG','ATA','GTA']
>>> myFunc(STRINGS)
I got RESULT and NUMBERS in previous steps.
In this step, those lists should be transformed into more advanced data structures.
RESULT = ['G','A','T','C','A','T','C','G','A','T','A']
NUMBERS = [1,2,3,4,5,6,7,8,9,10,11]
[(0,1), (1,2), (2,3), (2,4), ... ] or {(0,1), (1,2), (2,3), (2,4), ... }
{(0,1):'G', (1,2):'A', (2,3):'T', (2,4):'C', ...}
Planning is difficult when the lengths of the strings vary.
Comparing characters with those of prior strings is not easy.
Transforming an int list into tuples, a Trie, a Graph...
# SUMMARY
# Sorry, this is not code.
# This shows how a string list is transformed into an int (and tuple) list.
# 'GAT' -> 'G,A,T' -> 1,2,3 -> 1,2,3 -> (0,1),(1,2),(2,3)
# 'GAC' -> '-,-,C' -> -,-,4 -> 1,2,4 -> (0,1),(1,2),(2,4)
# 'ATCG' -> 'A,T,C,G' -> 5,6,7,8 -> 5,6,7,8 -> (0,5),(5,6),(6,7),(7,8)
# 'ATA' -> '-,-,A' -> -,-,9 -> 5,6,9 -> (0,5),(5,6),(6,9)
# 'GTA' -> '-,T,A' -> -,10,11 -> 1,10,11 -> (0,1),(1,10),(10,11)
# ['GAT','GAC','ATCG','ATA','GTA']
# -> ['GAT','C','ATCG','A','TA']
# -> ['G','A','T','C','A','T','C','G','A','T','A']
# -> [1,2,3,4,5,6,7,8,9,10,11]
# -> tuple list
# -> change tuple list to ordered set
# -> apply this to Python graph and Trie structures.
I would like to apply this to the Graph and Trie structures in Python. I would be grateful for any hint or advice. Thanks.
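The numbering in the summary above appears to match ordinary trie insertion, where node 0 is the root and each newly created node gets the next integer. Here is a minimal sketch under that assumption. The name `build_trie_edges` and the internal `children` map are hypothetical and not part of my code; the ordering of the returned dict relies on Python 3.7+ insertion order.

```python
def build_trie_edges(patterns):
    """Return {(parent, child): char} for a trie whose root is node 0.

    Assumption: nodes are numbered 1, 2, 3, ... in order of creation.
    """
    children = {}   # (parent node, char) -> child node
    edges = {}      # (parent node, child node) -> char
    next_node = 1
    for pattern in patterns:
        node = 0                      # restart at the root for every string
        for char in pattern:
            key = (node, char)
            if key not in children:   # unseen branch: create a new node
                children[key] = next_node
                edges[(node, next_node)] = char
                next_node += 1
            node = children[key]      # walk down the existing or new edge
    return edges


edges = build_trie_edges(['GAT', 'GAC', 'ATCG', 'ATA', 'GTA'])
tuple_list = list(edges)              # [(0, 1), (1, 2), (2, 3), (2, 4), ...]
labels = list(edges.values())         # the RESULT list
```

The keys of `edges` give the tuple list, and the dict itself gives the labelled form `{(0,1):'G', (1,2):'A', ...}`.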
Updated on 2015.04.15
I wrote some code to get the character list (one entry per Trie node) from a string list.
def diff_idx(str1, str2):
    """
    Returns the length of the common prefix of both strings,
    i.e. the index of the first position where they differ
    >>> diff_idx('GAT','GAC')
    2
    """
    # zip stops at the shorter string, so empty inputs are handled too
    for i, (c1, c2) in enumerate(zip(str1, str2)):
        if c1 != c2:
            return i
    return min(len(str1), len(str2))
def diff_idxl(xs, x):
    """
    >>> diff_idxl(['GAT','GAC','ATCG','ATA'],'GTA')
    1
    """
    return max([diff_idx(s,x) for s in xs])
def num_seq(patterns):
    """
    >>> num_seq(['GAT','GAC','ATCG','ATA','GTA'])
    ['G', 'A', 'T', 'C', 'A', 'T', 'C', 'G', 'A', 'T', 'A']
    """
    lst = patterns[:]
    answer = [c for c in lst[0]]
    comp = [lst[0]]
    for i in range(1, len(patterns)):
        answer.extend(patterns[i][diff_idxl(comp,patterns[i]):])
        comp.append(patterns[i])
    return answer
I got the correct result with this code.
>>> num_seq(['GAT','GAC','ATCG','ATA','GTA'])
['G', 'A', 'T', 'C', 'A', 'T', 'C', 'G', 'A', 'T', 'A']
>>> # (index + 1) of each character is its node number in the Trie structure.
Updated on 2015.04.17
I wrote some additional code to get what I want.
>>> # What I want to get is this...
>>> strings = ['GAT','GACA','ATC','GATG']
>>> nseq = num_seq(strings)
>>> nseq
['G', 'A', 'T', 'C', 'A', 'A', 'T', 'C', 'G']
>>> make_matrix_trie(strings)
[[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]
My implementation of make_matrix_trie is this.
def make_matrix_trie(patterns):
    m = []
    for pat in patterns:
        m.append([0]*len(pat))
    comp = num_seq(patterns)
    comp.append(0)
    idx = 1
    for i in range(len(patterns)):
        for j in range(len(patterns[i])):
            if patterns[i][j] == comp[0]:
                m[i][j] = idx
                idx += 1
                comp.pop(0)
            else:
                m[i][j] = 0
    print (m,comp)
    return m
But the result was not what I expected.
>>> make_matrix_trie(['GAT','GACA','ATC','GATG'])
[[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [9, 0, 0, 0]]
>>> # expected result:
>>> # [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]
With some help, I think I can correct and complete my code.
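The unexpected `[9, 0, 0, 0]` row seems to come from matching each character against `comp[0]` by value: in 'GATG' the first 'G' equals the remaining 'G' in `comp`, so it takes number 9 even though it belongs to an already known prefix. One possible fix, sketched below, is to decide by position instead: zero out the prefix shared with any earlier pattern and number only the remaining characters. The names `common_prefix_len` and `make_matrix_trie_sketch` are hypothetical, and `max(..., default=0)` assumes Python 3.4 or later.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def make_matrix_trie_sketch(patterns):
    """Row i holds 0 for characters shared with an earlier pattern's prefix,
    and a fresh node number for every character that creates a new node."""
    matrix = []
    next_node = 1
    for i, pat in enumerate(patterns):
        # Longest prefix of pat already covered by any earlier pattern
        shared = max((common_prefix_len(prev, pat) for prev in patterns[:i]),
                     default=0)
        row = [0] * shared
        for _ in range(len(pat) - shared):
            row.append(next_node)
            next_node += 1
        matrix.append(row)
    return matrix


result = make_matrix_trie_sketch(['GAT', 'GACA', 'ATC', 'GATG'])
# result == [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]
```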
Upvotes: 1
Views: 274
Reputation: 688
I haven't figured out your masking and integer assignment scheme. Does this have to do with nucleotides? Some elaboration would help.
I can help with the final step, however. Here's a one-liner to convert your integer lists to "tuple lists."
def listToTupleList(l):
    return [(l[i-1],l[i]) if i!=0 else (0,l[i]) for i in range(len(l))]
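If the conditional inside the comprehension is hard to read, the same pairing can be sketched with `zip`, assuming each input list is one root-to-leaf path such as `[1, 2, 4]` and that 0 stands for the root. The name `path_to_edges` is just for illustration.

```python
def path_to_edges(path):
    """Pair each node with its predecessor; the first node's parent is the root 0."""
    return list(zip([0] + path[:-1], path))


edges_gac = path_to_edges([1, 2, 4])      # [(0, 1), (1, 2), (2, 4)]
edges_gta = path_to_edges([1, 10, 11])    # [(0, 1), (1, 10), (10, 11)]
```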
Upvotes: 2