Larynx

Reputation: 408

Making an integer (and tuple) list from a string list in Python

I'm trying to make an int (and tuple) list from a string list.
Let me explain what I'm planning to do and what makes it difficult for me to do it.

My Coding Plan

A. My function (myFunc) takes a string list as its argument.

   >>> STRINGS = ['GAT','GAC','ATCG','ATA','GTA']  
   >>> myFunc(STRINGS)

B. Then myFunc arranges all the characters in a 'special' way and returns a new character list (RESULT).

1) 'GAT' - the first string
2) 'GAC' - the second string
3) Repeat this process for all the remaining strings in STRINGS.

C. Transform NUMBERS and RESULT into advanced data structures.

I got RESULT and NUMBERS in the previous steps.
In this step, those lists should be transformed into advanced data structures.

RESULT = ['G','A','T','C','A','T','C','G','A','T','A']  
NUMBERS = [1,2,3,4,5,6,7,8,9,10,11]  
[(0,1), (1,2), (2,3), (2,4), ... ] or {(0,1), (1,2), (2,3), (2,4), ... }  
{(0,1):'G', (1,2):'A', (2,3):'T', (2,4):'C', ...}  

What is difficult for me to implement

The plan gets complicated when the lengths of the strings vary.
Comparing characters with those of the prior strings is not easy.
Transforming an int list into a tuple list, Trie, Graph...

# SUMMARY  
# Sorry, this is not code.
# This shows how a string list is transformed into an int (and tuple) list.

# 'GAT'  ->  'G,A,T'  ->  1,2,3   ->  1,2,3  ->  (0,1),(1,2),(2,3)  
# 'GAC'  ->  '-,-,C'  ->  -,-,4   ->  1,2,4  ->  (0,1),(1,2),(2,4)  
# 'ATCG' -> 'A,T,C,G' -> 5,6,7,8  -> 5,6,7,8 ->  (0,5),(5,6),(6,7),(7,8)  
# 'ATA'  ->  '-,-,A'  ->  -,-,9   ->  5,6,9  ->  (0,5),(5,6),(6,9)  
# 'GTA'  ->  '-,T,A'  ->  -,10,11 -> 1,10,11 ->  (0,1),(1,10),(10,11)  

# ['GAT','GAC','ATCG','ATA','GTA']
# -> ['GAT','C','ATCG','A','TA']
# -> ['G','A','T','C','A','T','C','G','A','T','A']
# -> [1,2,3,4,5,6,7,8,9,10,11]
# -> tuple list
# -> change tuple list to ordered set
# -> apply this to Python graph and Trie structures.

I would like to apply this to Graph and Trie structures in Python. I would be grateful for any hint or advice. Thanks.
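The pipeline in the summary can be sketched as a direct Trie insertion. This is only a minimal sketch under stated assumptions, not the asker's method: the function name `build_trie_edges` and the helper dictionaries are hypothetical, node 0 is assumed to be the root, and new nodes are assumed to be numbered 1, 2, 3, ... in the order they are created.

```python
def build_trie_edges(patterns):
    """Insert each pattern into a trie and return {(parent, child): char}.

    Assumption: node 0 is the root, and every newly created node gets the
    next unused integer, starting from 1.
    """
    children = {}   # (parent node, char) -> child node
    edges = {}      # (parent node, child node) -> char labelling that edge
    next_node = 1
    for pattern in patterns:
        node = 0    # every pattern starts at the root
        for char in pattern:
            if (node, char) not in children:
                # This character is not covered by an earlier pattern,
                # so it creates a new node and a new edge.
                children[(node, char)] = next_node
                edges[(node, next_node)] = char
                next_node += 1
            node = children[(node, char)]
    return edges


edges = build_trie_edges(['GAT', 'GAC', 'ATCG', 'ATA', 'GTA'])
print(list(edges))           # the tuple list, in creation order
print(list(edges.values()))  # the characters, in creation order
```

On Python 3.7 or later, dictionaries keep insertion order, so `list(edges)` is already an ordered, duplicate-free tuple list, and for the example input the edge labels read in order reproduce RESULT.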


Updated in 2015.04.15
I wrote some code to get an int list from a string list.

def diff_idx(str1, str2):
    """
    Return the length of the common prefix of str1 and str2,
    i.e. the first index at which the two strings differ.
    >>> diff_idx('GAT','GAC')
    2
    """
    for i, (c1, c2) in enumerate(zip(str1, str2)):
        if c1 != c2:
            return i
    # No mismatch found: the shorter string is a prefix of the longer one.
    return min(len(str1), len(str2))

def diff_idxl(xs, x):
    """
    Return the longest common-prefix length between x and any string in xs.
    >>> diff_idxl(['GAT','GAC','ATCG','ATA'],'GTA')
    1
    """
    return max(diff_idx(s, x) for s in xs)

def num_seq(patterns):
    """
    Return the characters that create new Trie nodes, in insertion order.
    >>> num_seq(['GAT','GAC','ATCG','ATA','GTA'])
    ['G', 'A', 'T', 'C', 'A', 'T', 'C', 'G', 'A', 'T', 'A']
    """
    answer = list(patterns[0])
    comp = [patterns[0]]
    for pat in patterns[1:]:
        # Skip the prefix already covered by an earlier pattern.
        answer.extend(pat[diff_idxl(comp, pat):])
        comp.append(pat)
    return answer

This code gives the correct result.

>>> num_seq(['GAT','GAC','ATCG','ATA','GTA'])
    ['G', 'A', 'T', 'C', 'A', 'T', 'C', 'G', 'A', 'T', 'A']
>>> # (index + 1) is the node number in the Trie structure.

Updated in 2015.04.17
I wrote some additional code to get what I want.

>>> # What I want to get is this... 
>>> strings = ['GAT','GACA','ATC','GATG']
>>> num_seq(strings)
    ['G','A','T','C','A','A','T','C','G']
>>> make_matrix_trie(strings)
    [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]

My implementation of make_matrix_trie is this:

def make_matrix_trie(patterns):
    m = []
    for pat in patterns:
        m.append([0]*len(pat))

    comp = num_seq(patterns)
    comp.append(0)

    idx = 1
    for i in range(len(patterns)):
        for j in range(len(patterns[i])):
            if patterns[i][j] == comp[0]:
                m[i][j] = idx
                idx += 1
                comp.pop(0)
            else:
                m[i][j] = 0
            print (m,comp)
    return m

But the result was not what I expected.

>>> make_matrix_trie(['GAT','GACA','ATC','GATG'])
    [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [9, 0, 0, 0]]
>>> # expected result:
>>> # [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]

With some help, I think I can correct and complete my code.
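One possible reading of the mismatch: when the last pattern 'GATG' is processed, the only pending character is 'G', and the test `patterns[i][j] == comp[0]` already succeeds at position 0 because 'GATG' also starts with 'G'. The sketch below avoids the character comparison and instead counts how many leading characters are shared with an earlier pattern. It is a sketch under assumptions, not a definitive fix: the names `common_prefix_len` and `make_matrix_trie_sketch` are hypothetical, and it assumes that shared prefix positions should hold 0.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def make_matrix_trie_sketch(patterns):
    """Build one row per pattern: 0 for characters already covered by an
    earlier pattern, a fresh node number for every character that adds a
    new trie node."""
    matrix = []
    seen = []        # patterns processed so far
    next_node = 1    # next unused node number
    for pat in patterns:
        # Longest prefix shared with any earlier pattern (0 if none yet).
        shared = max((common_prefix_len(s, pat) for s in seen), default=0)
        new_count = len(pat) - shared
        row = [0] * shared + list(range(next_node, next_node + new_count))
        next_node += new_count
        matrix.append(row)
        seen.append(pat)
    return matrix


print(make_matrix_trie_sketch(['GAT', 'GACA', 'ATC', 'GATG']))
# [[1, 2, 3], [0, 0, 4, 5], [6, 7, 8], [0, 0, 0, 9]]
```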

Upvotes: 1

Views: 274

Answers (1)

Static Void

Reputation: 688

I haven't figured out your masking and integer assignment scheme. Does this have to do with nucleotides? Some elaboration would help.

I can help with the final step, however. Here's a one-liner to convert your integer lists to "tuple lists."

def listToTupleList(l):
    return [(l[i-1],l[i]) if i!=0 else (0,l[i]) for i in range(len(l))]
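
As a usage sketch, the one-liner can be applied to each per-string integer row from the question's summary (1,2,3 / 1,2,4 / 5,6,7,8 / 5,6,9 / 1,10,11) and the results merged into an ordered, duplicate-free tuple list. The variable names `rows`, `seen`, and `ordered_edges` are only illustrative, and the rows are assumed to carry the full node path, not the masked version with dashes or zeros.

```python
def listToTupleList(l):
    # Pair each element with its predecessor; the first element is paired with 0.
    return [(l[i-1], l[i]) if i != 0 else (0, l[i]) for i in range(len(l))]


# Full node paths per string, taken from the question's summary table.
rows = [[1, 2, 3], [1, 2, 4], [5, 6, 7, 8], [5, 6, 9], [1, 10, 11]]

ordered_edges = []   # edges in first-seen order
seen = set()         # fast membership test for duplicates
for row in rows:
    for edge in listToTupleList(row):
        if edge not in seen:
            seen.add(edge)
            ordered_edges.append(edge)

print(ordered_edges)
```

A plain `set` would drop the duplicates too, but it does not keep the order, which is why the list and the set are kept side by side here.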

Upvotes: 2
