Ravnclaw

Reputation: 43

Find consecutive series in list of tuples in python

I am struggling with the following issue: I would like to write some small code to deisotope mass spec data.

For this, I check whether the difference between two signals equals the mass of a proton divided by the charge state. So far, so easy.
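
For two signals, that check looks roughly like this minimal sketch (using the 0.01 tolerance from the code further below):

proton = 1.0078250319  # mass of a proton in Da

def is_isotope_pair(mz_a, mz_b, charge, tol=0.01):
    # True if the two m/z values are one isotope apart at this charge state
    return abs((mz_b - mz_a) - proton / charge) < tol

print(is_isotope_pair(650.34521484375, 650.84228515625, 2))  # True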

I am now struggling to find series of more than two peaks.

I broke the problem down to having a list of tuples, where a series is n tuples in which the last number of the previous tuple equals the first number of the current tuple.

From this:

[(1,2), (2,3), (4,5), (7,9), (8,10), (9,11)]

To this:

[(1,2,3), (4,5), (7,9,11), (8,10)]

Simply sorting will fail, as there might be jumps (7 --> 9) and an intermediate signal (8, 10) that belongs to a different series.

Here is some test data:

import numpy as np

proton = 1.0078250319 

data = [
 (632.3197631835938, 2244.3374),  #0
 (632.830322265625, 2938.797),    #1
 (634.3308715820312, 1567.1385),  #2
 (639.3309326171875, 80601.41),   #3
 (640.3339233398438, 23759.367),  #4
 (641.3328247070312, 4771.9946),  #5
 (641.8309326171875, 2735.902),   #6
 (642.3365478515625, 4600.567),   #7
 (642.84033203125, 1311.657),     #8
 (650.34521484375, 11952.237),    #9
 (650.5, 1),                      #10 
 (650.84228515625, 10757.939),    #11
 (651.350341796875, 6324.9023),   #12
 (651.8455200195312, 1398.8452),  #13
 (654.296875, 1695.3457)]         #14

mz, i = zip(*data)

mz = np.array(mz)
i = np.array(i)

arr = np.triu(mz - mz[:, np.newaxis])

charge = 2

So in the first step, I am only interested in the mz values. I subtract all values from all values and isolate the upper triangle.
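
On a toy array, that broadcasting step looks like this:

import numpy as np

a = np.array([1.0, 2.0, 4.0])

a - a[:, np.newaxis]   # element [i, j] holds a[j] - a[i]
# array([[ 0.,  1.,  3.],
#        [-1.,  0.,  2.],
#        [-3., -2.,  0.]])

np.triu(a - a[:, np.newaxis])  # keep each pairwise difference only once
# array([[0., 1., 3.],
#        [0., 0., 2.],
#        [0., 0., 0.]])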

To check whether two signals are actually within the correct mass difference, I then use the following code:

>>> pairs = tuple(zip(*np.where(abs(arr - (proton / charge)) < 0.01)))
>>> pairs
((0, 1), (5, 6), (6, 7), (7, 8), (9, 11), (11, 12), (12, 13))
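
To see which m/z values those index pairs correspond to (just a sanity check, not part of the pipeline):

for a, b in pairs:
    print(f"{mz[a]:.4f} --> {mz[b]:.4f}  (diff {mz[b] - mz[a]:.4f})")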

Now, the corresponding signals are clear by eye:

Peak 1: 0 to 1

Peak 2: 5 to 8

Peak 3: 9 to 13, without 10.

So in principle, I would like to compare the second value of each tuple with the first value of every other tuple, to identify consecutive sequences.

What I tried is to flatten the list, remove duplicates, and look for consecutive runs in this 1D list. But this fails, as a peak from 5 to 9 is found.
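
Sketched out, that attempt looks something like this:

flat = sorted({int(x) for x in np.ravel(pairs)})
# [0, 1, 5, 6, 7, 8, 9, 11, 12, 13]
# the run 5..9 looks consecutive, wrongly merging peak 2 and peak 3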

I would like to have a vectorized solution, as this calculation is done for 100-500 signals, for multiple charge states, in 30,000+ spectra.

I am pretty sure this has been asked before, but I was not able to find a suitable solution.

Eventually, these series are then used to check the intensities of the corresponding peaks, sum them, and use the biggest initial value to assign the deisotoped peak.

Thx Christian

PS: If there are any suggestions for the existing code, I am happy to learn. I am pretty new to vectorized calculations and usually write tons of for loops, which take ages to finish.

Upvotes: 2

Views: 88

Answers (4)

msi_gerva

Reputation: 2078

You can also try this. Unfortunately it is not an O(n) solution, but rather O(n log(n)) or O(n*n) due to the double loop:

#!/usr/bin/env ipython
# --------------------
datain = [(1, 2), (2, 3), (4, 5), (7, 9), (8, 10), (9, 11)]
dataout = []
# ---------------------------------------------------
for ii, val_a in enumerate(datain):
    appendval = set(val_a)
    for val_b in datain[ii + 1:]:
        # -------------------------------------------
        # merge any later tuple that shares a value with the
        # group collected so far, and drop it from the input
        if appendval.intersection(val_b):
            appendval = appendval.union(val_b)
            datain.remove(val_b)
        # -------------------------------------------
    dataout.append(tuple(sorted(appendval)))
# ---------------------------------------------------
# ---------------------------------------------------
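
For the example input this gives:

print(dataout)
# [(1, 2, 3), (4, 5), (7, 9, 11), (8, 10)]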

Upvotes: 0

Ravnclaw

Reputation: 43

Thanks for all the responses.

I really liked the networkx approach, but interestingly, it is slower than the O(n) loop solution. I benchmarked both with my original data as well as with a randomized training set. In both scenarios the results were equal (which is a prerequisite), but the O(n) solution was ~4 to 10 times faster.

Here are the benchmarks:

networkx
size: 10
22.9 µs ± 1.92 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 100
165 µs ± 4.59 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 1000
2.08 ms ± 52.4 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 10000
23.1 ms ± 499 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 100000
350 ms ± 6.12 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 1000000
4.82 s ± 120 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)

loop
size: 10
3.31 µs ± 418 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 100
35.3 µs ± 5.68 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 1000
350 µs ± 22.2 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 10000
3.95 ms ± 38.5 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 100000
71.6 ms ± 278 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
size: 1000000
1.11 s ± 30.8 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)

(benchmark plot: runtime vs. input size for both solutions)

I think at very large sample sizes, NetworkX might perform better, but in my scenario (small samples of < 1000 initial signals, run very often) the loop solution wins.

And here is the code to reproduce it:

import numpy as np
import networkx as nx

def p1(edge):
    # from https://stackoverflow.com/a/74535322/14571316
    
    G = nx.Graph()
    G.add_edges_from(edge)

    # Use connected_components to detect the subgraphs.
    d = list(nx.connected_components(G)) 
    
    return d

def p2(pairs):
    # from https://stackoverflow.com/a/74535164/14571316 
    d = {}
    for t in pairs:
        if t[0] in d:
            d[t[0]].append(t[1])
            d[t[1]] = d.pop(t[0])
        else:
            d[t[1]] = list(t)
    signals = tuple(tuple(v) for v in d.values())
    
    return signals

print("networkx")
for size in [10, 100, 1000, 10000, 100000, 1000000]:
    print(f'size: {size}')
    l1 = np.random.randint(0, high=int(size/2), size=size)
    l2 = np.random.randint(0, high=int(size/2), size=size)
    pairs = tuple(zip(l1, l2))
    %timeit -n10 -r2 p1(pairs)
    
print("loop")
for size in [10, 100, 1000, 10000, 100000, 1000000]:
    print(f'size: {size}')
    l1 = np.random.randint(0, high=int(size/2), size=size)
    l2 = np.random.randint(0, high=int(size/2), size=size)
    pairs = tuple(zip(l1, l2))
    %timeit -n10 -r2 p2(pairs)
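
The %timeit magic needs IPython; to run the benchmark as a plain script, the timeit module gives a rough equivalent:

import timeit

t = timeit.timeit(lambda: p2(pairs), number=10) / 10
print(f"{t * 1e6:.1f} µs per loop")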

Upvotes: 1

obchardon

Reputation: 10792

In graph theory, your problem would be phrased as "How do I find all disconnected subgraphs in a graph?".

So why not use a network analysis library such as networkx:

import networkx as nx
# Your tuples become the edges of the graph.
edge = [(1,2), (2,3), (4,5), (7,9), (8,10), (9,11)]

# We create the graph
G = nx.Graph()
G.add_edges_from(edge)

# Use connected_components to detect the subgraphs.
d = list(nx.connected_components(G)) 

And we obtain as expected:

[{1, 2, 3}, {4, 5}, {7, 9, 11}, {8, 10}]

With 4 subgraphs.

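If tuples are needed to match the desired output, the sets can be converted (a set's iteration order is arbitrary, hence the sort):

signals = [tuple(sorted(c)) for c in d]
# [(1, 2, 3), (4, 5), (7, 9, 11), (8, 10)]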

Upvotes: 5

CodeKorn

Reputation: 298

Try:

d = {}
pairs = [(1,2), (2,3), (4,5), (7,9), (8,10), (9,11)]
for t in pairs:
    if t[0] in d:
        # t extends an existing chain: append the new end and
        # re-key the chain by that end value
        d[t[0]].append(t[1])
        d[t[1]] = d.pop(t[0])
    else:
        # t starts a new chain, keyed by its end value
        d[t[1]] = list(t)
signals = tuple(tuple(v) for v in d.values())

Though the solution isn't vectorized, it runs in O(n) time. Note that it only works if your pairs are sorted.
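
If your pairs might not come pre-sorted, sorting them first should restore that assumption, since each link of a chain is (previous end, new end) and therefore appears after the link it extends:

pairs = sorted(pairs)  # sort by first element before running the loop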

Upvotes: 2
