André Da Silva

Reputation: 47

Comparison between one element and all the others of a DataFrame column

I have a list of tuples which I turned into a DataFrame with thousands of rows, like this:

                                          frag         mass  prot_position
0                               TFDEHNAPNSNSNK  1573.675712              2
1                                EPGANAIGMVAFK  1303.659458             29
2                                         GTIK   417.258734              2
3                                     SPWPSMAR   930.438172             44
4                                         LPAK   427.279469             29
5                          NEDSFVVWEQIINSLSALK  2191.116099             17
...

and I have the following rule:

def are_dif(m1, m2, ppm=10):
    # True if the two masses differ by more than `ppm` parts per million
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
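For example (with the first mass from the table and a second, made-up mass that is within 10 ppm of it):

are_dif(1573.675712, 1303.659458)   # True  -> clearly different masses
are_dif(1573.675712, 1573.675700)   # False -> within 10 ppm, treated as the same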

So, I only want the "frag"s whose mass differs from all the other fragments' masses. How can I achieve that selection?

Then, I have a list named "pinfo" that contains:

d = {'id': id, 'seq': seq_code, "1HW_fit": hits_fit}
# one dictionary for each protein
# each dictionary has the position of the protein that it describes

So, I want to add 1 to the "hits_fit" value in the dictionary corresponding to that protein.

Upvotes: 3

Views: 73

Answers (3)

JohnE

Reputation: 30414

If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by just sorting. First though, let me adjust the data to have a mix of close and far values for mass:

   Unnamed: 0                 frag         mass  prot_position
0           0       TFDEHNAPNSNSNK  1573.675712              2
1           1        EPGANAIGMVAFK  1573.675700             29
2           2                 GTIK   417.258734              2
3           3             SPWPSMAR   417.258700             44
4           4                 LPAK   427.279469             29
5           5  NEDSFVVWEQIINSLSALK  2191.116099             17

Then I think you can do something like the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close mass is to the nearest neighbors:

ppm = .00001                 # 10 ppm expressed as a fraction
df = df.sort_values('mass')

df['pdiff'] = (df.mass - df.mass.shift()) / df.mass

   Unnamed: 0                 frag         mass  prot_position         pdiff
3           3             SPWPSMAR   417.258700             44           NaN
2           2                 GTIK   417.258734              2  8.148421e-08
4           4                 LPAK   427.279469             29  2.345241e-02
1           1        EPGANAIGMVAFK  1573.675700             29  7.284831e-01
0           0       TFDEHNAPNSNSNK  1573.675712              2  7.625459e-09
5           5  NEDSFVVWEQIINSLSALK  2191.116099             17  2.817926e-01

The first and last data lines make this a little tricky, so the next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but it might need to be tweaked for other cases (though only as far as the first and last lines of data are concerned).

df = df.iloc[list(range(len(df))) + [-1]].bfill()
df[ (df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm) ]

Results:

   Unnamed: 0                 frag         mass  prot_position     pdiff
4           4                 LPAK   427.279469             29  0.023452
5           5  NEDSFVVWEQIINSLSALK  2191.116099             17  0.281793

Sorry, I don't understand the second part of the question at all.

Edit to add: As mentioned in a comment to @AmiTavory's answer, I think possibly the sorting approach and groupby approach could be combined for a simpler answer than this. I might try at a later time, but everyone should feel free to give this a shot themselves if interested.
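A rough, untested sketch of one way that combination might look, reusing the df from the example above: sort by mass and compare each row to both of its sorted neighbours, which avoids the first/last-row workaround entirely.

ppm = 10 * 1e-6  # 10 ppm expressed as a fraction

# sketch only: compare each mass to both sorted neighbours
df = df.sort_values('mass').reset_index(drop=True)
prev_diff = (df['mass'] - df['mass'].shift()).abs() / df['mass']
next_diff = (df['mass'].shift(-1) - df['mass']).abs() / df['mass']

# NaN means "no neighbour on that side", which counts as being far enough away
mask = (prev_diff.isna() | (prev_diff > ppm)) & (next_diff.isna() | (next_diff > ppm))
unique_frags = df[mask]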

Upvotes: 2

Liron

Reputation: 545

Another solution is to create a duplicate of your list (if you need to preserve it for further processing later), iterate over it, and remove all elements that do not satisfy your rule (m1 & m2).

You will get a new list with all unique masses.

Just don't forget that if you do need to use the original list later, you will need to use deepcopy.
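A rough sketch of that idea, assuming the original list of tuples is called rows and reusing the are_dif rule from the question:

import copy

rows_copy = copy.deepcopy(rows)  # keep the original list untouched for later use

# keep only the fragments whose mass differs from every other fragment's mass
unique = [r1 for i, r1 in enumerate(rows_copy)
          if all(are_dif(r1[1], r2[1]) for j, r2 in enumerate(rows_copy) if i != j)]

Note that this compares every pair of rows, so it is quadratic in the number of fragments; the DataFrame-based answers will scale better for thousands of rows.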

Upvotes: 0

Ami Tavory

Reputation: 76297

Here's something that's slightly different from what you asked, but it is very simple, and I think gives a similar effect.

Using numpy.round, you can create a new column:

import numpy as np

df['roundedMass'] = np.round(df.mass, 6)

Following that, you can group the frags by the rounded mass and use nunique to count the distinct frags in each group. Then filter for the groups of size 1.

So, the number of frags per bin is:

df.frag.groupby(np.round(df.mass, 6)).nunique()
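And to keep only the frags that sit alone in their rounded-mass bin, a small untested sketch using transform so the counts line up with the original rows:

import numpy as np

# number of distinct frags sharing each rounded-mass bin, aligned with df's rows
bin_counts = df.frag.groupby(np.round(df.mass, 6)).transform('nunique')

unique_frags = df[bin_counts == 1]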

Upvotes: 1
