Grendel
Grendel

Reputation: 597

compare 2 dataframe with pandas

It is the first time I use pandas and I do not really know how to deal with my problematic.

In fact I have 2 data frame:

import pandas 

blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")

Here is an exemple of their contents:

 >>> cluster
       cluster_name            seq_names
0                 1           g1.t1_0035
1                 1      g1.t1_0035_0042
2            119365           g1.t1_0042
3             90273      g1.t1_0042_0035
4             71567          g10.t1_0035
5             37976     g10.t1_0035_0042
6             22560          g10.t1_0042
7             90280     g10.t1_0042_0035
8             82698         g100.t1_0035
9             47392    g100.t1_0035_0042
10            28484         g100.t1_0042
11            22580    g100.t1_0042_0035
12            19474        g1000.t1_0035
13             5770   g1000.t1_0035_0042
14            29708        g1000.t1_0042
15            99776   g1000.t1_0042_0035
16             6283       g10000.t1_0035
17            39828  g10000.t1_0035_0042
18            25383       g10000.t1_0042
19           106614  g10000.t1_0042_0035
20             6285       g10001.t1_0035
21            13866  g10001.t1_0035_0042
22           121157       g10001.t1_0042
23           106615  g10001.t1_0042_0035
24             6286       g10002.t1_0035
25              113  g10002.t1_0035_0042
26            25397       g10002.t1_0042
27           106616  g10002.t1_0042_0035
28             4643       g10003.t1_0035
29            13868  g10003.t1_0035_0042
...             ...                  ...

and

    [78793 rows x 2 columns]
    >>> blast
                      qseqid               sseqid  pident  length  mismatch  \
0            g1.t1_0035_0042      g1.t1_0035_0042   100.0     286         0   
1            g1.t1_0035_0042           g1.t1_0035   100.0     257         0   
2            g1.t1_0035_0042        g9307.t1_0035    26.9     134        65   
3            g2.t1_0035_0042      g2.t1_0035_0042   100.0     445         0   
4            g2.t1_0035_0042           g2.t1_0035    95.8     451         3   
5            g2.t1_0035_0042  g24520.t1_0042_0035    61.1     429       137   
6            g2.t1_0035_0042        g9924.t1_0042    61.1     429       137   
7            g2.t1_0035_0042        g1838.t1_0035    86.2      29         4   
8            g3.t1_0035_0042      g3.t1_0035_0042   100.0     719         0   
9            g3.t1_0035_0042           g3.t1_0035    84.7     753        62   
10           g4.t1_0035_0042      g4.t1_0035_0042   100.0     242         0   
11           g4.t1_0035_0042           g3.t1_0035    98.8     161         2   
12           g5.t1_0035_0042      g5.t1_0035_0042   100.0     291         0   
13           g5.t1_0035_0042           g3.t1_0035    93.1     291         0   
14           g6.t1_0035_0042      g6.t1_0035_0042   100.0     152         0   
15           g6.t1_0035_0042           g4.t1_0035   100.0     152         0   
16           g7.t1_0035_0042      g7.t1_0035_0042   100.0     216         0   
17           g7.t1_0035_0042           g5.t1_0035    98.1     160         3   
18           g7.t1_0035_0042       g11143.t1_0042    46.5     230        99   
19           g7.t1_0035_0042  g27537.t1_0042_0035    40.8     233       111   
20        g3778.t1_0035_0042   g3778.t1_0035_0042   100.0      86         0   
21        g3778.t1_0035_0042        g6174.t1_0035    98.0      51         1   
22        g3778.t1_0035_0042  g20037.t1_0035_0042   100.0      50         0   
23        g3778.t1_0035_0042       g37190.t1_0035   100.0      50         0   
24        g3778.t1_0035_0042  g15112.t1_0042_0035    66.0      53        18   
25        g3778.t1_0035_0042        g6061.t1_0042    66.0      53        18   
26       g18109.t1_0035_0042  g18109.t1_0035_0042   100.0      86         0   
27       g18109.t1_0035_0042       g33071.t1_0035   100.0      81         0   
28       g18109.t1_0035_0042       g32810.t1_0035    96.4      83         3   
29       g18109.t1_0035_0042  g17982.t1_0035_0042    98.6      72         1   
...                      ...                  ...     ...     ...       ... 

if you stay focus on the cluster database, the first column correspond to the cluster ID and inside those clusters there are several sequences ID. What I need to to is first to split all my cluster (in R it would be like: liste=split(x = data$V2, f = data$V1) )

And then, creat a function which displays the most similarity paires sequence within each cluster. here is an exemple:

let's say I have two clusters (dataframe cluster):

cluster 1: 
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7

...

On the blast dataframe there is on the 3th column the similarity between all sequences (all against all), so something like:

seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100

and what I need to get is :

cluster 1 (best paired sequences): 
seq 1 vs seq 2

cluster2 (best paired sequences):
seq 5 vs seq6

...

So as you can see, I do not want to take into account the sequences paired by themselves

IF someone could give me some clues it would be fantastic.

Thank you all.

Upvotes: 0

Views: 201

Answers (2)

SpghttCd
SpghttCd

Reputation: 10860

Firstly I assume that there are no Pairings in 'blast' with sequences from two different Clusters. In other words: in this solution the cluster-ID of a pairing will be evaluated by only one of the two sequence IDs.

Including cluster information and pairing information into one dataframe:

data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')

Then the data should only contain pairings of different sequences:

data = data[data['qseqid']!=data['sseqid']]

To ignore pairings which have the same substrings in their seqid, the most readable way would be to add data columns with these data:

data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]

Now equal spec-values can be filtered the same way like it was done with equal seqids above:

data = data[data['qspec']!=data['sspec']]

In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:

data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]

The only drawback here - except the above mentioned assumption - is, that if there are several exactly equal max-values, only one of them would be taken into account.

Note: if you don't want the spec-columns to be of type string, you could easiliy turn them into integers on the fly by:

import numpy as np
data['qspec'] = [np.int(seqid.split('_')[1]) for seqid in data['qseqid'].values]

Upvotes: 3

Tyler K
Tyler K

Reputation: 338

This merges the dataframes based first on sseqid, then on qseqid, and then returns results_df. Any with 100% match are filtered out. Let me know if this works. You can then order by cluster name.

blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names',right_on='sseqid')
results_df = results_df.append(cluster.merge(blast, left_on='seq_names',right_on='qseqid'))

Upvotes: 1

Related Questions