Reputation: 597

Filtering a dataframe of BLAST sequences to get within each cluster the maximum pident_x

I have a problem, I need to parse the following dataframe:

    cluster_name    qseqid  sseqid  pident_x    qstart  qend    sstar   send
2   1   seq1_0035_0035  seq13_0042_0035 0.73    42  133 46  189
3   1   seq1_0035_0035  seq13_0042_0035 0.73    146 283 287 389
4   1   seq1_0035_0035  seq13_0042_0035 0.73    301 478 402 503
5   1   seq13_0042_0035 seq1_0035_0035  0.73    46  189 42  133
6   1   seq13_0042_0035 seq1_0035_0035  0.73    287 389 146 283
7   1   seq13_0042_0035 seq1_0035_0035  0.73    402 503 301 478
8   2   seq4_0042_0035  seq2_0035_0035  0.71    256 789 125 678
9   2   seq4_0042_0035  seq2_0035_0035  0.71    802 1056    706 985
10  2   seq4_0042_0035  seq7_0035_0042  0.83    123 745 156 723
12  4   seq11_0035_0035 seq14_0042_0035 0.89    145 647 236 921
13  4   seq11_0035_0035 seq17_0042_0042 0.97    148 623 241 1002
14  5   seq17_0035_0042 seq17_0042_0042 0.94    188 643 179 746

Explanation of the dataframe and blast output:

cluster_name : is the cluster where one, two or more paired sequences are present.
qseqid : is the name of one sequence
sseqid : is the name of another sequence. These make one comparison qseqid vs sseqid
pident_x : is the score after the comparison (alignment) of these two sequences, 1 means that they are identical. When blast aligns the two sequence, it gives me coordinates of where my sequences are aligned ("homologous") in my alignment, for example if I have:
```
         10            24
```
seq1 : AAATTTCCCGGGATGCGATGACGATGAAAAAATTTGG xxxxxxxxx!!!!!!!!!!!!!!!xxxxxxxxxxxxx seq2 : GATGAGATCGGGATGCGATGAGGAGATAGAGATAGAG where x is a difference and ! is a match, blast will give me: qstart (start of the first seq) : 10 qend (endof the first seq) : 24 sstar (start of the second seq) : 10 send (end of the second seq) : 24

Note: this is an example but it does not necessarily begins at 0.

and what I actually want is only get within each cluster the maximum pident_x but the issue is that as you can see I can have reversed sequences (if you take a look at the 2,3,4 and 5,6,7 they are the same but reversed) and what I need to do is to keep only one for exemple only the line 2,3 and 4 because blast will compare every sequence, even reciprocals ones.

The output would be then:

cluster_name    qseqid  sseqid  pident_x    qstart  qend    sstar   send
    2   1   seq1_0035_0035  seq13_0042_0035 0.73    42  133 46  189
    3   1   seq1_0035_0035  seq13_0042_0035 0.73    146 283 287 389
    4   1   seq1_0035_0035  seq13_0042_0035 0.73    301 478 402 503
    10  2   seq4_0042_0035  seq7_0035_0042  0.83    123 745 156 723
    13  4   seq11_0035_0035 seq17_0042_0042 0.97    148 623 241 1002
    14  5   seq17_0035_0042 seq17_0042_0042 0.94    188 643 179 746

Indeed : for the cluster1: seq1_0035_0035 vs seq13_0042_0035 has his reversed seq13_0042_0035 seq1_0035_0035 but I only keep the first one.

for the cluster2: seq4_0042_0035 vs seq7_0035_0042 (0.83) has a better pident score than seq4_0042_0035 vs seq2_0035_0035 (0.71)

for the cluster4: seq11_0035_0035 vs seq17_0042_0042 (0.97) has a better pident score than seq11_0035_0035 vs seq14_0042_0035 (0.89)

for the custer5: There is only one paired sequence seq17_0035_0042 vs seq17_0042_0042 (0.94) , then I keep this one

I do not really know how to manage to do such a thing, someone has an idea?

part added:

Here is the script I used from thise small dataset (the same as in my example above): smalldata

blast=pd.read_csv("match.csv",header=0)

#blast=blast.drop(columns=[ "length", "mismatch", "gapopen", "evalue", "bitscore","pident"])

#Cluster Dataframe
cluster=pd.read_csv("cluster_test.csv",header=0)
cluster.columns = ["cluster_name", "seq_names"]

#Distance mean dataframe
dist=pd.read_csv("fnode.csv",header=0)
dist.columns = ["qseqid", "sseqid","pident","coverage"]
dist=dist.drop(columns=["coverage"])

#Including cluster information and distance mean information into one dataframe:
data = cluster.merge(dist, left_on='seq_names', right_on='qseqid')

#Adding for each two remaining dataframe a concatened colomn
data["name_concatened"] = data["qseqid"].map(str) + data["sseqid"]
blast["name_concatened"] = blast["qseqid"].map(str) + blast["sseqid"]
#We do not need these columns anymore
blast=blast.drop(columns=[ "qseqid","sseqid"])

#Including cluster information + distance mean information  + coordinate sequences from blast into one dataframe:
data = data.merge(blast, left_on='name_concatened', right_on='name_concatened')
data=data.drop(columns=[ "seq_names","name_concatened","pident_y"])

print(data)

this = data[["qseqid", "sseqid"]].apply(tuple, axis=1)
cum = pd.get_dummies(data[["sseqid", 'qseqid']].apply(tuple, axis=1)).cumsum()

this_zeros = pd.get_dummies(this)
this_zeros[:] = 0
pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1)
keep = pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1).lookup(data.index, this)

data=data[keep.astype(bool)]

print(data)

But as you can see here I only get:

  cluster_name           qseqid          sseqid  pident_x  qstart  qend  \
4             1  seq13_0042_0035  seq1_0035_0035      0.73      46   189   
5             1  seq13_0042_0035  seq1_0035_0035      0.73     287   389   
6             1  seq13_0042_0035  seq1_0035_0035      0.73     402   503   

   sstar  send  
4     42   133  
5    146   283  
6    301   478

and I should get:

cluster_name    qseqid  sseqid  pident_x    qstart  qend    sstar   send
        2   1   seq1_0035_0035  seq13_0042_0035 0.73    42  133 46  189
        3   1   seq1_0035_0035  seq13_0042_0035 0.73    146 283 287 389
        4   1   seq1_0035_0035  seq13_0042_0035 0.73    301 478 402 503
        10  2   seq4_0042_0035  seq7_0035_0042  0.83    123 745 156 723
        13  4   seq11_0035_0035 seq17_0042_0042 0.97    148 623 241 1002
        14  5   seq17_0035_0042 seq17_0042_0042 0.94    188 643 179 746

Here are my real data: datas

here is you exemple:

df = pd.DataFrame({'a': [1, 1, 4, 5, 2, 5], 'b': [7, 5, 2, 1, 4, 2]})
this = df[['a', 'b']].apply(tuple, axis=1)
cum = pd.get_dummies(df[['b', 'a']].apply(tuple, axis=1)).cumsum()
this_zeros = pd.get_dummies(this)
this_zeros[:] = 0
pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1)
keep = pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1).lookup(df.index, this)
df=df[keep.astype(bool)]
print(df)

and my result:

 a  b
3  5  1
4  2  4

Upvotes: 3

Filtering a dataframe of BLAST sequences to get within each cluster the maximum pident_x

Answers (2)

Related Questions