Reputation: 597
I have a problem, I need to parse the following dataframe:
cluster_name qseqid sseqid pident_x qstart qend sstar send
2 1 seq1_0035_0035 seq13_0042_0035 0.73 42 133 46 189
3 1 seq1_0035_0035 seq13_0042_0035 0.73 146 283 287 389
4 1 seq1_0035_0035 seq13_0042_0035 0.73 301 478 402 503
5 1 seq13_0042_0035 seq1_0035_0035 0.73 46 189 42 133
6 1 seq13_0042_0035 seq1_0035_0035 0.73 287 389 146 283
7 1 seq13_0042_0035 seq1_0035_0035 0.73 402 503 301 478
8 2 seq4_0042_0035 seq2_0035_0035 0.71 256 789 125 678
9 2 seq4_0042_0035 seq2_0035_0035 0.71 802 1056 706 985
10 2 seq4_0042_0035 seq7_0035_0042 0.83 123 745 156 723
12 4 seq11_0035_0035 seq14_0042_0035 0.89 145 647 236 921
13 4 seq11_0035_0035 seq17_0042_0042 0.97 148 623 241 1002
14 5 seq17_0035_0042 seq17_0042_0042 0.94 188 643 179 746
Explanation of the dataframe and blast output:
cluster_name :
is the cluster where one, two or more paired sequences are present.qseqid
: is the name of one sequencesseqid
: is the name of another sequence. These make one comparison qseqid vs sseqid pident_x
: is the score after the comparison (alignment) of these two sequences, 1 means that they are identical.
When blast aligns the two sequence, it gives me coordinates of where my sequences are aligned ("homologous") in my alignment, for example if I have:
10 24
seq1 : AAATTTCCCGGGATGCGATGACGATGAAAAAATTTGG xxxxxxxxx!!!!!!!!!!!!!!!xxxxxxxxxxxxx seq2 : GATGAGATCGGGATGCGATGAGGAGATAGAGATAGAG where x is a difference and ! is a match, blast will give me: qstart (start of the first seq) : 10 qend (endof the first seq) : 24 sstar (start of the second seq) : 10 send (end of the second seq) : 24
Note: this is an example but it does not necessarily begins at 0.
and what I actually want is only get within each cluster the maximum pident_x but the issue is that as you can see I can have reversed sequences (if you take a look at the 2,3,4 and 5,6,7 they are the same but reversed) and what I need to do is to keep only one for exemple only the line 2,3 and 4 because blast will compare every sequence, even reciprocals ones.
The output would be then:
cluster_name qseqid sseqid pident_x qstart qend sstar send
2 1 seq1_0035_0035 seq13_0042_0035 0.73 42 133 46 189
3 1 seq1_0035_0035 seq13_0042_0035 0.73 146 283 287 389
4 1 seq1_0035_0035 seq13_0042_0035 0.73 301 478 402 503
10 2 seq4_0042_0035 seq7_0035_0042 0.83 123 745 156 723
13 4 seq11_0035_0035 seq17_0042_0042 0.97 148 623 241 1002
14 5 seq17_0035_0042 seq17_0042_0042 0.94 188 643 179 746
Indeed :
for the cluster1:
seq1_0035_0035 vs seq13_0042_0035
has his reversed seq13_0042_0035 seq1_0035_0035
but I only keep the first one.
for the cluster2:
seq4_0042_0035 vs seq7_0035_0042 (0.83)
has a better pident score than seq4_0042_0035 vs seq2_0035_0035 (0.71)
for the cluster4:
seq11_0035_0035 vs seq17_0042_0042 (0.97)
has a better pident score than seq11_0035_0035 vs seq14_0042_0035 (0.89)
for the custer5:
There is only one paired sequence seq17_0035_0042 vs seq17_0042_0042
(0.94) , then I keep this one
I do not really know how to manage to do such a thing, someone has an idea?
part added:
Here is the script I used from thise small dataset (the same as in my example above): smalldata
blast=pd.read_csv("match.csv",header=0)
#blast=blast.drop(columns=[ "length", "mismatch", "gapopen", "evalue", "bitscore","pident"])
#Cluster Dataframe
cluster=pd.read_csv("cluster_test.csv",header=0)
cluster.columns = ["cluster_name", "seq_names"]
#Distance mean dataframe
dist=pd.read_csv("fnode.csv",header=0)
dist.columns = ["qseqid", "sseqid","pident","coverage"]
dist=dist.drop(columns=["coverage"])
#Including cluster information and distance mean information into one dataframe:
data = cluster.merge(dist, left_on='seq_names', right_on='qseqid')
#Adding for each two remaining dataframe a concatened colomn
data["name_concatened"] = data["qseqid"].map(str) + data["sseqid"]
blast["name_concatened"] = blast["qseqid"].map(str) + blast["sseqid"]
#We do not need these columns anymore
blast=blast.drop(columns=[ "qseqid","sseqid"])
#Including cluster information + distance mean information + coordinate sequences from blast into one dataframe:
data = data.merge(blast, left_on='name_concatened', right_on='name_concatened')
data=data.drop(columns=[ "seq_names","name_concatened","pident_y"])
print(data)
this = data[["qseqid", "sseqid"]].apply(tuple, axis=1)
cum = pd.get_dummies(data[["sseqid", 'qseqid']].apply(tuple, axis=1)).cumsum()
this_zeros = pd.get_dummies(this)
this_zeros[:] = 0
pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1)
keep = pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1).lookup(data.index, this)
data=data[keep.astype(bool)]
print(data)
But as you can see here I only get:
cluster_name qseqid sseqid pident_x qstart qend \
4 1 seq13_0042_0035 seq1_0035_0035 0.73 46 189
5 1 seq13_0042_0035 seq1_0035_0035 0.73 287 389
6 1 seq13_0042_0035 seq1_0035_0035 0.73 402 503
sstar send
4 42 133
5 146 283
6 301 478
and I should get:
cluster_name qseqid sseqid pident_x qstart qend sstar send
2 1 seq1_0035_0035 seq13_0042_0035 0.73 42 133 46 189
3 1 seq1_0035_0035 seq13_0042_0035 0.73 146 283 287 389
4 1 seq1_0035_0035 seq13_0042_0035 0.73 301 478 402 503
10 2 seq4_0042_0035 seq7_0035_0042 0.83 123 745 156 723
13 4 seq11_0035_0035 seq17_0042_0042 0.97 148 623 241 1002
14 5 seq17_0035_0042 seq17_0042_0042 0.94 188 643 179 746
Here are my real data: datas
here is you exemple:
df = pd.DataFrame({'a': [1, 1, 4, 5, 2, 5], 'b': [7, 5, 2, 1, 4, 2]})
this = df[['a', 'b']].apply(tuple, axis=1)
cum = pd.get_dummies(df[['b', 'a']].apply(tuple, axis=1)).cumsum()
this_zeros = pd.get_dummies(this)
this_zeros[:] = 0
pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1)
keep = pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1).lookup(df.index, this)
df=df[keep.astype(bool)]
print(df)
and my result:
a b
3 5 1
4 2 4
Upvotes: 3
Views: 482
Reputation: 76297
Here is another answer, hopefully more efficient.
I'll focus on the main point in your question: seeing if the reverse pair in a couple of columns occurred in a previous row. As an example, I'll use
df = pd.DataFrame({'a': [1, 1, 4, 5, 2, 5], 'b': [7, 5, 2, 1, 4, 2]})
>>> df
a b
0 1 7
1 1 5
2 4 2
3 5 1
4 2 4
5 5 2
Let's find the tuples of each row:
this = df[['a', 'b']].apply(tuple, axis=1)
>>> this
0 (1, 7)
1 (1, 5)
2 (4, 2)
3 (5, 1)
4 (2, 4)
5 (5, 2)
dtype: object
Here is the cumulative sum of reverse tuples' appearances:
cum = pd.get_dummies(df[['b', 'a']].apply(tuple, axis=1)).cumsum()
>>> cum
(1, 5) (2, 4) (2, 5) (4, 2) (5, 1) (7, 1)
0 0.0 0.0 0.0 0.0 1.0 1.0
1 0.0 1.0 0.0 0.0 1.0 1.0
2 1.0 1.0 0.0 0.0 1.0 1.0
3 1.0 1.0 0.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN NaN NaN NaN NaN NaN
To this we need to show that the tuple of the current row, if it didn't exist in reverse, was not found:
this_zeros = pd.get_dummies(this)
this_zeros[:] = 0
>>> pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1)
(1.0, 5.0) (2.0, 4.0) (2.0, 5.0) (4.0, 2.0) (5.0, 1.0) (nan, nan) (1, 7) (2, 2) (5, 2)
0 0.0 0.0 0.0 0.0 1.0 0.0 0 0 0
1 0.0 1.0 0.0 0.0 1.0 0.0 0 0 0
2 1.0 1.0 0.0 0.0 1.0 0.0 0 0 0
3 1.0 1.0 0.0 1.0 1.0 0.0 0 0 0
4 1.0 1.0 1.0 1.0 1.0 0.0 0 0 0
5 1.0 1.0 1.0 1.0 1.0 1.0 0 0 0
6 NaN NaN NaN NaN NaN NaN 0 0 0
Now we can use this DataFrame to look up the current tuple:
keep = pd.concat([cum, this_zeros[this_zeros.columns.difference(cum.columns)]], axis=1).lookup(df.index, this)
We should keep in the original DataFrame
>>> df[keep.astype(bool)]
a b
0 1 7
1 1 5
2 4 2
5 5 2
Upvotes: 0
Reputation: 76297
If you create a tuple out of the columns, then perform a cumulative sum, you can check if the reversed pair already appears in the cumulative sum:
df[~pd.DataFrame({
'tup': df[['sseqid', 'qseqid']].apply(tuple, axis=1),
'inv_tups': df[['qseqid', 'sseqid']].apply(lambda t: (tuple(t), ), axis=1).cumsum().shift(1)}
).apply(lambda r: isinstance(r.inv_tups, tuple) and r.tup in r.inv_tups, axis=1)]
df[['sseqid', 'qseqid']].apply(tuple, axis=1)
creates tuples out of the columns.
df[['qseqid', 'sseqid']].apply(lambda t: (tuple(t), ), axis=1).cumsum().shift(1)
creates inverse tuples, and cumulatively sums them in tuples
Upvotes: 1