Reputation: 138
For example, for the two columns
target read
AATGGCATC AATGGCATG
AATGATATA AAGGATATA
AATGATGTA CATGATGTA
I want to add the column
target read differences
AATGGCATC AATGGCATG (C,G,8)
AATGATATA AAGGATATA (T,G,2)
AATGATGTA CATGATGTA (A,C,0)
Upvotes: 2
Views: 338
Reputation: 23099
Let's split each string into its individual characters (dropping the empty strings the split leaves at either end) and create a stacked DataFrame. There we can number each character's position with a cumulative count, discard every (row, character, position) combination that appears in both columns, and finally build our tuple from what is left.
The key functions here are explode, str.split, stack and duplicated.
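import numpy as np
import pandas as pd

# one row per character, indexed by (row label, column name)
s = (
    df.stack()
    .str.split("")
    .explode()
    .to_frame("words")
    .replace("", np.nan)  # the split leaves empty strings at both ends
    .dropna()
)
# position of each character within its own string
s['enum'] = s.groupby(level=[0,1]).cumcount()
# keep only the characters that differ between target and read
df["diff"] = (
    s.reset_index(0)[
        ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)
    ]
    .groupby("level_0")
    .agg(words=("words", ",".join), pos=("enum", "first"))
    .agg(tuple, axis=1)
)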
print(df)
target read diff
0 AATGGCATC AATGGCATG (C,G, 8)
1 AATGATATA AAGGATATA (T,G, 2)
2 AATGATGTA CATGATGTA (A,C, 0)
print(s.reset_index(0)[
    ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)])
level_0 words enum
target 0 C 8
read 0 G 8
target 1 T 2
read 1 G 2
target 2 A 0
read 2 C 0
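The positional comparison described above can also be sketched for a single pair of strings in plain Python. This is only an illustration, not the stacked approach itself: the helper name first_difference is hypothetical, and it assumes both strings have the same length and differ by substitution only.

```python
def first_difference(target, read):
    """Return (target_char, read_char, position) for the first mismatch, else None."""
    # zip pairs characters at the same position; strip() drops stray whitespace
    for pos, (t, r) in enumerate(zip(target.strip(), read.strip())):
        if t != r:
            return (t, r, pos)
    return None


# the three example pairs from the question
pairs = [
    ("AATGGCATC", "AATGGCATG"),
    ("AATGATATA", "AAGGATATA"),
    ("AATGATGTA", "CATGATGTA"),
]
results = [first_difference(t, r) for t, r in pairs]
print(results)  # [('C', 'G', 8), ('T', 'G', 2), ('A', 'C', 0)]
```

On the DataFrame this could be applied with a list comprehension over zip(df["target"], df["read"]).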
Upvotes: 2
Reputation: 156
I think this simple function might help you (keep in mind that this is not a vectorised way of doing it):
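import pandas as pd
import difflib as dl

# df refers to a DataFrame with string columns 'target' and 'read'
# pass the two columns as arguments to the function below
def differences(a, b):
    result = []
    # zip avoids label-based lookups, so any index works
    for target, read in zip(a, b):
        diff = list(dl.ndiff(target.strip(), read.strip()))
        # characters that appear in only one of the two strings
        temp = [x[2] for x in diff if x[0] != ' ']
        # positions of the '-' and '+' entries in the diff list
        temp.extend(i for i, x in enumerate(diff) if x[0] in '-+')
        # (target char, read char, first position)
        result.append(tuple(temp[:3]))
    return result

df['differences'] = differences(df['target'], df['read'])
print(df)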
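To see why an index into the ndiff output can serve as the position, it may help to look at what difflib.ndiff yields for one of the pairs. A small sketch using only the standard library; the variable names are illustrative:

```python
import difflib

target, read = "AATGATATA", "AAGGATATA"

# each entry is a two-character marker ('  ', '- ' or '+ ') followed by one character
diff = list(difflib.ndiff(target, read))

# keep only the entries that mark a removed or an added character
marked = [(i, entry) for i, entry in enumerate(diff) if entry[0] in "-+"]
print(marked)
```

The '- T' entry sits at index 2, which is also the position of the substitution in the strings. The matching '+ G' entry sits one index later, so only the first index is a reliable position; after an inserted entry the list indexes run ahead of the string positions.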
Upvotes: 1