Reputation: 524
I am looking for a fast way to combine columns that are genetic complements of each other. I have a large data frame of counts and want to combine the columns whose names are complements. I currently have a system that checks every column name for its complement (pseudocode below).
However, this is slow (it checks every column name) and gives different column names depending on the column order (i.e., it deletes different complement columns between runs). Is there a way to use dictionary key:value pairs to speed up the process and keep the output consistent? An example data frame and the desired result are below (ATTG|TAAC and CGGG|GCCC are complements).
import pandas as pd
df = pd.DataFrame({"ATTG": [3, 6, 0, 1], "CGGG": [0, 2, 1, 4],
                   "TAAC": [0, 1, 0, 1], "GCCC": [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
## Current Pseudocode
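# complement(name) is assumed to return the base-wise complement (A<->T, C<->G)
for item in list(df.columns):  # iterate over a copy, since columns are deleted below
    partner = complement(item)
    # skip columns that were already merged into their partner and deleted
    if item in df.columns and partner in df.columns:
        df[item] = df[item] + df[partner]
        del df[partner]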
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2],"CGGG" : [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})
Upvotes: 1
Views: 526
Reputation: 59549
Translate each column name to its complement, then rename each column to whichever of the two (the original name or its complement) sorts first. Complementary columns then share a name, so you can group on the column names and sum.
import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
df.groupby(level=0, axis=1).sum()
#   AAAA  ATTG  CGGG
#0     2     3     4
#1     1     7     4
#2     0     0     1
#3     1     2     4
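Note that the output above renames TTTT to AAAA, because the alphabetically first name of each pair is kept. If you would rather keep the name that appears first in the frame (as in the desired result) and use a dictionary as the question suggests, something along these lines might work. This is only a sketch: `COMPLEMENT`, `combine_complements` and `canonical` are names I made up for illustration, and it assumes column names are unique. It groups on the transposed frame because `groupby(..., axis=1)` is deprecated in recent pandas versions.

```python
import pandas as pd

# Per-base complement table (A<->T, C<->G), same mapping as the translation above.
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def combine_complements(df: pd.DataFrame) -> pd.DataFrame:
    """Sum each column with its complement column, keeping the first-seen name."""
    canonical = {}  # column name -> name of the group it belongs to
    for name in df.columns:
        partner = name.translate(COMPLEMENT)
        # If the complement was seen earlier, reuse its group name;
        # otherwise this column starts a new group under its own name.
        canonical[name] = canonical.get(partner, name)
    # Group rows of the transposed frame by the dictionary;
    # sort=False keeps the groups in order of first appearance.
    return df.T.groupby(canonical, sort=False).sum().T

df = pd.DataFrame({"ATTG": [3, 6, 0, 1], "CGGG": [0, 2, 1, 4],
                   "TAAC": [0, 1, 0, 1], "GCCC": [4, 2, 0, 0],
                   "TTTT": [2, 1, 0, 1]})
result = combine_complements(df)
print(result)
```

Each column name is looked up once in the dictionary, so the cost of building the mapping grows linearly with the number of columns, and the kept name depends only on which member of a pair comes first in the frame.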
Upvotes: 1