Reputation: 413
I may be making this a lot harder than it has to be.
My data frame looks like this:
CHROMOSOME START END
CHR1 100 200
CHR2 300 400
My goal is to make a data frame from this with 4 rows that looks like this:
CHROMOSOME START END LABEL
CHR1 150 250 ROW_1_A
CHR1 170 270 ROW_1_B
CHR2 350 300 ROW_2_A
CHR2 370 400 ROW_2_B
So I need to take each row, split it into A and B, and modify start and end, and then label the row A or B and rebuild this back into a data frame.
Here is my function to split, modify, and label a single row.
    def getcoordinates(df, awindow = 500, bwindow = 500):
        index = df[0]
        chromosome = df[1]
        start = df[2]
        end = df[3]
        sv_length = df[8]
        track = {'CHROMOSOME': chromosome,
                 'START': start,
                 'END': end}
        track = pd.DataFrame(data=track, index=[0])
        trackA = track.copy()
        trackB = track.copy()
        trackA = trackA.assign(LABEL = ("AVN_DEL_" + str(index) + "_A"))
        trackB = trackB.assign(LABEL = ("AVN_DEL_" + str(index) + "_B"))
        trackA = trackA.assign(END = trackA["START"])
        trackA = trackA.assign(START = trackA["START"] - awindow)
        trackB = trackB.assign(START = trackB["END"])
        trackB = trackB.assign(END = trackB["END"] + bwindow)
        return trackA.append(trackB)
Here is my for loop to perform this on each row in the data frame and reassemble:
    appended_data = []
    for row in SV.itertuples():
        print(row)
        out = getcoordinates(row)
        appended_data.append(out)
    appended_data = pd.concat(appended_data, axis=1)
Here is the output when I run it:
Pandas(Index=0, CHROMOSOME=u'chr1', START=56365453, END=56369289, SV_TYPE=u'DEL', CALLERS=u'GROM;delly;manta;lumpy', LEFT_JUNCTION=u'L1M', RIGHT_JUNCTION=u'L1M', SV_LENGTH=3836, _9=u'DGV', FULL_INFO_ABOUT_ME=u'4_L1MC4_56365281_56365445_92_2.4;L1HS_56365452_56369282_101_2.63;L1HS_56365452_56369282_93_2.42;L1MC4_56369289_56369625_100_2.61')
Pandas(Index=1, CHROMOSOME=u'chr1', START=75645801, END=79014667, SV_TYPE=u'DEL', CALLERS=u'GROM;manta;lumpy', LEFT_JUNCTION=u'L1P', RIGHT_JUNCTION=u'L1P', SV_LENGTH=3368866, _9=u' ', FULL_INFO_ABOUT_ME=u'2_L1PA5_75644642_75646421_300_0.01;L1PA4_79013861_79016088_300_0.01')
appended_data.head()
CHROMOSOME END START ... END START LABEL
0 chr1 56365453 56364953 ... 75645801 75645301 AVN_DEL_1_A
0 chr1 56369789 56369289 ... 79015167 79014667 AVN_DEL_1_B
Notice how the rows were joined together incorrectly in the final result. I think this is due to this line in the getcoordinates function:
track = pd.DataFrame(data=track, index=[0])
I want to set the index to the variable index that I get when I turn each data frame row into a tuple, but I keep getting the error:
ValueError: Shape of passed values is (8, 6), indices imply (8, 4)
I am having a difficult time transitioning from R tidyverse to pandas, so please go easy on me.
Upvotes: 0
Views: 244
Reputation: 776
Not sure if this is the best way, but it can be done by defining a function like the one below, which creates 2 new rows for every row in the old df:
    def get_new(row, awindow, bwindow):
        # Row A: the window of length awindow that ends at the original START
        new_row_A = {}
        new_row_A['CHROMOSOME'] = row['CHROMOSOME']
        new_row_A['START'] = row['START']-awindow
        new_row_A['END'] = row['START']
        new_row_A['LABEL'] = 'AVN_DEL_'+str(row.name)+'_A'
        # Row B: the window of length bwindow that starts at the original END
        new_row_B = {}
        new_row_B['CHROMOSOME'] = row['CHROMOSOME']
        new_row_B['START'] = row['END']
        new_row_B['END'] = row['END']+bwindow
        new_row_B['LABEL'] = 'AVN_DEL_'+str(row.name)+'_B'
        return [new_row_A, new_row_B]
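Then call this function on every row as below:
    import pandas as pd

    awindow = 500
    bwindow = 500
    # Collect the 2-row frames in a list and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0)
    frames = []
    for new_rows in df.apply(lambda row: get_new(row, awindow, bwindow), axis=1):
        frames.append(pd.DataFrame(new_rows))
    # ignore_index=True renumbers the rows 0..n-1
    new_df = pd.concat(frames, ignore_index=True)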
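A loop-free variant is also possible. This is only a minimal sketch, assuming the column names from the question and a default integer index; `flank_windows`, `prefix`, and the small example frame are names and values made up here for illustration.

```python
import pandas as pd


def flank_windows(df, awindow=500, bwindow=500, prefix="AVN_DEL_"):
    """Build an A row and a B row for every input row (hypothetical helper)."""
    # A rows: window of length awindow ending at the original START
    a = df[["CHROMOSOME"]].assign(
        START=df["START"] - awindow,
        END=df["START"],
        LABEL=[f"{prefix}{i}_A" for i in df.index],
    )
    # B rows: window of length bwindow starting at the original END
    b = df[["CHROMOSOME"]].assign(
        START=df["END"],
        END=df["END"] + bwindow,
        LABEL=[f"{prefix}{i}_B" for i in df.index],
    )
    # Stack along rows (the default axis=0), not side by side (axis=1)
    stacked = pd.concat([a, b])
    # A stable sort on the original index puts each A row directly before its B row
    return stacked.sort_index(kind="mergesort").reset_index(drop=True)


# Made-up example data, not the data from the question
df = pd.DataFrame({
    "CHROMOSOME": ["chr1", "chr2"],
    "START": [1000, 3000],
    "END": [2000, 4000],
})
out = flank_windows(df)
print(out)
```

Note that `pd.concat` is called with its default `axis=0` here, so the frames are stacked as rows. Passing `axis=1`, as in the question, places the frames next to each other as extra columns, which would match the repeated END and START columns in the output shown there.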
Upvotes: 1