Alex Nesta

Reputation: 413

pandas - Iterate over data frame rows, modify them, and rebuild a data frame in a for loop

I may be making this a lot harder than it has to be.

My data frame looks like this:

CHROMOSOME START END
CHR1       100   200
CHR2       300   400

My goal is to build a data frame from this with 4 rows that looks like this:

CHROMOSOME START END LABEL
CHR1       150   250 ROW_1_A
CHR1       170   270 ROW_1_B
CHR2       350   300 ROW_2_A
CHR2       370   400 ROW_2_B

So I need to take each row, split it into an A row and a B row, modify START and END in each, label the row A or B, and rebuild the results into a single data frame.

Here is my function to split, modify, and label a single row.

def getcoordinates(df, awindow = 500, bwindow = 500):

    index = df[0]
    chromosome = df[1]
    start = df[2]
    end = df[3]
    sv_length = df[8]

    track = {'CHROMOSOME': chromosome,
            'START': start,
            'END': end}

    track = pd.DataFrame(data=track, index=[0])

    trackA = track.copy()
    trackB = track.copy()

    trackA = trackA.assign(LABEL = ("AVN_DEL_" + str(index) + "_A"))
    trackB = trackB.assign(LABEL = ("AVN_DEL_" + str(index) + "_B"))

    trackA = trackA.assign(END = trackA["START"])
    trackA = trackA.assign(START = trackA["START"] - awindow)

    trackB = trackB.assign(START = trackB["END"])
    trackB = trackB.assign(END = trackB["END"] + bwindow)

    return trackA.append(trackB)

Here is my for loop to perform this on each row in the data frame and reassemble the results.

appended_data = []
for row in SV.itertuples():
    print(row)
    out = getcoordinates(row)
    appended_data.append(out)

appended_data = pd.concat(appended_data, axis=1)

Here is the output of that loop when run on my actual data.

Pandas(Index=0, CHROMOSOME=u'chr1', START=56365453, END=56369289, SV_TYPE=u'DEL', CALLERS=u'GROM;delly;manta;lumpy', LEFT_JUNCTION=u'L1M', RIGHT_JUNCTION=u'L1M', SV_LENGTH=3836, _9=u'DGV', FULL_INFO_ABOUT_ME=u'4_L1MC4_56365281_56365445_92_2.4;L1HS_56365452_56369282_101_2.63;L1HS_56365452_56369282_93_2.42;L1MC4_56369289_56369625_100_2.61')
Pandas(Index=1, CHROMOSOME=u'chr1', START=75645801, END=79014667, SV_TYPE=u'DEL', CALLERS=u'GROM;manta;lumpy', LEFT_JUNCTION=u'L1P', RIGHT_JUNCTION=u'L1P', SV_LENGTH=3368866, _9=u' ', FULL_INFO_ABOUT_ME=u'2_L1PA5_75644642_75646421_300_0.01;L1PA4_79013861_79016088_300_0.01')
appended_data.head()
  CHROMOSOME       END     START     ...            END     START        LABEL
0       chr1  56365453  56364953     ...       75645801  75645301  AVN_DEL_1_A
0       chr1  56369789  56369289     ...       79015167  79014667  AVN_DEL_1_B

Notice how the rows were joined together incorrectly in the final result. I think this is due to this line in the getcoordinates function:

track = pd.DataFrame(data=track, index=[0])

I want to set the index to the variable index that I get when I turn each data frame row into a tuple, but I keep getting the error:

ValueError: Shape of passed values is (8, 6), indices imply (8, 4)

I am having a difficult time transitioning from the R tidyverse to pandas, so please go easy on me.

Upvotes: 0

Views: 244

Answers (1)

Kumar

Reputation: 776

Not sure if this is the best way, but it can be done by defining a function like the one below, which creates two new rows for every row in the old df:

def get_new(row, awindow, bwindow):                                
    new_row_A = {}         
    new_row_A['CHROMOSOME'] = row['CHROMOSOME']                        
    new_row_A['START'] = row['START']-awindow
    new_row_A['END'] = row['START']
    new_row_A['LABEL'] = 'AVN_DEL_'+str(row.name)+'_A'
    new_row_B = {}
    new_row_B['CHROMOSOME'] = row['CHROMOSOME']
    new_row_B['START'] = row['END']
    new_row_B['END'] = row['END']+bwindow
    new_row_B['LABEL'] = 'AVN_DEL_'+str(row.name)+'_B'
    return [new_row_A, new_row_B]

Then call this function on every row as below:

awindow = 500
bwindow = 500
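rows = []
for new_row in df.apply(lambda row: get_new(row, awindow, bwindow), axis=1):
    rows.extend(new_row)  # each result is a list of two dicts: the A row and the B row
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
new_df = pd.DataFrame(rows)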
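
If a per-row function is not required, the same kind of result can probably be produced without a loop by building the A rows and the B rows as two whole frames and stacking them. The following is only a minimal sketch, assuming the column names CHROMOSOME, START and END from the question; `split_flanks`, its `prefix` parameter, and the small `example` frame are hypothetical names made up for illustration, not part of pandas. Note that the frames are stacked with `axis=0`; the `axis=1` used in the question places frames side by side, which is what produced the repeated END and START columns in the output shown there.

```python
import pandas as pd

def split_flanks(df, awindow=500, bwindow=500, prefix="AVN_DEL_"):
    """Hypothetical helper: return an A row and a B row for each input row."""
    base = df[["CHROMOSOME", "START", "END"]]
    # A row: the window of size awindow that ends at the original START
    a = base.assign(
        START=base["START"] - awindow,
        END=base["START"],
        LABEL=[f"{prefix}{i}_A" for i in base.index],
    )
    # B row: the window of size bwindow that begins at the original END
    b = base.assign(
        START=base["END"],
        END=base["END"] + bwindow,
        LABEL=[f"{prefix}{i}_B" for i in base.index],
    )
    # axis=0 stacks rows; a stable sort on the original index interleaves A and B
    out = pd.concat([a, b], axis=0).sort_index(kind="mergesort")
    return out.reset_index(drop=True)

# Small frame shaped like the one in the question (values are illustrative)
example = pd.DataFrame({
    "CHROMOSOME": ["CHR1", "CHR2"],
    "START": [100, 300],
    "END": [200, 400],
})
flanks = split_flanks(example, awindow=50, bwindow=50)
print(flanks)
```

Labels here are built from the data frame's index, as in the code above, so a frame with a non-default index would get labels based on those index values.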

Upvotes: 1
