ARJ
ARJ

Reputation: 2080

Changing the columns in DataFrame with respect to values in other columns

I have a data Frame which looks like this,

Head    CHR Start   End Trans   Num 
A   1   29554   30039   ENST473358  1 
A   1   30564   30667   ENST473358  2 
A   1   30976   31097   ENST473358  3 
B   1   36091   35267   ENST417324  1 
B   1   35491   34544   ENST417324  2 
B   1   35184   35711   ENST417324  3 
B   1   36083   35235   ENST461467  1 
B   1   35491   120765  ENST461467  2

And I need to change the Column Start and End with respect to column Trans and Num. Means, the column Trans has values which are repeating which is mentioned in column Num. And so on. Means I want to change Start as -End+10 and End as- Start from next row (which has same Trans) -10 and so on for all rows .So what I am aiming is to get an output which looks like follows,

 Head  CHR   Start  End       Trans    Num 
    A   1   30564   30667   ENST473358  1
    A   1   30976   31097   ENST473358  2
    A   1   30267   NA      ENST473358  3
    B   1   35277   35481   ENST417324  1
    B   1   34554   35174   ENST417324  2
    B   1   35721   NA      ENST417324  3
    B   1   35245   35481   ENST461467  1
    B   1   120775  NA      ENST461467  2

Any help is much appreciated I could do it without considering the Trans with the following script, but I won't get my desired output.

start = df['Start'].copy()
df['Start'] = df.End + 10
df['End'] = ((start.shift(-1) - 10))
df.iloc[-1, df.columns.get_loc('Start')] = ''
df.iloc[-1, df.columns.get_loc('End')] = ''
print (df)

Upvotes: 0

Views: 86

Answers (2)

Michael Lampe
Michael Lampe

Reputation: 674

You may want to consider reindexing your data based on how you want to utilize it.

You can index your data based on the column "Trans" and "Num" like so:

#Change how we index the frame
df.set_index(["Trans", "Num"], inplace=True)

Next, we'll grab each index that is unique so we can replace them all (I'm pretty sure this part and the iteration below can be done in bulk, but I just did this quickly. If you are having efficiency problems look into how to not not loop over all the indexes probably.)

#Get only unique indexes
unique_trans = list(set(df.index.get_level_values('Trans')))

Then we can iterate through and apply what you want.

# Access each index
for trans in unique_trans:

    # Get the higher number in "Num" for each so we know which to set to NaN
    max_num = max(df.ix[trans].index.values)

    # Copy your start column as a temp variable
    start = df.ix[trans]["Start"].copy()

    # Apply the transform to the start column (Equal to end + 10)        
    df.loc[trans, "Start"] = np.array(df.ix[trans]["End"]) + 10

    # Apply the transform to the end column
    df.loc[trans, "End"] = np.array(start.shift(-1) - 10)

    # By passing a tuple as a row index, we get the element that is both in trans and the max number, 
    #which is the one you want to set to NaN
    df.loc[(trans, max_num), "End"] = np.nan

print(df)

The result I got from this when running your data was:

                Head  Chr     Start      End
Trans      Num                             
ENST473358 1      A    1   30049.0  30554.0
           2      A    1   30677.0  30966.0
           3      A    1   31107.0      NaN
ENST417324 1      B    1   35277.0  35481.0
           2      B    1   34554.0  35174.0
           3      B    1   35721.0      NaN
ENST461467 1      B    1   35245.0  35481.0
           2      B    1  120775.0      NaN

The full code I used to generate your test case is this:

import pandas as pd
import numpy as np
# Setup your dataframe
df = pd.DataFrame(columns=["Head", "Chr", "Start", "End", "Trans", "Num"])
df["Head"] = ["A", "A", "A", "B", "B", "B", "B", "B"]
df["Chr"] = [1]*8
df["Start"] = [29554, 30564, 30976, 36091, 35491, 35184, 36083, 35491]
df["End"] = [30039, 30667, 31097, 35267, 34544, 35711, 35235, 120765]
df["Trans"] = ["ENST473358", "ENST473358", "ENST473358",
               "ENST417324", "ENST417324", "ENST417324",
               "ENST461467","ENST461467"]
df["Num"] = [1, 2, 3, 1, 2, 3, 1, 2]

# Change how we index the frame
df.set_index(["Trans", "Num"], inplace=True)

# Get only unique indexes
unique_trans = list(set(df.index.get_level_values('Trans')))

# Access each index
for trans in unique_trans:
    max_num = max(df.ix[trans].index.values)

    start = df.ix[trans]["Start"].copy()
    df.loc[trans, "Start"] = np.array(df.ix[trans]["End"]) + 10
    df.loc[trans, "End"] = np.array(start.shift(-1) - 10)
    df.loc[(trans, max_num), "End"] = np.nan

print(df)

Upvotes: 2

IanS
IanS

Reputation: 16241

You can put your existing code in a function, then group by Trans and apply the function:

def func(df):
    start = df['Start'].copy()
    df['Start'] = df.End + 10
    df['End'] = ((start.shift(-1) - 10))
    df.iloc[-1, df.columns.get_loc('Start')] = ''
    df.iloc[-1, df.columns.get_loc('End')] = ''
    return df

df.groupby('Trans').apply(func)

Result:

  Head  CHR  Start     End       Trans  Num
0    A    1  30677   30966  ENST473358    1
1    A    1  31107   30257  ENST473358    2
2    A    1                 ENST473358    3
3    B    1  35491   34544  ENST417324    1
4    B    1  35184   35711  ENST417324    2
5    B    1                 ENST417324    3
6    B    1  35491  120765  ENST461467    1
7    B    1                 ENST461467    2

Upvotes: 1

Related Questions