ARJ

Reputation: 2080

Append two new columns to a dataframe using values from an existing column, based on conditions

Say I have a data frame which looks like this:

 df.head()
    ID  col1    col2    col3    col4    type
    1   146       91      Rp    Rp-203  ex
    1   146      314      Rp    Rp-203  trans
    1   603       91      Rp    Rp-203  CDS
    1   910       81      Rp    Rp-203  CDS
    1   910       81      Rp    Rp-203  ex
    1   202      825      Rp    Rp-203  CDS
    1   202      837      Rp    Rp-203  ex
    1   200      314      Rp    Rp-203  ex

From the above data frame, I want to generate a new data frame, filtered on the condition that the type column equals ex. In addition, the new data frame should have two extra columns whose values are comma-separated lists built from col1 and col2.

That is, I want to generate two new columns, col5 and col6 (shown as ex_start and ex_end below), using the values from col1 and col2 for each value in the col4 and type columns. In other words, I want to group by the column col3 for the rows whose type is ex.

In the end, I am aiming to have a data frame like:

ID  col1    col2    col3    col4    ex_start    ex_end
1   146     314     Rp      Rp-203  091,081,837 910,202,200

I have tried the following solution:

df2 = df.loc[df['type']=='ex', ['col3', 'col1', 'col2']].groupby(['col3']).agg(
        lambda x: ','.join(str(y) for y in x))

However, my solution takes the first value of col1 as the first value of ex_start. Instead, I need the col2 value as the first value in the ex_start column of df2, the col1 value as the first value of the ex_end column in df2, and so on. Also, the columns col1 and col2 in df2 should take their values from the df columns col1 and col2 where the type column equals trans.
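Concretely (the zero-padding in the desired output aside), the pairing I am after can be sketched in plain Python, using the (col1, col2) pairs of the ex rows:

```python
# (col1, col2) pairs of the rows where type == 'ex', in their original order
ex_rows = [(146, 91), (910, 81), (202, 837), (200, 314)]

# each interval's start comes from a row's col2,
# its end from the *next* row's col1
ex_start = ",".join(str(c2) for _, c2 in ex_rows[:-1])  # col2 of every row but the last
ex_end = ",".join(str(c1) for c1, _ in ex_rows[1:])     # col1 of every row but the first

print(ex_start)  # 91,81,837
print(ex_end)    # 910,202,200
```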

Any help/suggestions are much appreciated!

Upvotes: 0

Views: 89

Answers (2)

Dev Khadka

Reputation: 5451

This is my approach, using groupby and a function to process each group:

def join(group):
    ex = group[["col1", "col2"]].copy().values
    # start from the group's first row (copy to avoid mutating the group)
    row = group.iloc[0].copy()
    # first col1 and last col2 become the group's col1/col2
    row[["col1", "col2"]] = (ex[0, 0], ex[-1, 1])
    # join the remaining values as comma-separated strings
    row["ex_start"] = ",".join(ex[1:, 0].astype(str))
    row["ex_end"] = ",".join(ex[:-1, 1].astype(str))

    return row

df.groupby("type").apply(join)

To get the ex row only, you can do:

df.groupby("type").apply(join).loc[["ex"]]

Output

      ID  col1  col2 col3    col4   type     ex_start     ex_end
type
CDS    1   603   825   Rp  Rp-203    CDS      910,202      91,81
ex     1   146   314   Rp  Rp-203     ex  910,202,200  91,81,837
trans  1   146   314   Rp  Rp-203  trans
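A self-contained version of the above, reconstructing the sample frame from the question (a sketch):

```python
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    "ID":   [1] * 8,
    "col1": [146, 146, 603, 910, 910, 202, 202, 200],
    "col2": [91, 314, 91, 81, 81, 825, 837, 314],
    "col3": ["Rp"] * 8,
    "col4": ["Rp-203"] * 8,
    "type": ["ex", "trans", "CDS", "CDS", "ex", "CDS", "ex", "ex"],
})

def join(group):
    ex = group[["col1", "col2"]].values
    row = group.iloc[0].copy()
    # first col1 / last col2 become the group's col1 / col2
    row[["col1", "col2"]] = (ex[0, 0], ex[-1, 1])
    # remaining values joined as comma-separated strings
    row["ex_start"] = ",".join(ex[1:, 0].astype(str))
    row["ex_end"] = ",".join(ex[:-1, 1].astype(str))
    return row

out = df.groupby("type").apply(join).loc[["ex"]]
print(out[["ex_start", "ex_end"]])
```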

Upvotes: 1

jezrael

Reputation: 862511

Use:

#filter only ex rows by type
df3 = df[df['type']=='ex'].copy()

#shift values per groups from list
df3['s'] = df3.groupby(['ID','col3', 'col4'])['col2'].shift()
#removed NaNs rows per start and convert values to int and strings
df3 = df3.dropna(subset=['s']).assign(ex_start = lambda x: x['s'].astype(int).astype(str),
                                      ex_end = lambda x: x['col1'].astype(str))
print (df3)
   ID  col1  col2 col3    col4 type      s ex_start ex_end
4   1   910    81   Rp  Rp-203   ex   91.0       91    910
6   1   202   837   Rp  Rp-203   ex   81.0       81    202
7   1   200   314   Rp  Rp-203   ex  837.0      837    200

#then aggregate join
df4 = df3.groupby(['ID','col3', 'col4'])[['ex_start','ex_end']].agg(','.join).reset_index()
print (df4)
   ID col3    col4   ex_start       ex_end
0   1   Rp  Rp-203  91,81,837  910,202,200

#filter by trans first rows per groups
df5 = df[df['type']=='trans'].drop_duplicates(['ID','col3', 'col4']).drop('type', axis=1)
print (df5)
   ID  col1  col2 col3    col4
1   1   146   314   Rp  Rp-203

#and add df5
df = df5.merge(df4)
print (df)
   ID  col1  col2 col3    col4   ex_start       ex_end
0   1   146   314   Rp  Rp-203  91,81,837  910,202,200
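For completeness, the whole pipeline above runs end to end on the sample frame (reconstructed here from the question):

```python
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    "ID":   [1] * 8,
    "col1": [146, 146, 603, 910, 910, 202, 202, 200],
    "col2": [91, 314, 91, 81, 81, 825, 837, 314],
    "col3": ["Rp"] * 8,
    "col4": ["Rp-203"] * 8,
    "type": ["ex", "trans", "CDS", "CDS", "ex", "CDS", "ex", "ex"],
})

# filter only ex rows, then shift col2 within each group
df3 = df[df['type'] == 'ex'].copy()
df3['s'] = df3.groupby(['ID', 'col3', 'col4'])['col2'].shift()

# drop the NaN first row per group, convert to strings
df3 = df3.dropna(subset=['s']).assign(
    ex_start=lambda x: x['s'].astype(int).astype(str),
    ex_end=lambda x: x['col1'].astype(str))

# aggregate with a comma join per group
df4 = df3.groupby(['ID', 'col3', 'col4'])[['ex_start', 'ex_end']].agg(','.join).reset_index()

# first trans row per group, then merge
df5 = df[df['type'] == 'trans'].drop_duplicates(['ID', 'col3', 'col4']).drop('type', axis=1)
result = df5.merge(df4)
print(result)
```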

Upvotes: 1
