FGreen
FGreen

Reputation: 175

Creating 2 new columns by iterating through every row in multiple other columns using list comprehension

I have a dataframe called df that looks similar to this (except the number of 'Date' columns goes up to Date_8 and there are several hundred clients - I have simplified it here).

Client_ID  Date_1        Date_2        Date_3        Date_4
C1019876   relationship  no change     no change     no change
C1018765   no change     single        no change     no change    
C1017654   single        no change     relationship  NaN        
C1016543   NaN           relationship  no change     single
C1015432   NaN           no change     single        NaN

I want to create two new columns, first_status and last_status. first_status should equal the first given relationship status in the 4 date columns i.e. the first response that is either relationship or single, while last_status should equal the last given relationship status in the 4 date columns. The resulting df should look like this.

Client_ID  Date_1        Date_2        Date_3        Date_4        first_status  last_status
C1019876   relationship  no change     no change     no change     relationship  relationship 
C1018765   no change     single        no change     no change     single        single    
C1017654   single        no change     relationship  NaN           single        relationship   
C1016543   NaN           relationship  no change     single        relationship  single 
C1015432   NaN           no change     single        NaN           single        single

I think these two columns can be created through list comprehension, but I don't know how. For the first_status column I imagine the code would perform something like the following on every row in the df:

For the last_status column I imagine the code would perform something like the following on every row in the df:

Upvotes: 0

Views: 49

Answers (2)

It_is_Chris
It_is_Chris

Reputation: 14113

I suppose if you really wanted to use list comprehension you can but the solution from @yatu will be much faster:

# unstack and find the first column index where relationship or single occurs
first = df.unstack().groupby(level=1).apply(lambda x: (np.isin(x.values, ['relationship', 'single'])).argmax())
last = df.unstack()[::-1].groupby(level=1).apply(lambda x: (np.isin(x.values, ['relationship', 'single'])).argmax())

# list comprehension to find the index and column index pair
f_list = [x for x in enumerate(first)]
l_list = [x for x in enumerate(last)]

# list comprehension with iloc
f_val = [df.iloc[f_list[i]] for i in range(len(f_list))]
l_val = [df.loc[:, ::-1].iloc[l_list[i]] for i in range(len(l_list))]

# create columns
df['first'] = f_val 
df['last'] = l_val

  Client_ID        Date_1        Date_2        Date_3     Date_4  \
0  C1019876  relationship     no change     no change  no change   
1  C1018765     no change        single     no change  no change   
2  C1017654        single     no change  relationship        NaN   
3  C1016543           NaN  relationship     no change     single   
4  C1015432           NaN     no change        single        NaN   

          first          last  
0  relationship  relationship  
1        single        single  
2        single  relationship  
3  relationship        single  
4        single        single  

timeit results: 8 ms ± 230 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)

Upvotes: 0

yatu
yatu

Reputation: 88275

You can use replace no change with np.nan, and select the first and last valid values using bfill and ffill respectively:

df = df.replace('no change', np.nan)
df['first_status'] = df.bfill(axis=1).Date_1
df['last_status'] = df.loc[:,:'Date_4'].ffill(axis=1).Date_4
#df = df.fillna('no_change') # if needed

 Client_ID        Date_1        Date_2        Date_3  Date_4  first_status  \
0  C1019876  relationship           NaN           NaN     NaN  relationship   
1  C1018765           NaN        single           NaN     NaN        single   
2  C1017654        single           NaN  relationship     NaN        single   
3  C1016543           NaN  relationship           NaN  single  relationship   
4  C1015432           NaN           NaN        single     NaN        single   

    last_status  
0  relationship  
1        single  
2  relationship  
3        single  
4        single  

In the case of having Date columns up to n, use df.loc[:,:'Date_n'].ffill(axis=1).Date_n for the last_status

Upvotes: 3

Related Questions