Reputation: 175
I have a dataframe called df that looks similar to this (except that the number of Date columns goes up to Date_8 and there are several hundred clients - I have simplified it here).
Client_ID Date_1 Date_2 Date_3 Date_4
C1019876 relationship no change no change no change
C1018765 no change single no change no change
C1017654 single no change relationship NaN
C1016543 NaN relationship no change single
C1015432 NaN no change single NaN
I want to create two new columns, first_status and last_status. first_status should equal the first given relationship status in the 4 date columns, i.e. the first response that is either relationship or single, while last_status should equal the last given relationship status in the 4 date columns. The resulting df should look like this.
Client_ID Date_1 Date_2 Date_3 Date_4 first_status last_status
C1019876 relationship no change no change no change relationship relationship
C1018765 no change single no change no change single single
C1017654 single no change relationship NaN single relationship
C1016543 NaN relationship no change single relationship single
C1015432 NaN no change single NaN single single
I think these two columns can be created through list comprehension, but I don't know how. For the first_status column I imagine the code would perform something like the following on every row in the df:

Go to the first Date column where a value is given (filters out NaN)
If the value is no change, go to the next Date column
If the value is relationship, first_status = relationship
If the value is single, first_status = single

For the last_status column I imagine the code would perform something like the following on every row in the df:

Go to the last Date column where a value is given (filters out NaN)
If the value is no change, go to the previous Date column
If the value is relationship, last_status = relationship
If the value is single, last_status = single
Upvotes: 0
Views: 49
Reputation: 14113
I suppose if you really wanted to use list comprehension you can, but the solution from @yatu will be much faster:
import numpy as np

# unstack and find the position of the first column where 'relationship' or 'single' occurs
first = df.unstack().groupby(level=1).apply(lambda x: np.isin(x.values, ['relationship', 'single']).argmax())
last = df.unstack()[::-1].groupby(level=1).apply(lambda x: np.isin(x.values, ['relationship', 'single']).argmax())
# list comprehensions to build (row, column) position pairs
f_list = [x for x in enumerate(first)]
l_list = [x for x in enumerate(last)]
# list comprehensions with iloc to pull out the values
f_val = [df.iloc[pos] for pos in f_list]
l_val = [df.loc[:, ::-1].iloc[pos] for pos in l_list]
# create columns
df['first'] = f_val
df['last'] = l_val
Client_ID Date_1 Date_2 Date_3 Date_4 \
0 C1019876 relationship no change no change no change
1 C1018765 no change single no change no change
2 C1017654 single no change relationship NaN
3 C1016543 NaN relationship no change single
4 C1015432 NaN no change single NaN
first last
0 relationship relationship
1 single single
2 single relationship
3 relationship single
4 single single
timeit results: 8 ms ± 230 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)
Upvotes: 0
Reputation: 88275
You can replace no change with np.nan, and select the first and last valid values using bfill and ffill respectively:
import numpy as np

df = df.replace('no change', np.nan)
df['first_status'] = df.bfill(axis=1).Date_1
df['last_status'] = df.loc[:, :'Date_4'].ffill(axis=1).Date_4
#df = df.fillna('no change')  # if needed
Client_ID Date_1 Date_2 Date_3 Date_4 first_status \
0 C1019876 relationship NaN NaN NaN relationship
1 C1018765 NaN single NaN NaN single
2 C1017654 single NaN relationship NaN single
3 C1016543 NaN relationship NaN single relationship
4 C1015432 NaN NaN single NaN single
last_status
0 relationship
1 single
2 relationship
3 single
4 single
In the case of having Date columns up to Date_n, use df.loc[:, :'Date_n'].ffill(axis=1).Date_n for the last_status.
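For example, the column names don't have to be hard-coded at all: a small sketch of the same bfill/ffill idea (assuming all date columns share the Date_ prefix; the sample frame here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Client_ID': ['C1', 'C2'],
    'Date_1': ['no change', np.nan],
    'Date_2': ['single', 'relationship'],
    'Date_3': [np.nan, 'no change'],
})

# select every Date_* column, however many there are
dates = df.filter(like='Date_').replace('no change', np.nan)

# first/last non-NaN value per row via bfill/ffill along the columns
df['first_status'] = dates.bfill(axis=1).iloc[:, 0]
df['last_status'] = dates.ffill(axis=1).iloc[:, -1]
```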
Upvotes: 3