Reputation: 11
I have a wide dataset and am trying to figure out how many times the value in column 2019_5
appears for that same member and whether it's continuous or not. The best I've managed to come up with is dataframe['number_yrs'] = 5 - dataframe.isnull().sum(axis=1)
which gives the following. The problem is that it just looks at whether there is or isn't a NaN, not whether the value is equal to 2019_5
member 2015_5 2016_5 2017_5 2018_5 2019_5 number_yrs
0 aaa NaN NaN NaN NaN 12 1
1 bbb 7.0 7.0 7.0 7.0 7 5
2 ccc 10.0 10.0 NaN NaN 10 3
3 ddd 12.0 NaN NaN 7.0 7 3
4 eee 12.0 NaN 10.0 NaN 10 3
What I want is for it to be 2
for member ddd
and 2
for member eee
I'd also like to add a continuous
column that is y/n and indicates whether number_yrs
is continuous. I expect it would look something like this when all is said and done (correctly)
member number_yrs continuous
0 aaa 1 n
1 bbb 5 y
2 ccc 3 n
3 ddd 2 y
4 eee 2 n
Upvotes: 1
Views: 377
Reputation: 18306
For the number_yrs
, you can compare the 2019_5
column against the whole dataframe (but transposed to assure alignment since the latter will be a series):
# keep `member` column aside from calculations
>>> df = df.set_index("member")
>>> df.T == df["2019_5"]
member aaa bbb ccc ddd eee
2015_5 False True True False False
2016_5 False True True False False
2017_5 False True False False True
2018_5 False True False True False
2019_5 True True True True True
As you can see, this returns a boolean frame where each entry is marked whether they are equal to 2019_5
's value per row (and last row is full True
s as expected). Now we can sum the True
s per member
to get number_yrs
:
>>> (df.T == df["2019_5"]).sum(axis=0)
member
aaa 1
bbb 5
ccc 3
ddd 2
eee 2
As for the continutiy, we can look at 2018_5
s value and see if it's the same as that of 2019_5
for each row:
>>> df["2018_5"] == df["2019_5"]
member
aaa False
bbb True
ccc False
ddd True
eee False
We're almost there; we can map True
/False
values to "y"
/"n"
with map
:
>>> (df["2018_5"] == df["2019_5"]).map({True: "y", False: "n"})
member
aaa n
bbb y
ccc n
ddd y
eee n
So we have the needed new columns. Putting together and assigning those to the dataframe:
df["number_yrs"] = (df.T == df["2019_5"]).sum(axis=0)
df["continuous"] = (df["2018_5"] == df["2019_5"]).map({True: "y", False: "n"})
gives
>>> df
2015_5 2016_5 2017_5 2018_5 2019_5 number_yrs continuous
member
aaa NaN NaN NaN NaN 12 1 n
bbb 7.0 7.0 7.0 7.0 7 5 y
ccc 10.0 10.0 NaN NaN 10 3 n
ddd 12.0 NaN NaN 7.0 7 2 y
eee 12.0 NaN 10.0 NaN 10 2 n
Upvotes: 0
Reputation: 195438
Try:
df["number_yrs"] = df.filter(regex="_5$").apply(
lambda x: x.eq(x["2019_5"]).sum(), axis=1
)
df["continuous"] = np.where(
df.filter(regex="_5$").apply(
lambda x: sorted(m := x.eq(x["2019_5"])) == m.tolist() and m.sum() > 1,
axis=1,
),
"y",
"n",
)
print(df)
Prints:
member 2015_5 2016_5 2017_5 2018_5 2019_5 number_yrs continuous
0 aaa NaN NaN NaN NaN 12 1 n
1 bbb 7.0 7.0 7.0 7.0 7 5 y
2 ccc 10.0 10.0 NaN NaN 10 3 n
3 ddd 12.0 NaN NaN 7.0 7 2 y
4 eee 12.0 NaN 10.0 NaN 10 2 n
Upvotes: 2