Reputation: 11

Python Pandas: Number of times the same value appears in different columns within the same row

I have a wide dataset and am trying to figure out how many times the value in column 2019_5 appears for that same member and whether it's continuous or not. The best I've managed to come up with is dataframe['number_yrs'] = 5 - dataframe.isnull().sum(axis=1) which gives the following. The problem is that it just looks at whether there is or isn't a NaN, not whether the value is equal to 2019_5

  member  2015_5  2016_5  2017_5  2018_5  2019_5  number_yrs
0    aaa     NaN     NaN     NaN     NaN      12           1
1    bbb     7.0     7.0     7.0     7.0       7           5
2    ccc    10.0    10.0     NaN     NaN      10           3
3    ddd    12.0     NaN     NaN     7.0       7           3
4    eee    12.0     NaN    10.0     NaN      10           3

What I want is for it to be 2 for member ddd and 2 for member eee

I'd also like to add a continuous column that is y/n and indicates whether number_yrs is continuous. I expect it would look something like this when all is said and done (correctly)

  member  number_yrs continuous
0    aaa           1          n
1    bbb           5          y
2    ccc           3          n
3    ddd           2          y
4    eee           2          n

Upvotes: 1

Answers (2)

Mustafa Aydın

Reputation: 18306

For the number_yrs, you can compare the 2019_5 column against the whole dataframe (but transposed to assure alignment since the latter will be a series):

# keep `member` column aside from calculations
>>> df = df.set_index("member")
>>> df.T == df["2019_5"]

member    aaa   bbb    ccc    ddd    eee
2015_5  False  True   True  False  False
2016_5  False  True   True  False  False
2017_5  False  True  False  False   True
2018_5  False  True  False   True  False
2019_5   True  True   True   True   True

As you can see, this returns a boolean frame where each entry is marked whether they are equal to 2019_5's value per row (and last row is full Trues as expected). Now we can sum the Trues per member to get number_yrs:

>>> (df.T == df["2019_5"]).sum(axis=0)

member
aaa    1
bbb    5
ccc    3
ddd    2
eee    2

As for the continutiy, we can look at 2018_5s value and see if it's the same as that of 2019_5 for each row:

>>> df["2018_5"] == df["2019_5"]

member
aaa    False
bbb     True
ccc    False
ddd     True
eee    False

We're almost there; we can map True/False values to "y"/"n" with map:

>>> (df["2018_5"] == df["2019_5"]).map({True: "y", False: "n"})

member
aaa    n
bbb    y
ccc    n
ddd    y
eee    n

So we have the needed new columns. Putting together and assigning those to the dataframe:

df["number_yrs"] = (df.T == df["2019_5"]).sum(axis=0)
df["continuous"] = (df["2018_5"] == df["2019_5"]).map({True: "y", False: "n"})

gives

>>> df

        2015_5  2016_5  2017_5  2018_5  2019_5  number_yrs continuous
member
aaa        NaN     NaN     NaN     NaN      12           1          n
bbb        7.0     7.0     7.0     7.0       7           5          y
ccc       10.0    10.0     NaN     NaN      10           3          n
ddd       12.0     NaN     NaN     7.0       7           2          y
eee       12.0     NaN    10.0     NaN      10           2          n

Upvotes: 0

Andrej Kesely

Reputation: 195438

Try:

df["number_yrs"] = df.filter(regex="_5$").apply(
    lambda x: x.eq(x["2019_5"]).sum(), axis=1
)
df["continuous"] = np.where(
    df.filter(regex="_5$").apply(
        lambda x: sorted(m := x.eq(x["2019_5"])) == m.tolist() and m.sum() > 1,
        axis=1,
    ),
    "y",
    "n",
)
print(df)

Prints:

  member  2015_5  2016_5  2017_5  2018_5  2019_5  number_yrs continuous
0    aaa     NaN     NaN     NaN     NaN      12           1          n
1    bbb     7.0     7.0     7.0     7.0       7           5          y
2    ccc    10.0    10.0     NaN     NaN      10           3          n
3    ddd    12.0     NaN     NaN     7.0       7           2          y
4    eee    12.0     NaN    10.0     NaN      10           2          n

Upvotes: 2

Python Pandas: Number of times the same value appears in different columns within the same row

Answers (2)

Related Questions