Pandas comparison with variable number of columns

Question

I'm trying to create a column in pandas based off a comparison between columns that may vary in number and am trying to think of the quickest, cleanest way to do this:

id     date     birth_date_1  birth_date_2
 1   1/1/2000   1/3/2000       1/5/2000
 1   1/7/2000   1/3/2000       1/5/2000
 2   1/2/2000   1/10/2000       1/1/2000
 2   1/5/2000   1/10/2000       1/1/2000
 3   1/4/2000     NaT            NaT

I aim to create a new column that counts number of birth dates before current date:

id     date     birth_date_1  birth_date_2   num_born_before_date
 1   1/1/2000   1/3/2000       1/5/2000             0
 1   1/7/2000   1/3/2000       1/5/2000             2
 2   1/2/2000   1/10/2000       1/1/2000            1
 2   1/5/2000   1/10/2000       1/1/2000            1
 3   1/4/2000     NaT            NaT                0

The caveat is that the number of birth_date columns will vary from run to run. I don't want to iterate over entries as that would be insanely slow...

Edit: Came up with a bit of a dirty hack using np.where. Not sure if there's a better way to do this, especially in terms of dealing with NaTs.

NAT2 = pd.to_datetime('01-01-2100')  # need this to deal with NaTs
df = df.fillna(NAT2)

df['num_born'] = 0
created_cols = [c for c in df.columns if 'birth_date' in c]

for col in created_cols:
    df['num_born'] = np.where((df['date'] >= df[col]),
                              df['num_born'] + 1, df['num_born'])
df = df.replace(to_replace=NAT2, value=pd.NaT)

joris · Accepted Answer

So assuming your dataframe has parsed the datetime columns (you can use to_datetime for that, or eg specify parse_dates in read_csv):

In [64]: df
Out[64]:
   id       date birth_date_1 birth_date_2
0   1 2000-01-01   2000-01-03   2000-01-05
1   1 2000-01-07   2000-01-03   2000-01-05
2   2 2000-01-02   2000-01-10   2000-01-01
3   2 2000-01-05   2000-01-10   2000-01-01

You can now check where the values in the 'birth_date' columns are lower than the values in the 'date' column, and then use sum to count:

In [65]: df[['birth_date_1', 'birth_date_2']].lt(df['date'], axis=0)
Out[65]:
  birth_date_1 birth_date_2
0        False        False
1         True         True
2        False         True
3        False         True

In [66]: df[['birth_date_1', 'birth_date_2']].lt(df['date'], axis=0).sum(axis=1)

Out[66]:
0    0
1    2
2    1
3    1
dtype: int64

To deal with the varying number of 'birth_date' columns, you can do this automatically with filter, like this:

In [67]: df.filter(like="birth_date")
Out[67]:
  birth_date_1 birth_date_2
0   2000-01-03   2000-01-05
1   2000-01-03   2000-01-05
2   2000-01-10   2000-01-01
3   2000-01-10   2000-01-01

Altogether, this would give:

In [66]: df.filter(like="birth_date").lt(df['date'], axis=0).sum(axis=1)

Out[66]:
0    0
1    2
2    1
3    1
dtype: int64

Pandas comparison with variable number of columns

Answers (1)

Related Questions