colla
colla

Reputation: 837

Python pandas column filtering substring

I have a dataframe in python3 using pandas which has a column containing a string with a date.

This is the subset of the column

ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"
"2020-04-08"
"2020-04-12"

I would like to remove the rows that have the same month and day twice and keep the one with the newest year.

This would be what I would expect as a result from this subset

ColA
"2021-04-03"
"2021-04-08"
"2020-04-12"

The last two rows where removed because 2020-04-12 and 2020-04-08 already had the dates in 2021.

I thought of doing this with an apply and lambda but my real dataframe has hundreds of rows and tens of columns so it would not be efficient. Is there a more efficient way of doing this?

Upvotes: 1

Views: 147

Answers (2)

SimonR
SimonR

Reputation: 1824

I'm not sure you can get away from using an 'apply' to extract the relevant part of the date for grouping, but this is much easier if you first convert that column to a pandas datetime type:

df = pd.DataFrame({'colA':
["2021-04-03",
"2021-04-08",
"2020-04-12",
"2020-04-08",
"2020-04-12"]})

df['colA'] = df.colA.apply(pd.to_datetime)

Then you can group by the (day, month) and keep the highest value like so:

df.groupby(df.colA.apply(lambda x: (x.day, x.month))).max()

Upvotes: 0

Richie Youm
Richie Youm

Reputation: 11

There are a couple of ways you can do this. One of them would be to extract the year, sort it by year, and drop rows with duplicate month day pair.

# separate year and month-day pairs
df['year'] = df['ColA'].apply(lambda x: x[:4])
df['mo-day'] = df['ColA'].apply(lambda x: x[5:])
df.sort_values('year', inplace=True)
print(df)

This is what it would look like after separation and sorting:

         ColA  year mo-day
2  2020-04-12  2020  04-12
3  2020-04-08  2020  04-08
4  2020-04-12  2020  04-12
0  2021-04-03  2021  04-03
1  2021-04-08  2021  04-08

Afterwards, we can simply drop the duplicates and remove the additional columns:

# drop duplicate month-day pairs
df.drop_duplicates('mo-day', keep='first', inplace=True)

# get rid of the two columns
df.drop(['year','mo-day'], axis=1, inplace=True)

# since we dropped duplicate, reset the index
df.reset_index(drop=True, inplace=True)
print(df)

Final result:

         ColA
0  2020-04-12
1  2020-04-08
2  2021-04-03

This would be much faster than if you were to convert the entire column to datetime and extract dates, as you're working with the string as is.

Upvotes: 1

Related Questions