Reputation: 321
I am trying to create a new column in a pandas dataframe. I want the column to have the value "yes" for a specific groupcol
value if the datecol
column for that row is the earliest date. However, some groups have duplicate datecol
values, so if there are duplicates of the earliest datecol
I want to then use the earliest date from the column publishcol
. At the end, each group value based off groupcol
should only have one row where there is a new column with the value "yes".
This is what I have:
| groupcol | namecol | datecol | publishcol |
| -------- | ------- | ---------- | ---------- |
| A | Bob | 2020-01-01 | 2020-01-01 |
| A | Ralph | 2020-01-01 | 2020-01-04 |
| B | Carl | 2020-04-04 | 2020-04-04 |
| B | Joe | 2020-04-04 | 2020-05-05 |
| B | Fred | 2020-03-04 | 2020-07-21 |
This is what I want:
| groupcol | namecol | datecol | publishcol | keep |
| -------- | ------- | ---------- | ---------- | ---- |
| A | Bob | 2020-01-01 | 2020-01-01 | yes |
| A | Ralph | 2020-01-01 | 2020-01-04 | no |
| B | Carl | 2020-04-04 | 2020-04-04 | no |
| B | Joe | 2020-04-04 | 2020-05-05 | no |
| B | Fred | 2020-03-04 | 2020-07-21 | yes |
Right now this is what I'm doing:
test = pd.DataFrame({"groupcol": ["A", "A", "B", "B", "B"],
"namecol": ["Bob", "Ralph", "Carl", "Joe", "Fred"],
"datecol": ["2020-01-01", "2020-01-01", "2020-04-04", "2020-04-04", "2020-03-04"],
"publishcol": ["2020-01-01", "2020-01-04", "2020-04-04", "2020-05-05", "2020-07-21"]
})
# get min based off datecol
test['checkone']=np.where(
test.datecol == test.groupby('groupcol')['datecol'].transform(min),
'want','drop')
# get min based off publish col
test['checktwo']=np.where(
test.publishcol == test.groupby(['groupcol', 'datecol'])['publishcol'].transform(min),
'want','drop')
# get the final col
test['keep'] = np.where((test.checkone == "want") & (test.checktwo == "want"),
"yes", "no")
This gets me what I want, but it seems like a tedious way to do it. Is there a better way to do this?
Upvotes: 2
Views: 725
Reputation: 29635
You can go using sort_values
on groupcol and the two dates columns. use the groupcol to check where not equal (ne
) to the shift
of itself that will give True on the min per groupcol.
s = test.sort_values(['groupcol', 'datecol', 'publishcol'])['groupcol']
test['keep'] = s.ne(s.shift())
print(test)
groupcol namecol datecol publishcol keep
0 A Bob 2020-01-01 2020-01-01 True
1 A Ralph 2020-01-01 2020-01-04 False
2 B Carl 2020-04-04 2020-04-04 False
3 B Joe 2020-04-04 2020-05-05 False
4 B Fred 2020-03-04 2020-07-21 True
Note that I would keep a boolean column instead of the 'yes' and 'no'. but you can still use test['keep'] = s.ne(s.shift()).map({True:'yes', False:'no'})
if you really want these values
Upvotes: 2