seagullnutkin
seagullnutkin

Reputation: 321

pandas group by a column, find min value of multiple columns, and create new column for the min row in group

I am trying to create a new column in a pandas dataframe. I want the column to have the value "yes" for a specific groupcol value if the datecol column for that row is the earliest date. However, some groups have duplicate datecol values, so if there are duplicates of the earliest datecol I want to then use the earliest date from the column publishcol. At the end, each group value based off groupcol should only have one row where there is a new column with the value "yes".

This is what I have:

| groupcol | namecol | datecol    | publishcol |
| -------- | ------- | ---------- | ---------- |
|  A       | Bob     | 2020-01-01 | 2020-01-01 |
|  A       | Ralph   | 2020-01-01 | 2020-01-04 |
|  B       | Carl    | 2020-04-04 | 2020-04-04 |
|  B       | Joe     | 2020-04-04 | 2020-05-05 |
|  B       | Fred    | 2020-03-04 | 2020-07-21 |

This is what I want:

| groupcol | namecol | datecol    | publishcol | keep |
| -------- | ------- | ---------- | ---------- | ---- |
|  A       | Bob     | 2020-01-01 | 2020-01-01 | yes  |
|  A       | Ralph   | 2020-01-01 | 2020-01-04 | no   |
|  B       | Carl    | 2020-04-04 | 2020-04-04 | no   |
|  B       | Joe     | 2020-04-04 | 2020-05-05 | no   |
|  B       | Fred    | 2020-03-04 | 2020-07-21 | yes  |

Right now this is what I'm doing:

test = pd.DataFrame({"groupcol": ["A", "A", "B", "B", "B"],
             "namecol": ["Bob", "Ralph", "Carl", "Joe", "Fred"],
              "datecol": ["2020-01-01", "2020-01-01", "2020-04-04", "2020-04-04", "2020-03-04"],
              "publishcol": ["2020-01-01", "2020-01-04", "2020-04-04", "2020-05-05", "2020-07-21"]
            })

# get min based off datecol
test['checkone']=np.where(
    test.datecol == test.groupby('groupcol')['datecol'].transform(min), 
    'want','drop')

# get min based off publish col
test['checktwo']=np.where(
    test.publishcol == test.groupby(['groupcol', 'datecol'])['publishcol'].transform(min), 
    'want','drop')

# get the final col
test['keep'] = np.where((test.checkone == "want") & (test.checktwo == "want"),
"yes", "no")

This gets me what I want, but it seems like a tedious way to do it. Is there a better way to do this?

Upvotes: 2

Views: 725

Answers (1)

Ben.T
Ben.T

Reputation: 29635

You can go using sort_values on groupcol and the two dates columns. use the groupcol to check where not equal (ne) to the shift of itself that will give True on the min per groupcol.

s = test.sort_values(['groupcol', 'datecol', 'publishcol'])['groupcol']
test['keep'] = s.ne(s.shift())
print(test)
  groupcol namecol     datecol  publishcol   keep
0        A     Bob  2020-01-01  2020-01-01   True
1        A   Ralph  2020-01-01  2020-01-04  False
2        B    Carl  2020-04-04  2020-04-04  False
3        B     Joe  2020-04-04  2020-05-05  False
4        B    Fred  2020-03-04  2020-07-21   True

Note that I would keep a boolean column instead of the 'yes' and 'no'. but you can still use test['keep'] = s.ne(s.shift()).map({True:'yes', False:'no'}) if you really want these values

Upvotes: 2

Related Questions