Reputation: 31
I have a table with a list of transactions across a range of different campaigns (or projects). What I'm trying to do is derive when each campaign was started based on the day of the first transaction in each. The data covers all transactions across all campaigns with only following two columns being relevant: Campaign.ID (numeric), Created.At (yyyy-mm-dd). This is what I have at the moment:
temp$Recency = as.integer(today() - temp$Created.At) # Get number of days ago
df = temp %>% # Multiple transactions per user - we only need the most recent
group_by(Campaign.ID) %>% # Group transactions together by campaign
slice(which.max(temp$Recency)) # Find first transaction - ie biggest Recency value
df = as.data.frame(df)
So I create the temp$Recency
column to change the date column to an integer of days from today. I then group the data based on their campaign ID and try to select only the first transaction for each ie. the one with the largest value in temp$Recency
What's puzzling me is that this code works fine if I use which.min
for finding the last transaction of each campaign and returns a data frame with a single (most recent) instance for each campaign. Yet when I try it with which.max
it returns an empty data frame.
I don't understand the reason for this, surely if which.min
works as expected then which.max
should too. I've tried searching for an answer but haven't come across anyone with the same issue.
The data looks something like this:
Created.At Campaign.ID Recency
2018-06-21 1883 13
2018-06-21 1890 13
2018-06-20 1883 14
There are about ~3000 records across 50-60 different campaigns
which.min returns something like this:
Created.At Campaign.ID Recency
2017-07-02 19 367
2017-05-25 91 405
2017-06-06 344 393
2017-06-30 451 369
2017-06-30 509 369
2017-08-16 551 322
I need the same thing but for which.max
Upvotes: 1
Views: 2408
Reputation: 11878
You should remove the temp$
from your slice()
call.
As it is, which.max(temp$Recency)
will return the index of the maximum value in the entire data (not one for each Campaign.ID
), which then is out of bounds for any of the individual groups defined by Campaign.ID
(I'm surprised this doesn't throw an error).
The expression in slice()
is evaluated in the context of the data. However, as your data don't contain a column called temp
, when the call is evaluated, temp$Recency
gets found in the global environment instead. As a result, the code that you have is equivalent to:
max_row <- which.max(temp$Recency)
temp %>%
group_by(Campaign.ID) %>%
slice(max_row)
Then any groups with less than max_row
observations just get left out.
However, if you write slice(which.max(Recency))
, then Recency
is found as a column in the, now grouped data, and is correctly evaluated in each group.
So, to get the result you are after, just write:
df %>%
group_by(Campaign.ID) %>%
slice(which.max(Recency))
as @LAP commented.
As to why which.min
seemed to work, it's probably because the row with the global minimum is likely one of the first (maybe even the first?), so that most (if not all) of the groups have at least that many rows. I.e. if which.min(temp$Recency) == 1
you would always just get the first row in each subgroup.
Upvotes: 1