Martin
Martin

Reputation: 63

How to find the lowest value in a data frame matching specific criterias and return a specific column

Firstly, I'm pretty new to R and I'm also new with asking questions on SO, so bear with me if I'm asking stupid questions or am not following SO conventions.

I'm trying to find the best subscription type for multiple users based on their former calling behaviour. Until now I've managed to approx. match 98'000 rows to calculate the effective costs with a (variable) number of subscription types.

effective costs

There's also a data frame with the predicted costs per subscription type for every month: predicted costs

Now I'm trying to find the best subscription type where the costs are lower than the current one for every user and month. I'll do a merge to show which results I'd expect:

comparison

So, in month 2019-01 the costs of subscription_2 are lower than the current subscription for User1, therefore subscription2 should be the recommendation. For months 2019-02 and 2019-03 there is no recommendation as there are no subscription types with lower costs.

For User2 subscription type subscription_3 should be the recommendation in all months, since these costs are always lower than the current subscription.

I'm currently following the courses on DataCamp.com and I am pretty sure that this is almost certainly a VERY basic action in r, but I need someone to guide me into the right direction.

This is what I have so far:

library(dplyr)

effective.costs <- data.frame(
  user = c(rep("User1", 3), rep("User2", 3)),
  month = c(rep(c("2019-01", "2019-02", "2019-03"), 2)),
  current_subscription = c(rep("subscription_1", 3), rep("subscription_2", 3)),
  costs = c(70, 20, 50, 150, 130, 170)
)

predicted.costs <- data.frame(
  user = c(rep("User1", 9), rep("User2", 9)),
  month = c(rep("2019-01",3), rep("2019-02", 3), rep("2019-03", 3)),
  subscription = c(rep(c("subscription_1", "subscription_2", "subscription_3"), 6)),
  calculated_costs = c(
    c(70, 50, 110, 20, 50, 70, 50, 80, 120), 
    c(190, 150, 110, 210, 130, 110, 250, 170, 110)
    )
)

comparison <- merge(effective.costs, predicted.costs, by = c("user", "month"))

getRecommendation <- function(x) {
  subscription <- predicted.costs %>% 
    filter(
      calculated_costs < x['costs'] & 
      user == x['user'] & 
      month == x['month']
    ) %>%
    arrange(calculated_costs) %>%
    select(subscription) 
  subscription <- ifelse(
    length(subscription) > 0, 
    as.character(subscription[1, 1]), 
    NA
  )
  # I know return is not needed, but I'm used to it... :-)
  return(subscription)
}

effective.costs$recommendation <- apply(effective.costs, 1, getRecommendation)

View(effective.costs)

The most important part here is probably the function getRecommendation:

getRecommendation <- function(x) {
  subscription <- predicted.costs %>% 
    filter(
      calculated_costs < x['costs'] & 
      user == x['user'] & 
      month == x['month']
    ) %>%
    arrange(calculated_costs) %>%
    select(subscription) 
  subscription <- ifelse(
    length(subscription) > 0, 
    as.character(subscription[1, 1]), 
    NA
  )
  # I know return is not needed, but I'm used to it... :-)
  return(subscription)
}

which I'm trying to apply to every row in effective.costs:

effective.costs$recommendation <- apply(effective.costs, 1, getRecommendation)

While this gives me the correct output for User2, I currently believe this to be coincident since there's no recommendation for User1, even if there should be one for month 2019-01:

wrong result

Could someone please push me into the right direction?

Thank you!

Upvotes: 2

Views: 1054

Answers (1)

Cole
Cole

Reputation: 11255

This does away with the apply and the getRecommendation function. R is vectorized so we should try to think column-wise as much as we can.

comparison <- merge(effective.costs, predicted.costs, by = c("user", "month"))

comparison%>%
  mutate(net_savings = calculated_costs-costs)%>%
  group_by(user, month)%>%
  filter(net_savings == min(net_savings))%>%
  slice(1) #for ties

The issue with your apply() function is that apply() coerces the data.frame to a matrix. A matrix can only have one class type - in this case, you're comparing a number to a string in calculated_costs < x['costs'].

Specifically, the evaluation is calculated_costs < ' 50' where there is an extra space for 2-digit numbers. For whatever reason, 50 < ' 70' evaluates FALSE whereas 110 < '190' evaluates TRUE.

The solution is to approach the problem differently in this case. There's no need to do rowwise operations via apply.

Upvotes: 1

Related Questions