Saul Feliz

Reputation: 668

How to rank a column in an R data frame based on another column

Suppose I have an R data frame that looks like this:

#sample data frame
df <- data.frame(
customer_id = c(568468,568468,568468,568468,568468,568468),
customer = c('paramount','paramount','paramount','paramount','paramount','paramount'),
start_date = as.Date(c('2016-03-15','2016-03-15','2016-03-15','2016-03-15','2016-03-15','2016-03-15')),
occured_on = as.POSIXct(c('2017-08-08 20:05:00','2017-08-08 20:30:00','2017-08-11 21:13:00','2017-08-11 21:30:00','2017-08-31 05:16:00','2017-08-31 05:30:00')),
old_plan = c('a',NA,'b',NA,'b',NA),
old_price = c(NA,29,NA,99,NA,82.5),
old_recurrence = c('monthly',NA,'monthly',NA,'annually',NA),
new_plan = c('b',NA,'b',NA,'c',NA),
new_price = c(NA,99,NA,82.5,NA,349),
new_recurrence = c('monthly',NA,'annually',NA,'monthly',NA)
);

Task:

Within each group, take the first non-NA old_plan, old_price, and old_recurrence, ordered by the earliest occured_on time, and the last non-NA new_plan, new_price, and new_recurrence, ordered by the latest occured_on time. My resulting data frame would then hold the first old plan, price, and recurrence alongside the last new plan, price, and recurrence. NAs should be ignored. The resulting data frame should look like this:

customer_id  customer start_date old_plan old_price old_recurrence new_plan new_price new_recurrence
568468 paramount 2016-03-15        a        29        monthly        c       349        monthly

Or, if you want to see it as code:

result_df <- data.frame(
customer_id = 568468,
customer = 'paramount',
start_date = as.Date("2016-03-15"),
old_plan = 'a',
old_price = 29,
old_recurrence = 'monthly',
new_plan = 'c',
new_price = 349,
new_recurrence = 'monthly'
)

I feel I'm close using functions like these...

df$old_plan_rank <- rank(df$old_plan, na.last = "keep", ties.method = "min")
df$new_recurrence_rank <- rank(df$new_recurrence, na.last = "keep", ties.method = "max")

Except that this ranks alphabetically or numerically, not by the order in which the rows actually occurred according to the occured_on column. I don't know how to specify which column to rank on.

Help?

Upvotes: 0

Views: 2152

Answers (1)

www

Reputation: 39154

A solution using dplyr.

library(dplyr)

df2 <- df %>%
  arrange(customer_id, start_date, occured_on) %>%
  group_by(customer_id, customer, start_date) %>%
  summarise(old_plan = first(old_plan[!is.na(old_plan)]),
            old_price = first(old_price[!is.na(old_price)]),
            old_recurrence = first(old_recurrence[!is.na(old_recurrence)]),
            new_plan = last(new_plan[!is.na(new_plan)]),
            new_price = last(new_price[!is.na(new_price)]),
            new_recurrence = last(new_recurrence[!is.na(new_recurrence)])) %>%
  ungroup() %>%
  as.data.frame()
df2
#   customer_id  customer start_date old_plan old_price old_recurrence new_plan new_price new_recurrence
# 1      568468 paramount 2016-03-15        a        29        monthly        c       349        monthly

Explanation

arrange(customer_id, start_date, occured_on) sorts the rows: first by customer_id, then by start_date, and finally by occured_on.
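To see why the sorting step matters, here is a minimal base R sketch. The events data frame is a made-up, deliberately shuffled subset of the question's data, and the tz = "UTC" argument is an assumption added only to make the example reproducible. Once the rows are ordered by occured_on, the last row holds the most recent value.

```r
# Hypothetical, deliberately unsorted subset of the question's data.
events <- data.frame(
  occured_on = as.POSIXct(c('2017-08-31 05:30:00',
                            '2017-08-08 20:30:00',
                            '2017-08-11 21:30:00'), tz = "UTC"),
  new_price = c(349, 99, 82.5)
)

# Reorder the rows chronologically; the base R counterpart of arrange().
sorted <- events[order(events$occured_on), ]

sorted$new_price           # 99.0 82.5 349.0
tail(sorted$new_price, 1)  # 349, the most recent price
```

Without the sort, taking the last element of the shuffled column would have returned 82.5 instead.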

group_by(customer_id, customer, start_date) makes the following operation run separately within each group defined by customer_id, customer, and start_date.

summarise generates a single summary value for each variable in each group.

For each variable, taking old_plan as an example, I used old_plan[!is.na(old_plan)] to extract the non-NA values of that column. After that, first and last extract the first or the last element of those values, which, because the rows are already sorted by occured_on, correspond to the earliest and latest values in time.
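The filter-then-pick idea can be tried on plain vectors. This is only a sketch in base R, using head() and tail() as stand-ins for dplyr's first() and last(); the two vectors are copied from the question's old_price and new_price columns, which are already in chronological order.

```r
# Values copied from the question's data, already sorted by occured_on.
old_price <- c(NA, 29, NA, 99, NA, 82.5)
new_price <- c(NA, 99, NA, 82.5, NA, 349)

# Drop the NAs, then keep the earliest / latest remaining value.
first_old <- head(old_price[!is.na(old_price)], 1)
last_new  <- tail(new_price[!is.na(new_price)], 1)

first_old  # 29
last_new   # 349
```

These match the old_price and new_price values in the expected result.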

ungroup() removes the grouping. as.data.frame() is optional; it converts the tibble to a plain data.frame.

Upvotes: 1
