Reputation: 97

Group by and keep the entire row which has max value across different columns

I have a df that looks like:

# Groups:   telefono1 [360]
   nombre telefono1 telefono2 ads_ao ads_am ads_mas ads_ma total_price_ao total_price_am total_price_mas~ total_price_ma
   <chr>  <chr>     <lgl>     <chr>   <dbl> <chr>      <chr>  <chr>                   <dbl> <chr>            <chr>         
 1 CRIST~ 60***0400 NA        0           0 0          0      0                           0 0                0             
 2 CRIST~ 60***0400 NA        0          61 0          0      0                           0 0                0             
 3 CRIST~ 60***0400 NA        0           0 0          0      0                           0 0                0             
 4 DIEGO~ 60***0844 NA        0           0 0          0      0                           0 0                0             
 5 Diego  60***0844 FALSE     0           0 0          2      0                           0 0                22500         
 6 TALLE~ 60***4848 NA        2           0 0          0      59000                       0 0                0             
 7 TALLE~ 60***4848 NA        0         311 0          0      0                           0 0                0             
 8 Rub?n  60***4848 FALSE     0           0 0          2      0                           0 0                59000

I need to group by telefono1 and then keep the entire row which value across columns "ads_ao, ads,am, ads_mas and ads_ma" was the maximum value.

In the example, the right solution were rows #2, #5 and #7 because they are the ones, groupping by telefono1, that have the maximum value across mentioned columns.

I tried it with dplyr

  group_by(telefono1) %>%
  mutate(max_value = max(ads_ao, ads_am, ads_mas, ads_ma)) %>%
  ungroup()

But with this method I get a new variable with max value in all rows and then I can't use a top_n() to get the row with the max value because all are equal.

Please, anyone know any solution? Thanks!!

Upvotes: 2

Answers (2)

akrun

Reputation: 887118

If it is to filter the row with the max values across columns that starts with 'ads', create the 'max_value' column(it is just to show the steps more clearly) with pmax (vectorized option to get the max per row), then grouped by 'telefono1', use slice_max on the 'max_value' column to return only that row where the 'max_value' is max, ungroup and remove the 'max_value' column

library(dplyr)
library(purrr)
df1 %>%
   mutate(max_value = invoke(pmax, select(., starts_with('ads')))) %>%
   group_by(telefono1) %>%
   slice_max(max_value) %>%
   # or use 
   # slice(which.max(max_value)) %>%
   ungroup %>%
   select(-max_value)  
# A tibble: 3 x 11
#  nombre telefono1 telefono2 ads_ao ads_am ads_mas ads_ma total_price_ao total_price_am total_price_mas. total_price_ma
#  <chr>      <int> <lgl>      <int>  <int>   <int>  <int>          <int>          <int>            <int>          <int>
#1 CRIST~ 600040400 NA             0     61       0      0              0              0                0              0
#2 Diego  600530844 FALSE          0      0       0      2              0              0                0          22500
#3 TALLE~ 600674848 NA             0    311       0      0              0              0                0              0

If the dataset is already grouped, ungroup it first before creating the max_value column

df1 %>%
   ungroup %>%
   mutate(max_value = invoke(pmax, select(., starts_with('ads')))) %>%   
   group_by(telefono1) %>%
   slice(which.max(max_value))

Or using base R

mx <- do.call(pmax, df1[startsWith(names(df1), 'ads')])
subset(df1, mx == ave(mx, telefono1, FUN  = max))
#  nombre telefono1 telefono2 ads_ao ads_am ads_mas ads_ma total_price_ao total_price_am total_price_mas. total_price_ma
#2 CRIST~ 600040400        NA      0     61       0      0              0              0                0              0
#5  Diego 600530844     FALSE      0      0       0      2              0              0                0          22500
#7 TALLE~ 600674848        NA      0    311       0      0              0              0                0              0

data

df1 <- structure(list(nombre = c("CRIST~", "CRIST~", "CRIST~", "DIEGO~", 
"Diego", "TALLE~", "TALLE~", "Rub?n"), telefono1 = c(600040400L, 
600040400L, 600040400L, 600530844L, 600530844L, 600674848L, 600674848L, 
600674848L), telefono2 = c(NA, NA, NA, NA, FALSE, NA, NA, FALSE
), ads_ao = c(0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L), ads_am = c(0L, 
61L, 0L, 0L, 0L, 0L, 311L, 0L), ads_mas = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L), ads_ma = c(0L, 0L, 0L, 0L, 2L, 0L, 0L, 2L), total_price_ao = c(0L, 
0L, 0L, 0L, 0L, 59000L, 0L, 0L), total_price_am = c(0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L), total_price_mas. = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L), total_price_ma = c(0L, 0L, 0L, 0L, 22500L, 0L, 0L, 
59000L)), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5", "6", "7", "8"))

Upvotes: 1

tmfmnk

Reputation: 39858

One dplyr and purrr option could be:

df %>%
 group_by(telefono1) %>%
 slice_max(pmap_dbl(across(starts_with("ads")), max), 1)

  nombre telefono1 telefono2 ads_ao ads_am ads_mas ads_ma total_price_ao total_price_am total_price_mas. total_price_ma
  <chr>      <int> <lgl>      <int>  <int>   <int>  <int>          <int>          <int>            <int>          <int>
1 CRIST~ 600040400 NA             0     61       0      0              0              0                0              0
2 Diego  600530844 FALSE          0      0       0      2              0              0                0          22500
3 TALLE~ 600674848 NA             0    311       0      0              0              0                0              0

Upvotes: 1

Group by and keep the entire row which has max value across different columns

Answers (2)

data

Related Questions