Reputation: 159
Below is my current and desired data set. When date
priority
ID
and revenue
are the same but code
is different I want to only keep the row with the 'highest' code.
The hierarchy for the codes is as follows: B>A>C. If there is any B, does not matter where in the string, it is assigned hierarchy 1.
large_df_have
ID Date Priority Revenue Code V1 V2 V3
1 418 1/01/2020 1 -866 A XX3 XX1 XX3
2 418 1/01/2020 1 -866 AB XX2 XX2 XX3
3 418 1/01/2020 1 -866 A XX3 XX1 XX3
large_df_want
ID Date Priority Revenue Code V1 V2 V3
2 418 1/01/2020 1 -866 AB XX2 XX2 XX3
Upvotes: 1
Views: 285
Reputation: 93813
I think there are still some minor issues to clarify here, but self-plagiarising from this old question: How to reclassify/replace values based on priority when there are repeats , I think it makes sense to use an ordered factor:
library(data.table)
## set the order (small to large)
lev <- c("C","A","B")
setDT(have)
have[, ord := ordered(sapply(strsplit(Code, ""),
function(x) max(ordered(x,levels=lev))), levels=lev)]
have[ have[, which.max(ord), by=.(ID, Date, Priority, Revenue)]$V1, ]
# ID Date Priority Revenue Code V1 V2 V3 ord
#1: 418 1/01/2020 1 -866 AB XX2 XX2 XX3 B
#2: 418 1/01/2020 1 -866 A XX3 XX1 XX3 A
Using this extended data with two groups:
have <- read.table(text="
ID Date Priority Revenue Code V1 V2 V3
418 1/01/2020 1 -866 A XX3 XX1 XX3
418 1/01/2020 1 -866 AB XX2 XX2 XX3
418 1/01/2020 1 -866 A XX3 XX1 XX3
419 1/01/2020 1 -866 A XX3 XX1 XX3
419 1/01/2020 1 -866 A XX2 XX2 XX3
419 1/01/2020 1 -866 C XX3 XX1 XX3
", header=TRUE, stringsAsFactors=FALSE)
Upvotes: 0
Reputation: 1780
You can create a factor, and then use dplyr::distinct()
:
library(dplyr)
df_have <- data.frame(ID = c(418, 418, 418),
Date = c("1/01/2020", "1/01/2020","1/01/2020"),
Priority = c(1, 1, 1),
Revenue = c(-866, -866, -866),
Code = c("A", "AB", "A"),
V1 = c("XX3", "XX2", "XX3"),
V2 = c("XX1", "XX2", "XX1"),
V3 = c("XX3", "XX3", "XX3"))
# Hierarchy of combinations
hierarchy <- c("B", "BA", "AB", "A", "C")
# Create factor
df_have %>%
mutate(Code = factor(Code, levels = hierarchy)) %>%
arrange(ID, Date, Priority, Revenue, Code) %>%
distinct(ID, Date, Priority, Revenue, .keep_all = TRUE)
#> ID Date Priority Revenue Code V1 V2 V3
#> 1 418 1/01/2020 1 -866 AB XX2 XX2 XX3
Created on 2021-05-17 by the reprex package (v2.0.0)
Upvotes: 0
Reputation: 223
Without more context it's hard to provide code which does exactly as required.
Based on your question, two options jump to mind.
Option 1. You want to rank AB the same as B (for example).
Option 2. You want to rank AB differently to B (for example).
Option 1 is obviously problematic since the final row you use will be based on the order it appears in the original dataset. Option 2 might be better if the Code column represents an error. For example, if system with ID 418 has error code A and B, this is worse than just error code B.
library(dplyr)
df_have <- data.frame(ID = c(418, 418, 418),
Date = c("1/01/2020", "1/01/2020","1/01/2020"),
Priority = c(1, 1, 1),
Revenue = c(-866, -866, -866),
Code = c("A", "AB", "A"),
V1 = c("XX3", "XX2", "XX3"),
V2 = c("XX1", "XX2", "XX1"),
V3 = c("XX3", "XX3", "XX3"))
# Option 1. Rank AB the same as B (for example)
df_want.1 <- df_have %>%
# add a numeric score based on the B > A > C ordering
mutate(score = case_when(
grepl("B", Code) ~ 3,
grepl("A", Code) ~ 2,
grepl("C", Code) ~ 1,
)) %>%
# group by Date, Priority, ID, Revenue (since you want the row with the highest code)
group_by(Date, Priority, ID, Revenue) %>%
# only keep the row for the group which has the highest score (or highest code)
filter(score == max(score)) %>%
# AB and B will both produce a score of 3, so we only keep one of the rows in the group
distinct(Date, Priority, ID, Revenue, .keep_all = TRUE) %>%
ungroup()
df_want.1
# Option 2. Rank AB above B (for example)
df_want.2 <- df_have %>%
# add a numeric score based on the B > A > C ordering
mutate(score_b = if_else(grepl("B", Code), 3, 0),
score_a = if_else(grepl("A", Code), 2, 0),
score_c = if_else(grepl("C", Code), 1, 0)) %>%
# group by Date, Priority, ID, Revenue (since you want the row with the highest code)
group_by(Date, Priority, ID, Revenue) %>%
# add each of the scores together
mutate(row_score = score_b + score_a + score_c) %>%
# only keep the row for the group which has the highest score (or highest code combination)
filter(row_score == max(row_score)) %>%
# assuming it's possible to have the same score across the group, only keep first row in the group
distinct(Date, Priority, ID, Revenue, .keep_all = TRUE) %>%
ungroup()
df_want.2
Upvotes: 0
Reputation: 26218
This will do
heirarchy
among the codes as per given conditionlarge_df_have <- read.table(text = ' ID Date Priority Revenue Code V1 V2 V3
1 418 1/01/2020 1 -866 A XX3 XX1 XX3
2 418 1/01/2020 1 -866 AB XX2 XX2 XX3
3 418 1/01/2020 1 -866 A XX3 XX1 XX3', header = T)
library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
str_detect(Code, 'A') ~ 2,
str_detect(Code, 'C') ~ 3,
TRUE ~ 4)) %>%
filter(priority_code == min(priority_code))
#> # A tibble: 1 x 9
#> # Groups: ID, Date, Priority, Revenue [1]
#> ID Date Priority Revenue Code V1 V2 V3 priority_code
#> <int> <chr> <int> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 418 1/01/2020 1 -866 AB XX2 XX2 XX3 1
Check it on more complex case
large_df_have <- read.table(text = ' ID Date Priority Revenue Code V1 V2 V3
1 418 1/01/2020 1 -866 A XX3 XX1 XX3
2 418 1/01/2020 1 -866 AB XX2 XX2 XX3
3 418 1/01/2020 1 -866 A XX3 XX1 XX3
4 419 1/01/2020 1 -866 C XX3 XX1 XX3
5 420 1/01/2020 1 -866 A XX3 XX1 XX3
6 420 1/01/2020 1 -866 C XX3 XX1 XX3', header = T)
library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
str_detect(Code, 'A') ~ 2,
str_detect(Code, 'C') ~ 3,
TRUE ~ 4)) %>%
filter(priority_code == min(priority_code))
#> # A tibble: 3 x 9
#> # Groups: ID, Date, Priority, Revenue [3]
#> ID Date Priority Revenue Code V1 V2 V3 priority_code
#> <int> <chr> <int> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 418 1/01/2020 1 -866 AB XX2 XX2 XX3 1
#> 2 419 1/01/2020 1 -866 C XX3 XX1 XX3 3
#> 3 420 1/01/2020 1 -866 A XX3 XX1 XX3 2
Created on 2021-05-17 by the reprex package (v2.0.0)
Upvotes: 3
Reputation: 1370
Depending on how many values of Code
you have, you may create an additional numeric column reflecting the hierachy (incase the non-alphabetical order causes confusion). Check out ?ifelse
if you aren't familiar, but the syntax is ifelse(test, yes, no)
whereby if a test
is TRUE
the value returned is that specified by yes
, otherwise it is no
.
large_df_have %>%
mutate(
Code2 = ifelse(Code == 'B', 1, NA), # for the first time we make the Code2 column, 'no' values need to be NA
Code2 = ifelse(Code == 'A', 2, Code2), # rather than the 'no' result being NA as above, we keep any pre-existing values, eg the ones we just made in the line above
Code2 = ifelse(Code == 'C', 3, Code2),
# and for your AB values (or others)
Code2 = ifelse(Code == 'AB', 1.5, Code2)
) %>%
# we create a group, of which we want the highest value of Code
group_by(date, priority, ID, revenue) %>%
# then we use filter() to keep the highest ranking rows for each group
filter(
Code2 == min(Code2)
)
Without your data I can't test it out, but this approach should work.
Upvotes: 0