Reputation: 185
I have a dataset with multiple duplicate IDs which have different categorical values. Following is an example data set.
suppressMessages(library(dplyr))
DUMMY_DATA <- data.frame(ID = c(11,22,22,33,33,33,44,44,55,55,55,55),
CATEGORY1 = c("E","B","C","C","C","D","A","A","B","C","E","B"),
CATEGORY2 = c ("AA","AA","BB","CC","DD","BB","AA","EE","AA","CC","BB","EE"),
stringsAsFactors = FALSE)
> DUMMY_DATA
ID CATEGORY1 CATEGORY2
1 11 E AA
2 22 B AA
3 22 C BB
4 33 C CC
5 33 C DD
6 33 D BB
7 44 A AA
8 44 A EE
9 55 B AA
10 55 C CC
11 55 E BB
12 55 B EE
I want to aggregate values of ID from another dataset which gives the rank of categorical values. AS follows.
Category_Rank1 <- data.frame(VAR = c("A","B","C","D","E"),
RANK = c(1,2,3,4,5),stringsAsFactors = FALSE
)
> Category_Rank1
VAR RANK
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
Category_Rank2 <- data.frame(VAR = c("AA","BB","CC","DD","EE"),
RANK = c(1,2,3,4,5),stringsAsFactors = FALSE
)
> Category_Rank2
VAR RANK
1 AA 1
2 BB 2
3 CC 3
4 DD 4
5 EE 5
For each group of IDs from DUMMY_DAT I want to look up the Category_Rank and then alot that category to the ID which has the best rank. Following is my solution.
hierarchyTransform <- function(x,dataset){
x <- unique(x)
dataset <- dataset%>%
filter(dataset[,1] %in% x)
dataset <- dataset%>%
filter(dataset[,2] == min(dataset[,2]))
return(dataset[1,1])
}
NEW_DATA <- DUMMY_DATA%>%
group_by(ID)%>%
summarise(CATEGORY1_CLEAN = hierarchyTransform(x=CATEGORY1,
dataset = Category_Rank1),
CATEGORY2_CLEAN = hierarchyTransform(x=CATEGORY2,
dataset = Category_Rank2))
I get the Following Result.
> NEW_DATA
# A tibble: 5 × 3
ID CATEGORY1_CLEAN CATEGORY2_CLEAN
<dbl> <chr> <chr>
1 11 E AA
2 22 B AA
3 33 C BB
4 44 A AA
5 55 B AA
This is exactly what I want but the problem is time taken for this operation. My Original Data set has around 1 million rows and when I group it based on ID I get about 200,000 groups. So the hierarchyTransform function is applied for 200,000 groups which takes about 15 mins for a single variable and I have to perform this operation for 10 other variables which increases the time. Is there any solution to reduce the time taken for this operation.
Upvotes: 1
Views: 472
Reputation: 93861
If you know the rank order of the levels of CATEGORY
(which is alphabetic in your example) then you can turn CATEGORY
into a factor with the levels ordered according to the desired ranking. Then sort by CATEGORY
, group by ID
, and take the first row for each ID
.
DUMMY_DATA$CATEGORY = factor(DUMMY_DATA$CATEGORY, levels=LETTERS[1:5], ordered=TRUE)
DUMMY_DATA %>%
arrange(ID, CATEGORY) %>%
group_by(ID) %>%
slice(1)
ID CATEGORY 1 11 E 2 22 B 3 33 C 4 44 A 5 55 B
UPDATE: To respond to your comment and updated question: The code below will, for each ID
, select the value of highest rank from each category column.
DUMMY_DATA$CATEGORY1 = factor(DUMMY_DATA$CATEGORY1, levels=LETTERS[1:5], ordered=TRUE)
DUMMY_DATA$CATEGORY2 = factor(DUMMY_DATA$CATEGORY2, levels=c("AA","BB","CC","DD","EE"), ordered=TRUE)
Now you can do either of the following:
DUMMY_DATA %>% group_by(ID) %>%
summarise(CATEGORY1 = min(CATEGORY1),
CATEGORY2 = min(CATEGORY2))
DUMMY_DATA %>% group_by(ID) %>%
summarise_all(funs(min))
ID CATEGORY1 CATEGORY2 1 11 E AA 2 22 B AA 3 33 C BB 4 44 A AA 5 55 B AA
Upvotes: 2