Reputation: 257
Suppose I have the following data:
dt <- data.frame(id=c(1,1,2,2,3,3,3,4,5,5,5,5,6,7,7),
rk=c("a","a","b","b","c","y","c","d","e","y","e","e","f","g","h"),
.id=c("df1", "df9", "df5", "df16", "df2", "df11", "df11", "df4", "df9", "df4", "df6", "df3", "df16", "df2", "df9"))
So my data looks like this:
id rk .id
1 a df1
1 a df9
2 b df5
2 b df16
3 c df2
3 y df11
3 c df11
4 d df4
5 e df9
5 y df4
5 e df6
5 e df3
6 f df16
7 g df2
7 h df9
But I only want one row per pair of id and rk. So in the example id=5 can have two rows: one with rk=e and one with rk=y.
To find the right row to keep I look to the .id column. Here I prioritize in the following order:
df2, df9, df1, df5, df4, df6, df15, df17, df16, df14, df8, df11, df3, df7, df12, df13, df10
So I would always keep a row with .id=df2 over a row with .id=df9. Likewise I would always keep a row with .id=df15 over a row with .id=df14.
Note, that the order is not chronological.
Back to my example data, this is what I would like to end up with:
id rk .id
1 a df9
2 b df5
3 c df2
3 y df11
4 d df4
5 e df9
5 y df4
6 f df16
7 g df2
7 h df9
My dataset is huge, so I hope some of you can help me writing some code that makes this easy peasy.
Upvotes: 4
Views: 140
Reputation: 4773
I would do it with data.table
like this. Looks slightly long but fairly intuitive.
library(data.table)
# Load datasets
dt <- data.frame(id=c(1,1,2,2,3,3,3,4,5,5,5,5,6,7,7),
rk=c("a","a","b","b","c","y","c","d","e","y","e","e","f","g","h"),
.id=c("df1", "df9", "df5", "df16", "df2", "df11", "df11", "df4", "df9", "df4", "df6", "df3", "df16", "df2", "df9"))
Priority_List <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
"df14", "df6", "df8", "df11", "df3", "df7", "df12", "df13", "df10")
# Create a data table called priority list with the priority rank
Priority_List <- data.table(.id = Priority_List , Priority = 1:length(Priority_List))
# Convert your parent data.frame into data.table
dt <- data.table(dt)
# Merge the Priority List with dt based on .id
dt <- merge(dt,Priority_List, by =c(".id"), all.x = TRUE)
# Find the minimum priority for each id and rk
dt <- dt[, Min_Priority := min(Priority), by = c("id", "rk")]
# Filter when Priority is equal to the Min_Priority for a particular id, rk
dt <- dt[Min_Priority == Priority]
# Take unique in case there are duplicate rows.
dt <- unique(dt)
# Remove unwanted columns and order based on id and rk
dt <- dt[,.(id, rk, .id)][order(id, rk)]
Hope this helps.
Upvotes: 0
Reputation: 389047
With dplyr
we can group_by
id
and rk
and get the first match
of .id
with new_order
.
library(dplyr)
dt %>%
group_by(id, rk) %>%
summarise(.id = .id[which.min(match(.id, new_order))])
# id rk .id
# <dbl> <fct> <fct>
# 1 1.00 a df9
# 2 2.00 b df5
# 3 3.00 c df2
# 4 3.00 y df11
# 5 4.00 d df4
# 6 5.00 e df9
# 7 5.00 y df4
# 8 6.00 f df16
# 9 7.00 g df2
#10 7.00 h df9
Equivalent, base R aggregate
option is
aggregate(.id~id+rk, dt, function(x) x[which.min(match(x, new_order))])
If there are some other columns which we want to keep, we can use filter
instead of summarise
dt %>%
group_by(id, rk) %>%
filter(.id == .id[which.min(match(.id, new_order))])
whose equivalent ave
option would be
dt[with(dt, .id == ave(.id, id, rk, FUN = function(x)
x[which.min(match(x, new_order))])), ]
where ,
new_order <- c("df2", "df9", "df1", "df5", "df4", "df6", "df15", "df17", "df16",
"df14", "df6", "df8", "df11", "df3", "df7", "df12", "df13", "df10")
Upvotes: 6