Reputation: 1237
I have this dataset:
library(data.table)
dt <- data.table(
record=c(1:20),
area=rep(LETTERS[1:4], c(4, 6, 3, 7)),
score=c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
cluster=c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)
and I have used the solution from this post to create this summary:
dt_summary =
dt[ , .N, keyby = .(area, score, cluster)
][ , {
idx = frank(-N, ties.method = 'min') == 1
NN = sum(N)
.(
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]
dt_score_1 <- dt_summary[score == 1]
setnames(dt_score_1, "area", "zone")
I would like to use the results from dt_score_1
to filter dt
based on the area/zone and cluster/cluster_mode. So in a new data.table, the only rows taken from dt
for area A should belong to cluster X, for area D they should be cluster Z etc.
Upvotes: 3
Views: 2151
Reputation: 28675
If I'm understanding the question correctly, this is a merge of dt
with dt_score_1
with the conditions area = zone, cluster = cluster_mode
.
dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]
# record area score cluster i.score cluster_pct cluster_freq record_freq
# 1: 1 A 1 X 1 100.00000 2 2
# 2: 2 A 1 X 1 100.00000 2 2
# 3: 7 B 1 X 1 66.66667 2 3
# 4: 8 B 1 X 1 66.66667 2 3
# 5: 11 C 2 X 1 100.00000 1 1
# 6: 12 C 1 X 1 100.00000 1 1
# 7: 14 D 1 Z 1 80.00000 4 5
# 8: 15 D 1 Z 1 80.00000 4 5
# 9: 16 D 1 Z 1 80.00000 4 5
# 10: 17 D 1 Z 1 80.00000 4 5
# 11: 20 D 3 Z 1 80.00000 4 5
For a more detailed explanation of join-as-filter, see the link below posted by @Frank
Perform a semi-join with data.table
Upvotes: 5