Reputation: 2899
I have a big data table that is rendered in a plotly plot, and the user can select points in the scatter plot with a region (rectangular or polygon). After processing this select element, I get a dataframe with a list of the 'cluster' each point in the selection belongs to and which 'nth' element (row) of that cluster it is, i.e.
cluster elementNr
1 1
1 23
2 7
2 9
2 14
This recreates the output:
selectme <- data.frame(cluster = c(rep(1,20), rep(2,30)), elementNr = c(sample(1:100, 20, replace = F),sample(1:100,30, replace =F)))
And this creates the dummy equivalent of my main data table
mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T), val1 = runif(5000, 0.1, 100), val2 = runif(5000, 0.1, 50))
What I am trying to figure out, is how to subset the data table with the selectme dataframe so that I get the right particles out of there. i.e. as the first example, I should end up with the 1st and 23rd row of the rows where cluster == 1, and the 7th, 9th, 14th row of those where cluster == 2
Expected output:
Cluster val1 val2
1 0.45 2.54 (1st of this cluster
1 25.73 84.83 (23rd element of this cluster
2 0.26 12.76 (7th el)
2 3.16 26.84 (9th el)
2 0.76 19.46 (14th el)
So 7th element is not the 7th row of mydata, but the 7th row of the subset where cluster ==2 And because my data can contain dozens of clusters I would like to put the following logical sentence into code: "For each cluster in selectme, find the rows of the subset of mydata where cluster equals the same cluster and grab the rows that equal the elementnrs of selectme for the current cluster, and produce one output containing all the corresponding rows"
I can transfer myselection to data.table first if needed of course
UPDATE: real data added
Real data file, real selectme file, and expected output file in RDS form are found here: datafiles
Upvotes: 1
Views: 111
Reputation: 25208
After creating an index for each row, you can do an inner join using the data.table as follows
set.seed(8L)
selectme <- data.table(cluster = c(rep(1L,20), rep(2L,30)), elementNr = c(sample(1:100, 20, replace = F),sample(1:100,30, replace =F)))
mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T), val1 = runif(5000, 0.1, 100), val2 = runif(5000, 0.1, 50))
selectme[mydata[, rn := .I, by=.(cluster)],
on=c("cluster", "elementNr"="rn"), nomatch=0L]
Upvotes: 0
Reputation: 7630
Using the data you provided:
mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T),
val1 = runif(5000, 0.1, 100),
val2 = runif(5000, 0.1, 50))
selectme <- data.frame(cluster = c(rep(1,20), rep(2,30)),
elementNr = c(sample(1:100, 20, replace = F),
sample(1:100,30, replace =F)))
# One at a time:
mydata[cluster == 1][selectme[selectme$cluster == 1,]$elementNr]
# cluster val1 val2
# 1: 1 92.7082853 34.879758
# 2: 1 6.2921762 19.203557
# 3: 1 32.5258127 3.731329
# 4: 1 93.5793112 34.937362
# 5: 1 62.0401740 42.491657
# and so on...
# Programmatically
clusters <- unique(selectme$cluster)
tobind <- vector("list", length(clusters))
for (i in seq_along(tobind)){
tobind[[i]] <- mydata[cluster == clusters[i]][selectme[selectme$cluster == clusters[i],]$elementNr]
}
selected <- rbindlist(tobind)
Upvotes: 2