Mark
Mark

Reputation: 2899

Data table subset using a condition, and only specific rows from that subset

I have a big data table that is rendered in a plotly plot, and the user can select points in the scatter plot with a region (rectangular or polygon). After processing this select element, I get a dataframe with a list of the 'cluster' each point in the selection belongs to and which 'nth' element (row) of that cluster it is, i.e.

cluster    elementNr    
1             1    
1             23    
2             7
2             9
2             14

This recreates the output:

selectme <- data.frame(cluster = c(rep(1,20), rep(2,30)), elementNr = c(sample(1:100, 20, replace = F),sample(1:100,30, replace =F)))

And this creates the dummy equivalent of my main data table

mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T), val1 = runif(5000, 0.1, 100), val2 = runif(5000, 0.1, 50))

What I am trying to figure out, is how to subset the data table with the selectme dataframe so that I get the right particles out of there. i.e. as the first example, I should end up with the 1st and 23rd row of the rows where cluster == 1, and the 7th, 9th, 14th row of those where cluster == 2

Expected output:

Cluster   val1    val2
 1        0.45    2.54     (1st of this cluster 
 1        25.73   84.83  (23rd element of this cluster 
 2        0.26    12.76   (7th el)
 2        3.16    26.84   (9th el)
 2        0.76    19.46   (14th el)

So 7th element is not the 7th row of mydata, but the 7th row of the subset where cluster ==2 And because my data can contain dozens of clusters I would like to put the following logical sentence into code: "For each cluster in selectme, find the rows of the subset of mydata where cluster equals the same cluster and grab the rows that equal the elementnrs of selectme for the current cluster, and produce one output containing all the corresponding rows"

I can transfer myselection to data.table first if needed of course

UPDATE: real data added

Real data file, real selectme file, and expected output file in RDS form are found here: datafiles

Upvotes: 1

Views: 111

Answers (2)

chinsoon12
chinsoon12

Reputation: 25208

After creating an index for each row, you can do an inner join using the data.table as follows

set.seed(8L)
selectme <- data.table(cluster = c(rep(1L,20), rep(2L,30)), elementNr = c(sample(1:100, 20, replace = F),sample(1:100,30, replace =F)))
mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T), val1 = runif(5000, 0.1, 100), val2 = runif(5000, 0.1, 50))

selectme[mydata[, rn := .I, by=.(cluster)],
    on=c("cluster", "elementNr"="rn"), nomatch=0L]

Upvotes: 0

De Novo
De Novo

Reputation: 7630

Using the data you provided:

mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T), 
                     val1 = runif(5000, 0.1, 100), 
                     val2 = runif(5000, 0.1, 50))

selectme <- data.frame(cluster = c(rep(1,20), rep(2,30)), 
                       elementNr = c(sample(1:100, 20, replace = F),
                       sample(1:100,30, replace =F)))

# One at a time:
mydata[cluster == 1][selectme[selectme$cluster == 1,]$elementNr]
#    cluster       val1      val2
# 1:       1 92.7082853 34.879758
# 2:       1  6.2921762 19.203557
# 3:       1 32.5258127  3.731329
# 4:       1 93.5793112 34.937362
# 5:       1 62.0401740 42.491657
# and so on...

# Programmatically

clusters <- unique(selectme$cluster)
tobind <- vector("list", length(clusters))
for (i in seq_along(tobind)){
  tobind[[i]] <- mydata[cluster == clusters[i]][selectme[selectme$cluster == clusters[i],]$elementNr]
}

selected <- rbindlist(tobind)

Upvotes: 2

Related Questions