Milaa

Reputation: 419

Subset the data frames in a list based on minimum distance to a reference data frame in R

I have a list of three data frames. First, I want to select the data frame with the fewest rows as my reference frame. Then I want to subset the other data frames based on the minimum distance from the values of the reference data frame. Here is an example:

Data

a <- data.frame(name = c("a1", "a2", "a3", "a4"), x = c(10, 15, 59, 21), y = c(12, 16, 20, 30))
b <- data.frame(name = c("b1", "b2", "b3", "b4", "b5"), x = c(8, 9, 2, -1, 13), y = c(7, 1, 5, 10, 0))
c <- data.frame(name = c("c1", "c2", "c3", "c4", "c5", "c6", "c7"), x = c(1, 5, 6, 2, 3, 10, -8), y = c(2, -3, 7, 4, 6, 15, 8))
all <- list(a = a, b = b, c = c)

Here a is chosen as the reference because it has the fewest rows (nrow = 4). Now I want to compute the distances between every row of a and every row of b, as follows:

a1b1, a1b2, a1b3, a1b4,a1b5     
a2b1, a2b2, a2b3, a2b4,a2b5
a3b1, a3b2, a3b3, a3b4,a3b5
a4b1, a4b2, a4b3, a4b4,a4b5

For each row of this layout, the row of b with the minimum distance is added to a subset of b called sub_b, as follows:

Expected Result

> sub_b
  name x y
1   b1 8 7
2   b3 2 5
3   b1 8 7
4   b3 2 5

Similarly, compute the distances between a and c, then subset c based on the minimum distance:

# a1c1, a1c2, a1c3, a1c4,a1c5, a1c6, a1c7
# a2c1, a2c2, a2c3, a2c4,a2c5, a2c6, a2c7
# a3c1, a3c2, a3c3, a3c4,a3c5, a3c6, a3c7
# a4c1, a4c2, a4c3, a4c4,a4c5, a4c6, a4c7

The sub_c data frame should then be:

Expected Result

> sub_c
  name x y
1   c3 6 7
2   c5 3 6
3   c3 6 7
4   c5 3 6

Finally, the new list is new.all <- list(a = a, sub_b = sub_b, sub_c = sub_c).

Here is My Code

library(geosphere)

# Set the reference frame: the data frame with the fewest rows
lessRow <- which.min(sapply(all, nrow))

# Convert the coordinate columns to two-column matrices (x, y)
A <- cbind(a$x, a$y)
B <- cbind(b$x, b$y)
C <- cbind(c$x, c$y)

# Full pairwise distance matrices (distGeo treats x as longitude, y as latitude)
dis.ab <- distm(A, B, fun = distGeo)
dis.ac <- distm(A, C, fun = distGeo)

# For each point in a, find the index of the closest point in b
minm.ab <- apply(A, 1, function(x) {
  dm <- distm(x, B, fun = distGeo)
  which.min(dm)
})
# For each point in a, find the index of the closest point in c
minm.ac <- apply(A, 1, function(x) {
  dm <- distm(x, C, fun = distGeo)
  which.min(dm)
})

# Subset based on the minimum distance
sub_b <- b[minm.ab, ]
sub_c <- c[minm.ac, ]

# Create a new list of data frames, keeping the reference frame (a) as it is
new.all <- list(a = a, sub_b = sub_b, sub_c = sub_c)

The question is how to do this in a loop, since the real number of data frames is more than 3.

Upvotes: 0

Views: 62

Answers (1)

Ronak Shah

Reputation: 388862

We can separate the reference data frame from the remaining data frames based on the number of rows. Then, for each row of the reference data frame, calculate the distance to every row of a remaining data frame, take the index of the minimum distance, and use those indices to subset that data frame. Doing this for each remaining data frame gives a list of data frames.

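library(geosphere)

# Index of the data frame with the fewest rows (the reference)
inds <- which.min(sapply(all, nrow))

ref <- all[[inds]]
remaining <- all[-inds]

# For each remaining data frame, keep the row closest to each reference row
# (the first column, name, is dropped before computing distances)
output <- lapply(remaining, function(x) {
  x[apply(ref[-1], 1, function(y) {
    which.min(distm(y, as.matrix(x[-1]), fun = distGeo))
  }), ]
})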
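
Note that distGeo interprets the two coordinate columns as longitude and latitude in degrees. If x and y are plain planar coordinates instead, the same loop could use Euclidean distance in base R. The following is only a sketch under that assumption: nearest_rows and output_euclid are hypothetical names, not part of geosphere, and the selected rows can differ from what distGeo selects.

```r
# Example data from the question
a <- data.frame(name = c("a1", "a2", "a3", "a4"),
                x = c(10, 15, 59, 21), y = c(12, 16, 20, 30))
b <- data.frame(name = c("b1", "b2", "b3", "b4", "b5"),
                x = c(8, 9, 2, -1, 13), y = c(7, 1, 5, 10, 0))
c <- data.frame(name = c("c1", "c2", "c3", "c4", "c5", "c6", "c7"),
                x = c(1, 5, 6, 2, 3, 10, -8), y = c(2, -3, 7, 4, 6, 15, 8))
all <- list(a = a, b = b, c = c)

# Hypothetical helper: for each row of `ref`, return the row of `other`
# that is closest in Euclidean distance over the columns named in `cols`.
nearest_rows <- function(other, ref, cols = c("x", "y")) {
  R <- as.matrix(ref[cols])
  O <- as.matrix(other[cols])
  idx <- apply(R, 1, function(p) {
    # t(O) has one column per point, so subtracting p recycles it per column;
    # squared distances are enough to find the minimum
    which.min(colSums((t(O) - p)^2))
  })
  other[idx, , drop = FALSE]
}

# Reference = data frame with the fewest rows (which.min keeps the first on ties)
inds <- which.min(sapply(all, nrow))
output_euclid <- lapply(all[-inds], nearest_rows, ref = all[[inds]])
```

Because the helper takes the data frame to subset as its first argument, it drops straight into lapply regardless of how many data frames the list holds.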

Combined list of data frames:

c(list(ref), output)
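
To get the element names asked for in the question (a, sub_b, sub_c), the list can be renamed before combining. This is a small sketch with stand-in objects shaped like all, inds and output above. The placeholder subset is arbitrary and only there to make the snippet run on its own.

```r
# Stand-in objects with the same shape as `all`, `inds` and `output`
all <- list(
  a = data.frame(name = c("a1", "a2"), x = c(10, 15), y = c(12, 16)),
  b = data.frame(name = c("b1", "b2", "b3"), x = c(8, 9, 2), y = c(7, 1, 5))
)
inds <- which.min(sapply(all, nrow))
output <- list(b = all$b[c(1, 1), ])  # placeholder for the nearest-row subset

# Prefix the subset names and keep the reference under its original name;
# all[inds] (single bracket) returns a named one-element list
names(output) <- paste0("sub_", names(output))
new.all <- c(all[inds], output)
names(new.all)
```

Using all[inds] rather than list(ref) keeps the reference data frame's name in the combined list.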

Upvotes: 3
