Kumar
Kumar

Reputation: 188

how to subset data using two other dataframes in R

I have a dataframe with ID's (100) and each ID has different number of rows and all ID's have same number of columns.

the sample dataframe looks like as follows

a <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2), 
              A = c(12,12.5,15,16,18,20,25,26,29,35, 12,12.5,15,16,18,20,25,26,29,35),
              B = c(20,19,18,17,16,20,25,28,30,35, 20,19,18,17,16,20,25,28,30,35),
              C = c(2,1,5,9,10,11,13,18,25,27,2,1,5,9,10,11,13,18,25,27))

in each ID I want to subset the data between two specified values, these two values are in two other dataframes respectively.

the first value is the first row of each ID and the sample dataframe is as follows

 b <- data.frame(ID = c(1,2), 
              A = c(12.0,12.0),
              B = c(20,20),
              C = c(2,2))

the second value is the specified row based on some value in a vector and the sample dataframe is as follows

c <- data.frame(ID = c(1,2), 
               A = c(25.0,20.0),
               B = c(25,20),
               C = c(13,11))

if we can observe, the rows corresponding to each ID, the values are same as in the main dataframe 'a'

the expected dataframe is as follows

d <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,2,2,2), 
              A = c(12,12.5,15,16,18,20,25, 12,12.5,15,16,18,20),
              B = c(20,19,18,17,16,20,25, 20,19,18,17,16,20),
              C = c(2,1,5,9,10,11,13,2,1,5,9,10,11))

to get the expected output, I have tried the following code.....but failed

for (i in 1:nrow(b)){
Azimuth[i] = (a[which(a$A == b$A[i]):which(a$A == c$A[i])])
}

here, I am trying to use two dataframes 'b' and 'c' to subset the data from 'a'. but is it possible to get the same output without using dataframe 'b'?! because in dataframe 'b', each row is the first row in each ID from dataframe 'a'

Upvotes: 0

Views: 243

Answers (1)

ekoam
ekoam

Reputation: 8844

A dplyr solution

library(dplyr)
a %>% 
  mutate(end = FALSE) %>% 
  rows_update(c %>% mutate(end = TRUE), by = c("ID", "A", "B", "C")) %>% 
  group_by(ID) %>% 
  slice(1:which(end)) %>% 
  select(-end)

Output:

# A tibble: 13 x 4
# Groups:   ID [2]
      ID     A     B     C
   <dbl> <dbl> <dbl> <dbl>
 1     1  12      20     2
 2     1  12.5    19     1
 3     1  15      18     5
 4     1  16      17     9
 5     1  18      16    10
 6     1  20      20    11
 7     1  25      25    13
 8     2  12      20     2
 9     2  12.5    19     1
10     2  15      18     5
11     2  16      17     9
12     2  18      16    10
13     2  20      20    11

Explanation:

I guess you want to use a dataframe to subset another one because you want to subset a only if there exist certain combinations of ID, A, B and C, which you specify in your dataframe c?

If that is the case, your goal can be achieved by taking the following steps:

  1. We create another logical variable in a. Call it end and default to FALSE.
  2. We also create the same variable in c but set its default to TRUE.
  3. We use end in c to update the end in a for each row marked by a combination of ID, A, B and C. In this way, the variable end will become TRUE only when there is a full match between a and c for the other four variables. If you cannot find a full match, then you will get this Error: Attempting to update missing rows.
  4. We group_by(ID) and select from the first row until where end is TRUE for each group defined by ID.
  5. You drop that end variable since it has no use any more.

Upvotes: 2

Related Questions