user3071282
user3071282

Reputation: 33

Compare two dataframes to find missing values within a range

I have the following 2 data.frames:

a1 <- data.frame(count = 1:10, site = "a")
a2 <-data.frame(count = 2:8, site = "b")
a4 <- data.frame(count = c(3,6,7,9), site = "a")
a5 <- data.frame(count = c(3,5,7), site = "b")
data1 <- rbind(a1, a2)
data2 <- rbind(a4, a5)

I want to generate a dataframe listing the rows in data1 that are not found in data2 based on the column "count". There are many ways to do this such as:

method1 <- rbind(data1, data2)
method1[!duplicated(method1,fromLast = FALSE)&!duplicated(method1,fromLast = TRUE),] 

Which returns:

   count site
1      1    a
2      2    a
4      4    a
5      5    a
8      8    a
10    10    a
11     2    b
13     4    b
15     6    b
17     8    b

Or something like this (as a side question, this line of code isn't working perfectly and I would appreciate any insight because I have been using this command in several other data processing steps):

method2 <- data1[!data1$count %in% data2$count,]

Anyways, the main question is: How can I return the missing rows only within the range of counts in data2? I.e., I want the output to be:

   count site
4      4    a
5      5    a
8      8    a
13     4    b
15     6    b

I have hundreds of dataframes, each containing many sites with highly variable ranges.

This is my first time posting, thanks for taking the time to read my question and I appreciate any suggested solutions.

Upvotes: 3

Views: 3695

Answers (2)

sparrow
sparrow

Reputation: 1101

I think his does what you want "listing the rows in data1 that are not found in data2 based on the column "count":

method2 <- data1[!data1$count %in% data2$count,]
result <- subset(method2, method2$count > min(data2$count) & 
                 method2$count < max(data2$count))

Your code

method2 <- data1[!data1$count %in% data2$count,]

does what it is supposed to do, but it seems like that you not only want to exclude variables based on count but based on a combination of count and site...?

EDIT: OK, so it seems that you do want to take into account site as well, and not just count. In this case, is this what you want?

method2 <- data1[! interaction(data1$count, data1$site) %in%
                     interaction(data2$count,data2$site),]
result <- subset(method2, method2$count > min(data2$count) & 
               method2$count < max(data2$count))

EDIT2: Now the values are within the range defined by each site.

method2[apply(method2, 1, function(x)  
                 as.numeric(x["count"]) > min(data2$count[data2$site == x["site"]]) &
                 as.numeric(x["count"]) < max(data2$count[data2$site == x["site"]]) )
        ,]

Upvotes: 2

Could you do something like:

data1[ setdiff(data1$count, data2$count),]

Doesn't this give the rows in data1 that are not in data2?

Upvotes: 0

Related Questions