Reputation: 275
This is a question about how to compare several columns of two different data frames with varying length.
I have two data frames (data from receiver1 (rec1) and receiver2 (rec2)) of different lengths containing positions for 4 different ships:
rec1 <- data.frame(name = sample(c("Nina", "Doug", "Alli", "Steve"), 20, replace = TRUE),
                   lon = sample(1:20),
                   lat = sample(1:10)
)
rec2 <- data.frame(name = sample(c("Nina", "Doug", "Alli", "Steve"), 30, replace = TRUE),
                   lon = sample(1:30),
                   lat = sample(1:30)
)
They contain varying names (ship names, same names for both) and longitude (lon) as well as latitude (lat) coordinates.
I am trying to compare the two data frames to see how many values in both "lon" AND "lat" match for each vessel (i.e. how often the two receivers picked up the same locations).
Basically, I am trying to find out how good each receiver is and how much of the data overlaps (e.g. as a percentage).
I am not sure how this is best done and am open to any suggestions. Thanks a lot!
Upvotes: 5
Views: 17856
Reputation: 96
The simplest way to make this comparison in base R is with merge.
Try this:
# Set the RNG so sample() produces the same output and this example is reproducible
set.seed(720)
rec1 <- data.frame(name = sample(c("Nina", "Doug", "Alli", "Steve"), 20, replace = TRUE),
                   lon = sample(1:20),
                   lat = sample(1:10)
)
rec2 <- data.frame(name = sample(c("Nina", "Doug", "Alli", "Steve"), 30, replace = TRUE),
                   lon = sample(1:30),
                   lat = sample(1:30)
)
# Keep only the rows where name, lat and lon all match in both data frames
merged <- merge(x = rec1,
                y = rec2,
                by = c("name", "lat", "lon"))
print(merged)
The merged data frame will contain all of the cases where all three columns match (in this case, one). You could then do something like table(merged$name) to count the number of times each name appears in the merged data.
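Since the question asks for a percentage, here is a minimal sketch of how the match counts could be turned into one per ship. The data values are made up for illustration, and the choice to count each distinct position only once (via unique()) is my assumption, not something the question specifies; without it, merge() would multiply rows that are repeated in both data frames.

```r
# Small hand-made data (hypothetical values, chosen so that some positions match)
rec1 <- data.frame(name = c("Nina", "Nina", "Doug", "Doug", "Alli"),
                   lon  = c(1, 2, 3, 4, 5),
                   lat  = c(11, 12, 13, 14, 15))
rec2 <- data.frame(name = c("Nina", "Nina", "Doug", "Alli", "Alli"),
                   lon  = c(1, 9, 3, 5, 6),
                   lat  = c(11, 19, 13, 15, 16))

# Drop repeated positions so merge() cannot multiply duplicated rows
u1 <- unique(rec1)
u2 <- unique(rec2)
matched <- merge(u1, u2, by = c("name", "lon", "lat"))

# Use a common set of levels so ships with zero matches still get a count of 0
ships   <- sort(unique(c(as.character(u1$name), as.character(u2$name))))
n_match <- table(factor(matched$name, levels = ships))
n_rec1  <- table(factor(u1$name, levels = ships))

# Percentage of each ship's distinct rec1 positions that rec2 also recorded
pct_rec1 <- 100 * as.vector(n_match) / as.vector(n_rec1)
names(pct_rec1) <- ships
print(pct_rec1)
```

Dividing by the rec2 counts instead gives the share from the other receiver's point of view.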
Though, your question leaves me wondering... there must be some sort of time element here, yes? If you include the measurement time in your data, you could merge by name and time, then calculate the measured lat and lon differences.
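A rough sketch of that idea, assuming both receivers log a shared timestamp: the time column and all values below are hypothetical, since the original data has no time information.

```r
# Hypothetical data with a `time` column (not present in the original question)
rec1 <- data.frame(name = c("Nina", "Nina", "Doug"),
                   time = c(1, 2, 1),
                   lon  = c(10.0, 10.5, 20.0),
                   lat  = c(50.0, 50.2, 60.0))
rec2 <- data.frame(name = c("Nina", "Nina", "Doug"),
                   time = c(1, 2, 2),
                   lon  = c(10.0, 10.7, 20.3),
                   lat  = c(50.0, 50.1, 60.1))

# Pair up the observations that share a ship name and a timestamp;
# the suffixes tell the two receivers' coordinate columns apart
paired <- merge(rec1, rec2, by = c("name", "time"),
                suffixes = c(".rec1", ".rec2"))

# Difference between the positions the two receivers reported
paired$lon_diff <- paired$lon.rec1 - paired$lon.rec2
paired$lat_diff <- paired$lat.rec1 - paired$lat.rec2
print(paired)
```

Rows with a difference of zero are exact agreements; in practice you would probably want a tolerance rather than exact equality.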
Edit:
I feel I would be remiss if I didn't mention the fabulous dplyr package, which makes analysis like this extremely simple. The above merge and count of unique name values is achieved with this simple one-liner:
library(dplyr)
inner_join(rec1, rec2, by = c("name", "lon", "lat")) %>% count(name)
Upvotes: 1
Reputation: 491
Here is a modified and reproducible test case together with my answer. I designed the test set to include combinations that will match and some that will not match.
rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5),
lon = rep.int(c(1:5), 4),
lat = rep.int(c(11:15), 4)
)
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7),
lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4)
)
print(rec1)
print(rec2)
#Merge the two data frames together, keeping only those combinations that match
m <- merge(rec1, rec2, by = c("shipName", "lon", "lat"), all = FALSE)
print(m)
If you want to count how many times each combination appears, try the following. (There are different ways to aggregate. Below is my preferred method, which requires you to have data.table installed. It's a great tool, so you may want to install it if you haven't yet.)
library(data.table)
#Convert to a data table and optionally set the sort key for faster processing
m <- data.table(m)
setkey(m, shipName, lon, lat)
#Aggregate and create a new column called "Count" with the number of
#observations in each group (.N)
m <- m[, j = list("Count" = .N), by = list(shipName, lon, lat)]
print(m)
#If you want to return to a standard data frame rather than a data table:
m <- data.frame(m)
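If you would rather not install data.table, the same count can probably be done in base R with aggregate(). This is only a sketch of one alternative, using the same test data as above; the helper column Count and the name m_counts are just for illustration.

```r
# Same test data as above
rec1 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 5),
                   lon = rep.int(c(1:5), 4),
                   lat = rep.int(c(11:15), 4))
rec2 <- data.frame(shipName = rep(c("Nina", "Doug", "Alli", "Steve"), each = 7),
                   lon = rep.int(c(2, 3, 4, 4, 5, 5, 6), 4),
                   lat = rep.int(c(12, 13, 14, 14, 15, 15, 16), 4))

# Keep only the combinations that appear in both data frames
m <- merge(rec1, rec2, by = c("shipName", "lon", "lat"), all = FALSE)

# Give every matched row a weight of 1, then sum the weights per group
m$Count <- 1
m_counts <- aggregate(Count ~ shipName + lon + lat, data = m, FUN = sum)
print(m_counts)
```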
Upvotes: 5
Reputation: 263332
You didn't construct a very useful test case, but here is an approach:
> both <- rbind(data.frame(grp="A", rec1[, 2:3]), data.frame(grp="B", rec2[, 2:3]))
> with(both, table( duplicated(both[,2:3]), grp))
       grp
         A  B
  FALSE 20 30
Upvotes: 2