Reputation: 1
I have a very large datatable of ocean sites and multiple depths at each site (table 1). I need to extract rows matching site location and depth in another datatable (Table 2).
Table 1 - table to be subset
Lat | Long | Depth | Nitrate |
---|---|---|---|
165 | -77 | 0 | 29.5420 |
165 | -77 | 50 | 30.2213 |
165 | -77 | 100 | 29.2275 |
124 | -46 | 0 | 27.8544 |
124 | -46 | 50 | 28.6458 |
124 | -46 | 100 | 24.9543 |
76 | -24 | 0 | 31.9784 |
76 | -24 | 50 | 28.6408 |
76 | -24 | 100 | 24.9746 |
25 | -62 | 0 | 31.9784 |
25 | -62 | 50 | 28.6408 |
25 | -62 | 100 | 24.9746 |
Table 2 - co-ordinates and depth needed for subsetting:
Lat | Long | Depth |
---|---|---|
165 | -77 | 100 |
76 | -24 | 50 |
25 | -62 | 0 |
I have tried to get all sites in a table that would include all available depth data for those sites:
subset <- filter(table1, Lat == table2$Lat | Long == table2$Long)
but it returns zero obs.
Any suggestions?
Upvotes: 0
Views: 31
Reputation: 388982
paste
Lat
and Long
columns together of both the tables and select rows that match.
result <- subset(table1, paste(Lat, Long) %in% paste(table2$Lat, table2$Long))
result
Upvotes: 0
Reputation: 160447
It seems you are looking for an inner join:
merge(dat1, dat2, by = c("Lat", "Long"))
# Lat Long Depth.x Nitrate Depth.y
# 1 165 -77 100 29.2275 100
# 2 165 -77 0 29.5420 100
# 3 165 -77 50 30.2213 100
# 4 25 -62 0 31.9784 0
# 5 25 -62 50 28.6408 0
# 6 25 -62 100 24.9746 0
# 7 76 -24 0 31.9784 50
# 8 76 -24 50 28.6408 50
# 9 76 -24 100 24.9746 50
There is some risk in this: joins and such rely on strict equality when comparing columns, but floating-point (with many digits of precision) can become too "fine" for most programming languages to detect differences (c.f., Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). The problem with this problem is that you will get no errors, it just won't produce matches.
To work around that problem, you will need to think about "tolerance", or the distance between points in dat1
and points in dat2
, and what distance between all the points is effectively "close enough" to constitute a join. That can be done in one of two ways: (1) calculate the distance between all points in dat1
and all points in dat2
, and taking the minimum for each in dat1
(and within a tolerance); or (2) do a "fuzzy join" (using the aptly named fuzzyjoin
package) to find points that are within a range (effectively like dat1$Lat between dat2$Lat +/- 0.01
and similar for Long
).
Data
dat1 <- structure(list(Lat = c(165L, 165L, 165L, 124L, 124L, 124L, 76L, 76L, 76L, 25L, 25L, 25L), Long = c(-77L, -77L, -77L, -46L, -46L, -46L, -24L, -24L, -24L, -62L, -62L, -62L), Depth = c(0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L), Nitrate = c(29.542, 30.2213, 29.2275, 27.8544, 28.6458, 24.9543, 31.9784, 28.6408, 24.9746, 31.9784, 28.6408, 24.9746)), class = "data.frame", row.names = c(NA, -12L))
dat2 <- structure(list(Lat = c(165L, 76L, 25L), Long = c(-77L, -24L, -62L), Depth = c(100L, 50L, 0L)), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 1