V Lowe
V Lowe

Reputation: 1

subset a datatable using multiple criteria from second table

I have a very large datatable of ocean sites and multiple depths at each site (table 1). I need to extract rows matching site location and depth in another datatable (Table 2).

Table 1 - table to be subset

Lat Long Depth Nitrate
165 -77 0 29.5420
165 -77 50 30.2213
165 -77 100 29.2275
124 -46 0 27.8544
124 -46 50 28.6458
124 -46 100 24.9543
76 -24 0 31.9784
76 -24 50 28.6408
76 -24 100 24.9746
25 -62 0 31.9784
25 -62 50 28.6408
25 -62 100 24.9746

Table 2 - co-ordinates and depth needed for subsetting:

Lat Long Depth
165 -77 100
76 -24 50
25 -62 0

I have tried to get all sites in a table that would include all available depth data for those sites:

subset <- filter(table1, Lat == table2$Lat | Long == table2$Long)

but it returns zero obs.

Any suggestions?

Upvotes: 0

Views: 31

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388982

paste Lat and Long columns together of both the tables and select rows that match.

result <- subset(table1, paste(Lat, Long) %in% paste(table2$Lat, table2$Long))
result

Upvotes: 0

r2evans
r2evans

Reputation: 160447

It seems you are looking for an inner join:

merge(dat1, dat2, by = c("Lat", "Long"))
#   Lat Long Depth.x Nitrate Depth.y
# 1 165  -77     100 29.2275     100
# 2 165  -77       0 29.5420     100
# 3 165  -77      50 30.2213     100
# 4  25  -62       0 31.9784       0
# 5  25  -62      50 28.6408       0
# 6  25  -62     100 24.9746       0
# 7  76  -24       0 31.9784      50
# 8  76  -24      50 28.6408      50
# 9  76  -24     100 24.9746      50

There is some risk in this: joins and such rely on strict equality when comparing columns, but floating-point (with many digits of precision) can become too "fine" for most programming languages to detect differences (c.f., Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). The problem with this problem is that you will get no errors, it just won't produce matches.

To work around that problem, you will need to think about "tolerance", or the distance between points in dat1 and points in dat2, and what distance between all the points is effectively "close enough" to constitute a join. That can be done in one of two ways: (1) calculate the distance between all points in dat1 and all points in dat2, and taking the minimum for each in dat1 (and within a tolerance); or (2) do a "fuzzy join" (using the aptly named fuzzyjoin package) to find points that are within a range (effectively like dat1$Lat between dat2$Lat +/- 0.01 and similar for Long).


Data

dat1 <- structure(list(Lat = c(165L, 165L, 165L, 124L, 124L, 124L, 76L, 76L, 76L, 25L, 25L, 25L), Long = c(-77L, -77L, -77L, -46L, -46L, -46L, -24L, -24L, -24L, -62L, -62L, -62L), Depth = c(0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L), Nitrate = c(29.542, 30.2213, 29.2275, 27.8544, 28.6458, 24.9543, 31.9784, 28.6408, 24.9746, 31.9784, 28.6408, 24.9746)), class = "data.frame", row.names = c(NA, -12L))
dat2 <- structure(list(Lat = c(165L, 76L, 25L), Long = c(-77L, -24L, -62L), Depth = c(100L, 50L, 0L)), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 1

Related Questions