I have a very large datatable of ocean sites and multiple depths at each site (table 1). I need to extract rows matching site location and depth in another datatable (Table 2). Table 1 - table to be subset Lat Long Depth Nitrate 165 -77 0 29.5420 165 -77 50 30.2213 165 -77 100 29.2275 124 -46 0 27.8544 124 -46 50 28.6458 124 -46 100 24.9543 76 -24 0 31.9784 76 -24 50 28.6408 76 -24 100 24.9746 25 -62 0 31.9784 25 -62 50 28.6408 25 -62 100 24.9746 Table 2 - co-ordinates and depth needed for subsetting: Lat Long Depth 165 -77 100 76 -24 50 25 -62 0 I have tried to get all sites in a table that would include all available depth data for those sites: subset <- filter(table1, Lat == table2$Lat | Long == table2$Long) but it returns zero obs. Any suggestions?

Reputation: 1

subset a datatable using multiple criteria from second table

I have a very large datatable of ocean sites and multiple depths at each site (table 1). I need to extract rows matching site location and depth in another datatable (Table 2).

Table 1 - table to be subset

Lat	Long	Depth	Nitrate
165	-77	0	29.5420
165	-77	50	30.2213
165	-77	100	29.2275
124	-46	0	27.8544
124	-46	50	28.6458
124	-46	100	24.9543
76	-24	0	31.9784
76	-24	50	28.6408
76	-24	100	24.9746
25	-62	0	31.9784
25	-62	50	28.6408
25	-62	100	24.9746

Table 2 - co-ordinates and depth needed for subsetting:

Lat	Long	Depth
165	-77	100
76	-24	50
25	-62	0

I have tried to get all sites in a table that would include all available depth data for those sites:

subset <- filter(table1, Lat == table2$Lat | Long == table2$Long)

but it returns zero obs.

Any suggestions?

Upvotes: 0

Answers (2)

Ronak Shah

Reputation: 388982

paste Lat and Long columns together of both the tables and select rows that match.

result <- subset(table1, paste(Lat, Long) %in% paste(table2$Lat, table2$Long))
result

Upvotes: 0

r2evans

Reputation: 160447

It seems you are looking for an inner join:

merge(dat1, dat2, by = c("Lat", "Long"))
#   Lat Long Depth.x Nitrate Depth.y
# 1 165  -77     100 29.2275     100
# 2 165  -77       0 29.5420     100
# 3 165  -77      50 30.2213     100
# 4  25  -62       0 31.9784       0
# 5  25  -62      50 28.6408       0
# 6  25  -62     100 24.9746       0
# 7  76  -24       0 31.9784      50
# 8  76  -24      50 28.6408      50
# 9  76  -24     100 24.9746      50

There is some risk in this: joins and such rely on strict equality when comparing columns, but floating-point (with many digits of precision) can become too "fine" for most programming languages to detect differences (c.f., Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). The problem with this problem is that you will get no errors, it just won't produce matches.

To work around that problem, you will need to think about "tolerance", or the distance between points in dat1 and points in dat2, and what distance between all the points is effectively "close enough" to constitute a join. That can be done in one of two ways: (1) calculate the distance between all points in dat1 and all points in dat2, and taking the minimum for each in dat1 (and within a tolerance); or (2) do a "fuzzy join" (using the aptly named fuzzyjoin package) to find points that are within a range (effectively like dat1$Lat between dat2$Lat +/- 0.01 and similar for Long).

Data

dat1 <- structure(list(Lat = c(165L, 165L, 165L, 124L, 124L, 124L, 76L, 76L, 76L, 25L, 25L, 25L), Long = c(-77L, -77L, -77L, -46L, -46L, -46L, -24L, -24L, -24L, -62L, -62L, -62L), Depth = c(0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L), Nitrate = c(29.542, 30.2213, 29.2275, 27.8544, 28.6458, 24.9543, 31.9784, 28.6408, 24.9746, 31.9784, 28.6408, 24.9746)), class = "data.frame", row.names = c(NA, -12L))
dat2 <- structure(list(Lat = c(165L, 76L, 25L), Long = c(-77L, -24L, -62L), Depth = c(100L, 50L, 0L)), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 1

subset a datatable using multiple criteria from second table

Answers (2)

Related Questions