etotheeyepi
etotheeyepi

Reputation: 35

Join a value onto an existing data.table without an exact match (matching within an interval)

I have a column in a data.table (DT1) that is a decreasing count (let's say pol_count), and another column for the average age in a population (say AverageAge). I am trying to take my value of pol_count (say 400) and my value of AverageAge (say 85) and match them to an interval within a table called FactorFile.

Here is a glimpse of FactorFile:

Count_Greater_Than  Age_Less_Than   Months  Factor
100                 80              12      1
85                  82              16      0.85
65                  84              20      0.65
45                  86              24      0.45

In this instance, pol_count is > 100 so it would return the third column's (Months) value of 12, and AverageAge is < 86 but > 84 so it would return 20. I then need another column to take the max, so final answer is 20. Finally, a column for the Factor associated with 20 which is 65%.

I'm not sure how to join on a table without using exact matches. Basically I want a VLOOKUP with the last argument as TRUE, to pull the closest value to the supplied lookup value.

I've tried using the package fuzzyjoin, here is my statement. It runs, but nothing seems to happen to my data.table:

fuzzy_left_join(DT1, FactorFile, by = c("AverageAge" = "Average_Age_Less_Than"), match_fun = `<=`)

Any tips on how to use fuzzyjoin correctly or an easier way to make this join happen? To summarize, I'm trying to take two columns in DT1, and return both a Months column and Factor column onto DT1, by joining on FactorFile.

Thanks!

Upvotes: 1

Views: 114

Answers (1)

chinsoon12
chinsoon12

Reputation: 25225

Maybe something like this:

DT1[, c("M1", "F1") := 
    FactorFile[.SD, on=.(Count_Greater_Than=pol_count), roll=Inf, .(Months, Factor)]
]

DT1[, c("M2", "F2") := 
    FactorFile[.SD, on=.(Age_Less_Than=AverageAge), roll=Inf, .(Months, Factor)]
]

DT1[, c("M", "Fac") := {
        mm <- pmax(M1, M2)
        .(mm, fifelse(mm==M1, F1, F2))
    }]

output:

   pol_count AverageAge M1 F1 M2   F2  M  Fac
1:       400         85 12  1 20 0.65 20 0.65

data:

library(data.table)
DT1 <- data.table(pol_count=400, AverageAge=85)
FactorFile <- fread("
Count_Greater_Than  Age_Less_Than   Months  Factor
100                 80              12      1
85                  82              16      0.85
65                  84              20      0.65
45                  86              24      0.45")

Upvotes: 1

Related Questions