Reputation: 69
I'd like to make a kind of vlookup. As you know in Excel vlookup function takes first value from the data. On the other hand left_join function works similarly. But when the first data is not unique on lookup values, left_join function dublicate de values. I'd like to do same think with excel without making unique the first dataset. I dont want to manupilate any of the data. Just want to take first looked value.
For example. In this case we see Title A is dublicated in Data3 and also StartYear looks the same with EvaYear :(
I want to find EvaYear in between StartDate and EndDate. If EvaYear equel to EndDate and next period's StartDate. I'd like to take first row. That's why I had to use >= & <= in my formula at the same time..
Title <- c("A","A","A","B","B")
StartYear <- c(2000,2005,2008,2010,2012)
EndYear <- c(2005,2008,2010,2012,2015)
Score <- c(100,75,80,95,79)
Data1 <- data.frame(Title,StartYear,EndYear,Score)
Title <- c("A","B")
EvaYear <- c(2008,2015)
Data2 <- data.frame(Title,EvaYear)
setDT(Data2)
Data3 <- setDT(Data1)[Data2,.(Title,StartYear,EndYear,EvaYear,Score),on=.(Title,StartYear<=EvaYear,EndYear>=EvaYear)]
After run the query I get:
Title StartYear EndYear EvaYear Score
A 2008 2008 2008 75
A 2008 2008 2008 80
B 2015 2015 2015 79
But I need to get:
Title StartYear EndYear EvaYear Score
A 2005 2008 2008 75
B 2012 2015 2015 79
Upvotes: 2
Views: 107
Reputation: 42592
As mentioned by r2evans, non-equi joins pick the columns from the second data.table.
Columns of the first data.table can be referred to using the prefix x.
:
setDT(Data1)[Data2, .(Title, x.StartYear, x.EndYear, EvaYear, Score),
on = .(Title, StartYear <= EvaYear, EndYear >= EvaYear)]
Title x.StartYear x.EndYear EvaYear Score 1: A 2005 2008 2008 75 2: A 2008 2010 2008 80 3: B 2012 2015 2015 79
The duplicate entries can be removed by mult = "first"
as suggested by jangorecki:
setDT(Data1)[Data2, .(Title, StartYear = x.StartYear, EndYear = x.EndYear, EvaYear, Score),
on = .(Title, StartYear <= EvaYear, EndYear >= EvaYear), mult = "first"]
Title StartYear EndYear EvaYear Score 1: A 2005 2008 2008 75 2: B 2012 2015 2015 79
I have also renamed the "x." columns to reproduce OP's expected result.
Upvotes: 2
Reputation: 160952
There are a couple of things to address. (1) You are not seeing 2012 in your actual output, solved by which frame is first in your join. (2) Removing near-duplicates.
The first is because it is that in inequality joins, it is keeping the columns from the other frame. Try swapping them.
Data3 <- Data2[Data1, .(Title, StartYear, EndYear, EvaYear, Score), on = .(Title, EvaYear >= StartYear, EvaYear <= EndYear)]
Data3
# Title StartYear EndYear EvaYear Score
# 1: A 2000 2005 2000 100
# 2: A 2005 2008 2005 75
# 3: A 2008 2010 2008 80
# 4: B 2010 2012 2010 95
# 5: B 2012 2015 2012 79
As you can see, though, EvaYear
is changed, matching the StartYear
that it matched. Let's try to fix that:
Data3 <- Data2[, EY := EvaYear][Data1, .(Title, StartYear, EndYear, EvaYear, Score), on = .(Title, EY >= StartYear, EY <= EndYear)
][ !is.na(EvaYear),
][ order(Score),
][, .SD[1,], by = .(Title, EvaYear)]
Data3
# Title EvaYear StartYear EndYear Score
# 1: A 2008 2005 2008 75
# 2: B 2015 2012 2015 79
Upvotes: 4