Seqenenre Tao
Seqenenre Tao

Reputation: 69

Left_Join without dublicate in R

I'd like to make a kind of vlookup. As you know in Excel vlookup function takes first value from the data. On the other hand left_join function works similarly. But when the first data is not unique on lookup values, left_join function dublicate de values. I'd like to do same think with excel without making unique the first dataset. I dont want to manupilate any of the data. Just want to take first looked value.

For example. In this case we see Title A is dublicated in Data3 and also StartYear looks the same with EvaYear :(

I want to find EvaYear in between StartDate and EndDate. If EvaYear equel to EndDate and next period's StartDate. I'd like to take first row. That's why I had to use >= & <= in my formula at the same time..

Title <- c("A","A","A","B","B")
StartYear <- c(2000,2005,2008,2010,2012)
EndYear <- c(2005,2008,2010,2012,2015)
Score <- c(100,75,80,95,79)
Data1 <- data.frame(Title,StartYear,EndYear,Score)

Title <- c("A","B")
EvaYear <- c(2008,2015)
Data2 <- data.frame(Title,EvaYear)

setDT(Data2)
Data3 <- setDT(Data1)[Data2,.(Title,StartYear,EndYear,EvaYear,Score),on=.(Title,StartYear<=EvaYear,EndYear>=EvaYear)]

After run the query I get:

  Title   StartYear   EndYear   EvaYear  Score
     A        2008        2008      2008    75
     A        2008        2008      2008    80
     B        2015        2015      2015    79

But I need to get:

Title   StartYear   EndYear   EvaYear  Score
 A        2005        2008      2008    75
 B        2012        2015      2015    79

Upvotes: 2

Views: 107

Answers (2)

Uwe
Uwe

Reputation: 42592

As mentioned by r2evans, non-equi joins pick the columns from the second data.table. Columns of the first data.table can be referred to using the prefix x.:

setDT(Data1)[Data2, .(Title, x.StartYear, x.EndYear, EvaYear, Score), 
             on = .(Title, StartYear <= EvaYear, EndYear >= EvaYear)]
   Title x.StartYear x.EndYear EvaYear Score
1:     A        2005      2008    2008    75
2:     A        2008      2010    2008    80
3:     B        2012      2015    2015    79

The duplicate entries can be removed by mult = "first" as suggested by jangorecki:

setDT(Data1)[Data2, .(Title, StartYear = x.StartYear, EndYear = x.EndYear, EvaYear, Score), 
             on = .(Title, StartYear <= EvaYear, EndYear >= EvaYear), mult = "first"]
   Title StartYear EndYear EvaYear Score
1:     A      2005    2008    2008    75
2:     B      2012    2015    2015    79

I have also renamed the "x." columns to reproduce OP's expected result.

Upvotes: 2

r2evans
r2evans

Reputation: 160952

There are a couple of things to address. (1) You are not seeing 2012 in your actual output, solved by which frame is first in your join. (2) Removing near-duplicates.

The first is because it is that in inequality joins, it is keeping the columns from the other frame. Try swapping them.

Data3 <- Data2[Data1, .(Title, StartYear, EndYear, EvaYear, Score), on = .(Title, EvaYear >= StartYear, EvaYear <= EndYear)]
Data3
#    Title StartYear EndYear EvaYear Score
# 1:     A      2000    2005    2000   100
# 2:     A      2005    2008    2005    75
# 3:     A      2008    2010    2008    80
# 4:     B      2010    2012    2010    95
# 5:     B      2012    2015    2012    79

As you can see, though, EvaYear is changed, matching the StartYear that it matched. Let's try to fix that:

Data3 <- Data2[, EY := EvaYear][Data1, .(Title, StartYear, EndYear, EvaYear, Score), on = .(Title, EY >= StartYear, EY <= EndYear)
  ][ !is.na(EvaYear), 
    ][ order(Score),
      ][, .SD[1,], by = .(Title, EvaYear)]
Data3
#    Title EvaYear StartYear EndYear Score
# 1:     A    2008      2005    2008    75
# 2:     B    2015      2012    2015    79

Upvotes: 4

Related Questions