Jakob Orel
Jakob Orel

Reputation: 23

Left_join in R using modified column

I am wanting to join a dataframe using a function within 'by' of the left_join in R.

Savings_YC <- Savings_YC %>%
  left_join(WAP_FULL, by = c("Client", "CompanyCode", "Material", "STOCK_UOM"))

What I want is to join Savings_YC to WAP_FULL on LEFT(Savings_YC$CompanyCode, 4) = WAP_FULL$CompanyCode. As an example this is what it would be in SQL:

SELECT * FROM Savings_YC syc
    LEFT JOIN WAP_FULL w ON syc.Client = w.Client and LEFT(syc.CompanyCode, 4) = w.CompanyCode AND syc.Material = w.Material ...

I am assuming I will want to use substr(x, 1, 4) for the LEFT function, but how do I include that within the left_join function? I have tried the following and getting an error.

Savings_YC <- Savings_YC %>% # Not sure this will work
  left_join(WAP_FULL, by = c("Client", substr("Savings_YC.CompanyCode",1,4) = "WAP_FULL.CompanyCode", "Material", "STOCK_UOM"))

Upvotes: 1

Views: 54

Answers (1)

r2evans
r2evans

Reputation: 160417

Three options. Sample data:

dmnds <- head(ggplot2::diamonds, n=10)
other <- data.frame(cut = c("Very", "Prem"), newcol = 1:2)
  1. If you can create a new column on the LHS,

    dmnds %>%
      mutate(cut2 = substring(cut, 1, 4)) %>%
      left_join(other, by = c(cut2 = "cut"))
    # # A tibble: 10 x 12
    #    carat cut       color clarity depth table price     x     y     z cut2  newcol
    #    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <int>
    #  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 Idea      NA
    #  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 Prem       2
    #  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 Good      NA
    #  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 Prem       2
    #  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 Good      NA
    #  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 Very       1
    #  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Very       1
    #  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 Very       1
    #  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 Fair      NA
    # 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 Very       1
    
  2. Using fuzzyjoin:

    fuzzyjoin::regex_left_join(dmnds, other, by = "cut")
    # # A tibble: 10 x 12
    #    carat cut.x     color clarity depth table price     x     y     z cut.y newcol
    #    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <int>
    #  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 NA        NA
    #  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 Prem       2
    #  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 NA        NA
    #  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 Prem       2
    #  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 NA        NA
    #  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 Very       1
    #  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Very       1
    #  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 Very       1
    #  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 NA        NA
    # 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 Very       1
    
  3. If you can use sqldf then

    ### pattern-based
    sqldf::sqldf("
      select d.*, o.newcol
      from dmnds d
        left join other o on d.cut like (o.cut || '%')")
    #    carat       cut color clarity depth table price    x    y    z newcol
    # 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43     NA
    # 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31      2
    # 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31     NA
    # 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63      2
    # 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75     NA
    # 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48      1
    # 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47      1
    # 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53      1
    # 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49     NA
    # 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39      1
    
    ### substring-based
    sqldf::sqldf("
      select d.*, o.newcol
      from dmnds d
        left join other o on substring(d.cut, 1, 4) = o.cut")
    

Upvotes: 1

Related Questions