Reputation: 23
I am wanting to join a dataframe using a function within 'by' of the left_join in R.
Savings_YC <- Savings_YC %>%
left_join(WAP_FULL, by = c("Client", "CompanyCode", "Material", "STOCK_UOM"))
What I want is to join Savings_YC to WAP_FULL on LEFT(Savings_YC$CompanyCode, 4) = WAP_FULL$CompanyCode. As an example this is what it would be in SQL:
SELECT * FROM Savings_YC syc
LEFT JOIN WAP_FULL w ON syc.Client = w.Client and LEFT(syc.CompanyCode, 4) = w.CompanyCode AND syc.Material = w.Material ...
I am assuming I will want to use substr(x, 1, 4) for the LEFT function, but how do I include that within the left_join function? I have tried the following and getting an error.
Savings_YC <- Savings_YC %>% # Not sure this will work
left_join(WAP_FULL, by = c("Client", substr("Savings_YC.CompanyCode",1,4) = "WAP_FULL.CompanyCode", "Material", "STOCK_UOM"))
Upvotes: 1
Views: 54
Reputation: 160417
Three options. Sample data:
dmnds <- head(ggplot2::diamonds, n=10)
other <- data.frame(cut = c("Very", "Prem"), newcol = 1:2)
If you can create a new column on the LHS,
dmnds %>%
mutate(cut2 = substring(cut, 1, 4)) %>%
left_join(other, by = c(cut2 = "cut"))
# # A tibble: 10 x 12
# carat cut color clarity depth table price x y z cut2 newcol
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <int>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 Idea NA
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Prem 2
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 Good NA
# 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Prem 2
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 Good NA
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very 1
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very 1
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very 1
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 Fair NA
# 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very 1
Using fuzzyjoin
:
fuzzyjoin::regex_left_join(dmnds, other, by = "cut")
# # A tibble: 10 x 12
# carat cut.x color clarity depth table price x y z cut.y newcol
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <int>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 NA NA
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Prem 2
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 NA NA
# 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Prem 2
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 NA NA
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very 1
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very 1
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very 1
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 NA NA
# 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very 1
If you can use sqldf
then
### pattern-based
sqldf::sqldf("
select d.*, o.newcol
from dmnds d
left join other o on d.cut like (o.cut || '%')")
# carat cut color clarity depth table price x y z newcol
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 NA
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 2
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 NA
# 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 2
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 NA
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 1
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 1
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 1
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 NA
# 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39 1
### substring-based
sqldf::sqldf("
select d.*, o.newcol
from dmnds d
left join other o on substring(d.cut, 1, 4) = o.cut")
Upvotes: 1