Reputation: 646
I have the following issue related with matching different data frames.
In the first place, I have the next table:
table<-data.frame(brand=c('duna','cars','cars','sea','sea','sea','mega','moon','moon'),model=c('mm','mm','mm','ll','ll','ll','tr','tr','tr'),version=c("2.8 sr cab. dupla 4x4 tdi","2.0 lsdakar 16v 4x4 hi-flex 5-p","2.4 ls cab. simples 4x2 flex 2-p","2.3 xl cab. simples 4x2 2-p","1.8 sx 5-p","1.0 mpfi joy 8v","hatch ls 1.0 8v","2.3 xlt cab. dupla 4x2 limited 4-p","1.4 fire ce xlt flex 2-p"))
brand model version
1 duna mm 2.8 sr cab. dupla 4x4 tdi
2 cars mm 2.0 lsdakar 16v 4x4 hi-flex 5-p
3 cars mm 2.4 ls cab. simples 4x2 flex 2-p
4 sea ll 2.3 xl cab. simples 4x2 2-p
5 sea ll 1.8 sx 5-p
6 sea ll 1.0 mpfi joy 8v
7 mega tr hatch ls 1.0 8v
8 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p
9 moon tr 1.4 fire ce xlt flex 2-p
And I have to match it with the next one:
table_match<-data.frame(brand=c('duna','cars','sea','mega','moon'),model=c('mm','mm','ll','tr','tr'),version=c('tdi','ls','xl','ls','xlt'))
table_match$id<-paste0(table_match$brand,table_match$model,table_match$version)
brand model version id
1 duna mm tdi dunammtdi
2 cars mm ls carsmmls
3 sea ll xl seallxl
4 mega tr ls megatrls
5 moon tr xlt moontrxlt
So, the issue here is to match brand
, model
and version
from table_match
into table
.
For example, in table
if brand=duna
, model=mm
and version
contains the exact word "tdi"
, then it is a matching! so the id
(in table_match
), with that matching, will be located next to version
.
brand model version match
1 duna mm 2.8 sr cab. dupla 4x4 tdi dunammtdi
2 cars mm 2.0 lsdakar 16v 4x4 hi-flex 5-p
3 cars mm 2.4 ls cab. simples 4x2 flex 2-p carsmmls
4 sea ll 2.3 xl cab. simples 4x2 2-p seallxl
5 sea ll 1.8 sx 5-p
6 sea ll 1.0 mpfi joy 8v
7 mega tr hatch ls 1.0 8v megatrls
8 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p moontrxlt
9 moon tr 1.4 fire ce xlt flex 2-p moontrxlt
Upvotes: 2
Views: 118
Reputation: 4378
fuzzy_join on two equal conditions ==
and the third str_detect
from stringr
. I don't know why the fuzzy join makes three duplicates of each, so added unique()
NOTE: added stringsAsFactors = FALSE
to test data
table<-data.frame(brand=c('duna','cars','cars','sea','sea','sea','mega','moon','moon'),
model=c('mm','mm','mm','ll','ll','ll','tr','tr','tr'),
version=c("2.8 sr cab. dupla 4x4 tdi","2.0 lsdakar 16v 4x4 hi-flex 5-p","2.4 ls cab. simples 4x2 flex 2-p","2.3 xl cab. simples 4x2 2-p","1.8 sx 5-p","1.0 mpfi joy 8v","hatch ls 1.0 8v","2.3 xlt cab. dupla 4x2 limited 4-p","1.4 fire ce xlt flex 2-p")
stringsAsFactors = FALSE)
table_match<-data.frame(brand=c('duna','cars','sea','mega','moon'),
model=c('mm','mm','ll','tr','tr'),
version=c('tdi','ls','xl','ls','xlt'),
stringsAsFactors = FALSE)
library(fuzzyjoin)
library(stringr)
fuzzy_join(table, table_match,
by = c("brand", "model", "version"),
match_fun = c(`==`, `==`, function(x,y) { str_detect(x, paste0("\\b", y, "\\b" ))})) %>%
unique()
# brand.x model.x version.x brand.y model.y version.y
# 1 duna mm 2.8 sr cab. dupla 4x4 tdi duna mm tdi
# 4 cars mm 2.4 ls cab. simples 4x2 flex 2-p cars mm ls
# 7 sea ll 2.3 xl cab. simples 4x2 2-p sea ll xl
# 10 mega tr hatch ls 1.0 8v mega tr ls
# 13 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p moon tr xlt
# 16 moon tr 1.4 fire ce xlt flex 2-p moon tr xlt
Upvotes: 1
Reputation: 3053
You can also try regex_join()
from the fuzzyjoin
package. N.B. I have added space around the string 'ls' in table_match
so that the regular expression does not match the string 'lsdakar' in table
, since this is not what the original poster wanted.
library(fuzzyjoin)
# Use data_frame() to get rid of stringsAsFactors problem
table <-
data_frame(
brand = c('duna', 'cars', 'cars', 'sea', 'sea', 'sea', 'mega', 'moon', 'moon'),
model = c('mm', 'mm', 'mm', 'll', 'll', 'll', 'tr', 'tr', 'tr'),
version = c(
"2.8 sr cab. dupla 4x4 tdi",
"2.0 lsdakar 16v 4x4 hi-flex 5-p",
"2.4 ls cab. simples 4x2 flex 2-p",
"2.3 xl cab. simples 4x2 2-p",
"1.8 sx 5-p",
"1.0 mpfi joy 8v",
"hatch ls 1.0 8v",
"2.3 xlt cab. dupla 4x2 limited 4-p",
"1.4 fire ce xlt flex 2-p"
)
)
# Use data_frame() here too
table_match <-
data_frame(
brand = c('duna', 'cars', 'sea', 'mega', 'moon'),
model = c('mm', 'mm', 'll', 'tr', 'tr'),
version = c('tdi', ' ls ', 'xl', ' ls ', 'xlt')
)
# Use regex_semi_join to find the matches
regex_semi_join(table, table_match, by = c(brand = "brand",
model = "model", version = "version"))
# A tibble: 6 × 3
brand model version
<chr> <chr> <chr>
1 duna mm 2.8 sr cab. dupla 4x4 tdi
2 cars mm 2.4 ls cab. simples 4x2 flex 2-p
3 sea ll 2.3 xl cab. simples 4x2 2-p
4 mega tr hatch ls 1.0 8v
5 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p
6 moon tr 1.4 fire ce xlt flex 2-p
>
# Use regex_anti_join to find the non-matches
regex_anti_join(table, table_match, by = c(brand = "brand",
model = "model", version = "version"))
# A tibble: 3 × 3
brand model version
<chr> <chr> <chr>
1 cars mm 2.0 lsdakar 16v 4x4 hi-flex 5-p
2 sea ll 1.8 sx 5-p
3 sea ll 1.0 mpfi joy 8v
>
Upvotes: 2
Reputation: 2806
I think we can just do a merge and then filter it down using regex. Give this a shot
dat = merge(table, table_match, by = c("brand", "model"))
dat$match = mapply(function(x, y) grepl(paste("\\b", x, "\\b", sep = ""), y), dat$version.y, dat$version.x)
dat$match = ifelse(dat$match, dat$id, "")
dat = dat[ , !colnames(dat) %in% c("version.y", "id")]
colnames(dat)[3] = "version"
dat = dat[with(dat, order(brand)), ]
brand model version match
1 cars mm 2.0 lsdakar 16v 4x4 hi-flex 5-p
2 cars mm 2.4 ls cab. simples 4x2 flex 2-p carsmmls
3 duna mm 2.8 sr cab. dupla 4x4 tdi dunammtdi
4 mega tr hatch ls 1.0 8v megatrls
5 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p moontrxlt
6 moon tr 1.4 fire ce xlt flex 2-p moontrxlt
7 sea ll 2.3 xl cab. simples 4x2 2-p seallxl
8 sea ll 1.8 sx 5-p
9 sea ll 1.0 mpfi joy 8v
Upvotes: 3