In R, left join two tables whose 2 potential keys contain missing data

Question

Background:

I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:

d <- data.frame(
  make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"), 
  model_yr = c("1999","2004","1989","1999","2006","2012"), 
  VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
  stringsAsFactors=FALSE)

Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See below when I get to the other dataset about why this is.

d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.

Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:

a <- data.frame(
  VIN_prefix = c("JH","JF","1N"), 
  jp_makes = c("Acura","Subaru","Nissan"),
  current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"), 
  stringsAsFactors=FALSE)

Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.

Goal:

Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)

Problem:

As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.

What to do?

What I've tried:

I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:

try1 <- left_join(d, a, by = c("make" = "jp_makes"))

try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))

I can successfully generate an logical 'indicator' variable in d that tells me whether a car is Japanese or not:

entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix

d<- d %>%
  mutate(is_jp = ifelse(d$VIN_prefix %in% entries_vin_prefix | d$make %in% entries_make, 1, 0) 
         %>% as.logical())

But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.

Ideally, I'd like a dataset that looks something like this:

ideal <- data.frame(
  make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"), 
  model_yr = c("1999","2004","1989","1999","2006","2012"), 
  VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
  current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
  stringsAsFactors=FALSE)

What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!

G. Grothendieck · Accepted Answer

Left join on an OR of the two conditions.

library(sqldf)

sqldf("select d.*, a.current_ceo
  from d
  left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")

giving:

      make model_yr VIN_prefix     current_ceo
1      GMC     1999         1G            
2    Dodge     2004         1D            
3       NA     1989         JH  Toshihiro Mibe
4   Subaru     1999         JF Tomomi Nakamura
5   Nissan     2006         NA   Makoto Ushida
6 Chrysler     2012         2C

In R, left join two tables whose 2 potential keys contain missing data

Answers (2)

Related Questions