logjammin
logjammin

Reputation: 1211

In R, left join two tables whose 2 potential keys contain missing data

Background:

I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:

d <- data.frame(
  make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"), 
  model_yr = c("1999","2004","1989","1999","2006","2012"), 
  VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
  stringsAsFactors=FALSE) 

Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See below when I get to the other dataset about why this is.

d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.

Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:

a <- data.frame(
  VIN_prefix = c("JH","JF","1N"), 
  jp_makes = c("Acura","Subaru","Nissan"),
  current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"), 
  stringsAsFactors=FALSE)

Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.

Goal:

Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)

Problem:

As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.

What to do?

What I've tried:

I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:

try1 <- left_join(d, a, by = c("make" = "jp_makes"))

try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))

I can successfully generate an logical 'indicator' variable in d that tells me whether a car is Japanese or not:

entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix

d<- d %>%
  mutate(is_jp = ifelse(d$VIN_prefix %in% entries_vin_prefix | d$make %in% entries_make, 1, 0) 
         %>% as.logical())

But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.

Ideally, I'd like a dataset that looks something like this:

ideal <- data.frame(
  make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"), 
  model_yr = c("1999","2004","1989","1999","2006","2012"), 
  VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
  current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
  stringsAsFactors=FALSE) 

What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!

Upvotes: 1

Views: 382

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270248

Left join on an OR of the two conditions.

library(sqldf)

sqldf("select d.*, a.current_ceo
  from d
  left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")

giving:

      make model_yr VIN_prefix     current_ceo
1      GMC     1999         1G            <NA>
2    Dodge     2004         1D            <NA>
3       NA     1989         JH  Toshihiro Mibe
4   Subaru     1999         JF Tomomi Nakamura
5   Nissan     2006         NA   Makoto Ushida
6 Chrysler     2012         2C            <NA>

Upvotes: 2

IRTFM
IRTFM

Reputation: 263481

Use a two pass method. First fill in the missing make (or VIN values). I'll illustrate by filling in make valuesDo notice taht "NA" is not the same as NA. The first is a character value while the latter is a true R missing value, so I'd first convert those to a true missing value. In natural language I am replacing the missing values in d (note correction of df) with values of 'jp_makes' that are taken from a on the basis of matching VIN_prefix values:

 is.na( d$make) <- df$make=="NA"
 d$make[is.na(df$make)] <- a$jp_makes[
                              match( d$VIN_prefix[is.na(d$make)], a$VIN_prefix) ]

Now you have the make values filled in on the basis of the table look up in a. It should be trivial to do the match you wanted by using by.x='make', by.y='jp_make'

 merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
      make model_yr VIN_prefix.x VIN_prefix.y     current_ceo
1    Acura     1989           JH           JH  Toshihiro Mibe
2 Chrysler     2012           2C         <NA>            <NA>
3    Dodge     2004           1D         <NA>            <NA>
4      GMC     1999           1G         <NA>            <NA>
5   Nissan     2006           NA           1N   Makoto Ushida
6   Subaru     1999           JF           JF Tomomi Nakamura

You can then use the values in VIN_prefix.y to replace the values the =="NA" in VIN_prefix.x.

Upvotes: 1

Related Questions