Reputation: 1211
Background:
I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d
intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d
contains information like VIN_prefix
(the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d
:
d <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
stringsAsFactors=FALSE)
Here, rows 3, 4, and 5 correspond to Japanese cars: the NA
in row 3 is actually an Acura whose make is missing. See below when I get to the other dataset about why this is.
d
also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.
Enter another dataset, a
, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d
is made by a Japanese firm. One of those is VIN_prefix
; the other is jp_makes
, a list of Japanese auto firms. Here's a
:
a <- data.frame(
VIN_prefix = c("JH","JF","1N"),
jp_makes = c("Acura","Subaru","Nissan"),
current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"),
stringsAsFactors=FALSE)
Here, we can see that the "Acura" make, missing in the car from row 3 in d
, could be identified by its VIN_prefix
"JH", which in row 3 of d
is not NA
.
Goal:
Left join a
onto d
so that each of the 3 Japanese cars in d
gets the relevant corresponding attributes from a
- mainly, current_ceo
. (Non-Japanese cars in d
would have NA
for columns joined from a
; this is fine.)
Problem:
As you can tell, the two relevant variables in d
that could be used as keys in a join - make
and VIN_prefix
- have missing data in d
. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes
or on d$VIN_prefix == a$VIN_prefix
, but they'd each be wrong due to the missing data in d
.
What to do?
What I've tried:
I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d
wind up with their correct information from a
:
try1 <- left_join(d, a, by = c("make" = "jp_makes"))
try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))
I can successfully generate an logical 'indicator' variable in d
that tells me whether a car is Japanese or not:
entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix
d<- d %>%
mutate(is_jp = ifelse(d$VIN_prefix %in% entries_vin_prefix | d$make %in% entries_make, 1, 0)
%>% as.logical())
But that only gets me halfway: I still need those other columns from a
to sit next to those Japanese cars in d
. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.
Ideally, I'd like a dataset that looks something like this:
ideal <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
stringsAsFactors=FALSE)
What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!
Upvotes: 1
Views: 382
Reputation: 270248
Left join on an OR of the two conditions.
library(sqldf)
sqldf("select d.*, a.current_ceo
from d
left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")
giving:
make model_yr VIN_prefix current_ceo
1 GMC 1999 1G <NA>
2 Dodge 2004 1D <NA>
3 NA 1989 JH Toshihiro Mibe
4 Subaru 1999 JF Tomomi Nakamura
5 Nissan 2006 NA Makoto Ushida
6 Chrysler 2012 2C <NA>
Upvotes: 2
Reputation: 263481
Use a two pass method. First fill in the missing make (or VIN values). I'll illustrate by filling in make
valuesDo notice taht "NA" is not the same as NA
. The first is a character value while the latter is a true R missing value, so I'd first convert those to a true missing value. In natural language I am replacing the missing values in d
(note correction of df
) with values of 'jp_makes' that are taken from a
on the basis of matching VIN_prefix values:
is.na( d$make) <- df$make=="NA"
d$make[is.na(df$make)] <- a$jp_makes[
match( d$VIN_prefix[is.na(d$make)], a$VIN_prefix) ]
Now you have the make
values filled in on the basis of the table look up in a
. It should be trivial to do the match you wanted by using by.x='make', by.y='jp_make'
merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
make model_yr VIN_prefix.x VIN_prefix.y current_ceo
1 Acura 1989 JH JH Toshihiro Mibe
2 Chrysler 2012 2C <NA> <NA>
3 Dodge 2004 1D <NA> <NA>
4 GMC 1999 1G <NA> <NA>
5 Nissan 2006 NA 1N Makoto Ushida
6 Subaru 1999 JF JF Tomomi Nakamura
You can then use the values in VIN_prefix.y
to replace the values the =="NA"
in VIN_prefix.x
.
Upvotes: 1