Reputation: 2596
I am trying to combine information from multiple data frames. Suppose I have a master data frame called "master" regarding information about a bunch of cars. The Master data frame has ALL IDs possible. Then suppose I have 2 other data frames, each with SOME of the ids and the color of the cars but not all. I want to put all of the colors of the cars, corresponding to their IDs in the existing Master data frame.
Master df looks like this:
id year
4D 2005
5A 2003
7T 1999
8Q 2001
3G 1998
6N 2009
2B 2013
df1 looks like this:
id color
5A black
7T green
8Q gold
4D white
df2 looks like this:
id color
5A NA
6N purple
3G NA
2B orange
Here's the code to make the 4 data frames:
master <- data.frame(id = c("4D", "5A", "7T", "8Q", "3G", "6N", "2B"), year=c(2005,2003,1999,2001,1998,2009,2013))
df1 <- data.frame(id=c("5A", "7T", "8Q", "4D"), color=c("black", "green", "gold", "white"))
df2 <- data.frame(id=c("5A", "6N", "3G", "2B"), color=c(NA,"purple", NA, "orange"))
I want the final master data frame to look like this:
id year color
4D 2005 white
5A 2003 black
7T 1999 green
8Q 2001 gold
3G 1998 NA
6N 2009 purple
2B 2013 orange
If there is value in one data frame and an NA in the other data frame for the same ID (ie: ID 5A has "black" in df1 but NA in df2), the value should be present in the spreadsheet.
Any thoughts? I've looked at the merge and stack functions but I don't think these are the ways to go about it?
Upvotes: 0
Views: 409
Reputation: 3082
Here is another option.
1, My approach is to combine df1 and df2 first then remove those color is NA using complete.cases
2, Then merge master with the new df3. (all=TRUE
, ensure those id not included in df3 will shown up in the combined data.frame)
df3 <- rbind(df1,df2) #row combine for df1 and df2
df3 <- df3[complete.cases(df3),] #remove row with NA
merge(master, df3, by="id", all=TRUE) #merge master and df3 based on id
Result is
id year color
1 2B 2013 orange
2 3G 1998 <NA>
3 4D 2005 white
4 5A 2003 black
5 6N 2009 purple
6 7T 1999 green
7 8Q 2001 gold
Cheers.
Upvotes: 1
Reputation: 1616
You could also use the sqldf
package once you combine your df1
and df2
data frames by row:
*Assumes the same input from your example above (e.g. values for master
, df1
, df2
)
> id_color <- rbind(df1, df2)
> library(sqldf)
> data <- sqldf("SELECT m.*, i.color FROM master m, id_color i WHERE m.id = i.id")
> data
id year color
1 4D 2005 white
2 5A 2003 <NA>
3 5A 2003 black
4 7T 1999 green
5 8Q 2001 gold
6 3G 1998 <NA>
7 6N 2009 purple
8 2B 2013 orange
> data <- sqldf("SELECT id, year, MIN(color) FROM data GROUP BY id, year") #removes duplicate with NULL
> data
id year MIN(color)
1 2B 2013 orange
2 3G 1998 <NA>
3 4D 2005 white
4 5A 2003 black
5 6N 2009 purple
6 7T 1999 green
7 8Q 2001 gold
Upvotes: 0
Reputation: 5856
df <- Master
df1
df2
dft <- rbind(df1, df2)
df3 <- dft[with(dft, order(id, color)), ]
df3 <- df3[!duplicated(df3$color), ]
merge(df, df3, by = 'id', all.x = T)
id year color
1 2B 2013 orange
2 3G 1998 <NA>
3 4D 2005 white
4 5A 2003 black
5 6N 2009 purple
6 7T 1999 green
7 8Q 2001 gold
Upvotes: 1