Reputation: 125
I have two datasets that I would like to join using R -
Dataset 1
ID Name Date Price
1 A 2011 $100
2 B 2012 $200
3 C 2013 $300
Dataset 2
ID Date Price
1 2012 $100
1 2013 $200
3 2014 $300
Using left-join()
in dplyr
by ID I'd end up with this
ID Name Date.x Price.x Date.y Price.y
1 A 2011 $100 2012 $100
1 A 2011 $100 2013 $200
2 B 2012 $200
3 C 2013 $300 2014 $300
What I would however like to have as a final product is this
ID Name Date Price
1 A 2011 $100
1 A 2012 $100
1 A 2013 $200
2 B 2012 $200
3 C 2013 $300
3 C 2014 $300
i.e instead of merging to the existing row, I'd like to create a new row when a match is found and duplicate the existing information that won't change (ID and Name) and alter the Date and Price column where necessary. Any ideas as to an efficient way to do this on a large dataset?
Upvotes: 6
Views: 404
Reputation: 31
> dsa
ID Name Date Price
1 1 A 2011 $100
2 2 B 2012 $200
3 3 C 2013 $300
>dsb
ID Date Price
1 1 2012 $100
2 1 2013 $200
3 3 2014 $300
>dsb$Name <- NA
>dsr <- rbind(dsa,dsb)
>dsr$Name <- dsa$Name[match(dsr$ID,dsa$ID)]
>dsr
ID Name Date Price
1 1 A 2011 $100
2 2 B 2012 $200
3 3 C 2013 $300
4 1 A 2012 $100
5 1 A 2013 $200
6 3 C 2014 $300
I am newbie to R. Couldn't make use of full potential of R for best efficiency. But, this does the job.
Upvotes: 0
Reputation: 12723
Inner join with nomatch = 0
. For example, if all ID in dataset2 is 4, inner join will not spit NA to non-matching IDs. If you remove nomatch = 0
, then NA
s will be produced.
EDIT: added rbindlist wrapper as per @Arun's suggestion
library("data.table")
rbindlist(list(df1,
setDT(df1)[i = df2,
j = .(ID, Name, Date = i.Date, Price = i.Price),
on = .(ID),
nomatch = 0]))
Output:
ID Name Date Price
1: 1 A 2011 $100
2: 2 B 2012 $200
3: 3 C 2013 $300
4: 1 A 2012 $100
5: 1 A 2013 $200
6: 3 C 2014 $300
Upvotes: 1
Reputation: 118809
This is a slight variation of @Frank's answer. The main issue is that your 2nd table doesn't have a Name
column. This can be obtained quite efficiently using data.table's update while join approach..
require(data.table)
dt2[dt1, Name := i.Name, on = "ID"] # by reference, no need to assign the result back
Now that there's a Name
column, we can simply rbind
the result.
ans = rbind(dt1, if (anyNA(dt2$Name)) na.omit(dt2, by="Name") else dt2)
If necessary, reorder the result by reference using setorder()
:
setorder(ans, ID, Name) # by reference, no need to assign the result back
# ID Name Date Price
# 1: 1 A 2011 $100
# 2: 1 A 2012 $100
# 3: 1 A 2013 $200
# 4: 2 B 2012 $200
# 5: 3 C 2013 $300
# 6: 3 C 2014 $300
:=
operator and set*
functions in data.table modify the input object by reference.
dt1 = fread('ID Name Date Price
1 A 2011 $100
2 B 2012 $200
3 C 2013 $300')
dt2 = fread('ID Date Price
1 2012 $100
1 2013 $200
3 2014 $300')
Upvotes: 4
Reputation: 11
Perhaps one of the efficient ways to do that is to use two steps merge.
# create Dataset 1
ID <- 1:3
Name <- c("A", "B", "C")
Date <- 2011:2013
Price <- c("$100", "$200", "$300")
dataset1 <- data.frame(ID, Name, Date, Price)
# Create Dataset 2
ID <- c(1,1,3)
Date <- 2012:2014
Price <- c("$100", "$200", "$300")
dataset2 <- data.frame(ID, Date, Price)
Assign missing "Name" values to Dataset 2 by using merge
function in {base} package
dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)
Merge datasets
merge(dataset1, dataset2, all = T)
Which gives:
ID Name Date Price
1 1 A 2011 $100
2 1 A 2012 $100
3 1 A 2013 $200
4 2 B 2012 $200
5 3 C 2013 $300
6 3 C 2014 $300
Upvotes: 1
Reputation: 787
You can use Plyr to join and get the names for the second DF and rbind to union the rows.
library(plyr)
## Add the name column to df2 and get rid of unwanted columns
df3 <- join(df2,df1,by = "ID")
df3[,6] <- NULL
df3[,5] <- NULL
combined <- rbind(df1,df3)
Upvotes: 0
Reputation: 1363
df1 <- data.frame(
ID=1:3,
Name=c("A","B","C"),
Date=c(2011,2012,2013),
Price=c(100,200,300)
)
df2 <- data.frame(
ID=c(1,1,3),
Date=c(2012,2013,2014),
Price=c(100,200,300)
)
left_join
won't get you that desired output. You can use full_join
.
merged <- full_join(df1, df2, by=c("Date","ID"))
Here's a way to get to the output you want with melt
from the reshape2
package:
library(reshape2)
merged <- melt(merged, id.vars=c("ID","Name","Date"))
Then:
> merged[na.omit(merged$Name), -4] #remove NAs and column from melt
ID Name Date value
1 1 A 2011 100
2 2 B 2012 200
3 3 C 2013 300
1.1 1 A 2011 100
2.1 2 B 2012 200
3.1 3 C 2013 300
Upvotes: 1
Reputation: 66819
You asked about the efficient way, so I'll introduce data.table:
library(data.table)
setDT(DF1)
setDT(DF2)
# structure your data so ID attributes are only in an ID table
idDT = DF1[, .(ID, Name)]
DF1[, Name := NULL]
# stack data
DT = rbind(DF1, DF2)
# grab ID attributes if you really need them
DT[idDT, on="ID", Name := i.Name]
which gives
ID Date Price Name
1: 1 2011 $100 A
2: 2 2012 $200 B
3: 3 2013 $300 C
4: 1 2012 $100 A
5: 1 2013 $200 A
6: 3 2014 $300 C
rbind
for data.tables is pretty fast. I wouldn't really expect efficiency to be a big issue when just binding two tables, though.
Regarding spinning off the ID attribute, Name, it matches the recommendations of the dplyr package author, who refers to it as making data tidy.
Upvotes: 6