Allan Davids
Allan Davids

Reputation: 125

Joining 2 datasets and creating new rows where matches are found

I have two datasets that I would like to join using R -

Dataset 1

    ID Name Date Price
    1    A   2011 $100
    2    B   2012 $200
    3    C   2013 $300

Dataset 2

    ID Date Price
    1  2012 $100
    1  2013 $200
    3  2014 $300

Using left-join() in dplyr by ID I'd end up with this

    ID Name Date.x Price.x Date.y Price.y
    1   A   2011    $100   2012   $100
    1   A   2011    $100   2013   $200
    2   B   2012    $200
    3   C   2013    $300   2014   $300

What I would however like to have as a final product is this

    ID Name Date Price
    1  A     2011 $100
    1  A     2012 $100
    1  A     2013 $200
    2  B     2012 $200
    3  C     2013 $300
    3  C     2014 $300

i.e instead of merging to the existing row, I'd like to create a new row when a match is found and duplicate the existing information that won't change (ID and Name) and alter the Date and Price column where necessary. Any ideas as to an efficient way to do this on a large dataset?

Upvotes: 6

Views: 404

Answers (7)

Sandesh
Sandesh

Reputation: 31

 > dsa
   ID Name Date Price
 1  1    A 2011  $100
 2  2    B 2012  $200
 3  3    C 2013  $300

 >dsb
  ID Date Price
 1  1 2012  $100
 2  1 2013  $200
 3  3 2014  $300

 >dsb$Name <- NA

 >dsr <- rbind(dsa,dsb)
 >dsr$Name <- dsa$Name[match(dsr$ID,dsa$ID)]
 >dsr

   ID Name Date Price
 1  1    A 2011  $100
 2  2    B 2012  $200
 3  3    C 2013  $300
 4  1    A 2012  $100
 5  1    A 2013  $200
 6  3    C 2014  $300

I am newbie to R. Couldn't make use of full potential of R for best efficiency. But, this does the job.

Upvotes: 0

Sathish
Sathish

Reputation: 12723

Inner join with nomatch = 0. For example, if all ID in dataset2 is 4, inner join will not spit NA to non-matching IDs. If you remove nomatch = 0, then NAs will be produced.

EDIT: added rbindlist wrapper as per @Arun's suggestion

library("data.table")
rbindlist(list(df1, 
               setDT(df1)[i = df2, 
                          j = .(ID, Name, Date = i.Date, Price = i.Price),
                          on = .(ID), 
                          nomatch = 0]))

Output:

   ID Name Date Price
1:  1    A 2011  $100
2:  2    B 2012  $200
3:  3    C 2013  $300
4:  1    A 2012  $100
5:  1    A 2013  $200
6:  3    C 2014  $300

Upvotes: 1

Arun
Arun

Reputation: 118809

This is a slight variation of @Frank's answer. The main issue is that your 2nd table doesn't have a Name column. This can be obtained quite efficiently using data.table's update while join approach..

require(data.table)
dt2[dt1, Name := i.Name, on = "ID"] # by reference, no need to assign the result back

Now that there's a Name column, we can simply rbind the result.

ans = rbind(dt1, if (anyNA(dt2$Name)) na.omit(dt2, by="Name") else dt2)

If necessary, reorder the result by reference using setorder():

setorder(ans, ID, Name) # by reference, no need to assign the result back
#    ID Name Date Price
# 1:  1    A 2011  $100
# 2:  1    A 2012  $100
# 3:  1    A 2013  $200
# 4:  2    B 2012  $200
# 5:  3    C 2013  $300
# 6:  3    C 2014  $300

:= operator and set* functions in data.table modify the input object by reference.


dt1 = fread('ID Name   Date Price
              1    A   2011  $100
              2    B   2012  $200
              3    C   2013  $300')

dt2 = fread('ID  Date Price
              1  2012  $100
              1  2013  $200
              3  2014  $300')

Upvotes: 4

aleksapaulius
aleksapaulius

Reputation: 11

Perhaps one of the efficient ways to do that is to use two steps merge.

# create Dataset 1
ID <- 1:3
Name <- c("A", "B", "C")
Date <- 2011:2013
Price <- c("$100", "$200", "$300")
dataset1 <- data.frame(ID, Name, Date, Price)

# Create Dataset 2
ID <- c(1,1,3)
Date <- 2012:2014
Price <- c("$100", "$200", "$300")
dataset2 <- data.frame(ID, Date, Price)

Assign missing "Name" values to Dataset 2 by using merge function in {base} package

dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)

Merge datasets

merge(dataset1, dataset2, all = T)

Which gives:

   ID Name Date Price
1  1    A 2011  $100
2  1    A 2012  $100
3  1    A 2013  $200
4  2    B 2012  $200
5  3    C 2013  $300
6  3    C 2014  $300

Upvotes: 1

PhilC
PhilC

Reputation: 787

You can use Plyr to join and get the names for the second DF and rbind to union the rows.

library(plyr)

## Add the name column to df2 and get rid of unwanted columns
df3 <- join(df2,df1,by = "ID")
df3[,6] <- NULL
df3[,5] <- NULL

combined <- rbind(df1,df3)

Upvotes: 0

Warner
Warner

Reputation: 1363

df1 <- data.frame(
  ID=1:3,
  Name=c("A","B","C"),
  Date=c(2011,2012,2013),
  Price=c(100,200,300)
)

df2 <- data.frame(
  ID=c(1,1,3),
  Date=c(2012,2013,2014),
  Price=c(100,200,300)
)

left_join won't get you that desired output. You can use full_join.

merged <- full_join(df1, df2, by=c("Date","ID"))

Here's a way to get to the output you want with melt from the reshape2 package:

library(reshape2)
merged <- melt(merged, id.vars=c("ID","Name","Date"))

Then:

> merged[na.omit(merged$Name), -4] #remove NAs and column from melt
    ID Name Date value
1    1    A 2011   100
2    2    B 2012   200
3    3    C 2013   300
1.1  1    A 2011   100
2.1  2    B 2012   200
3.1  3    C 2013   300

Upvotes: 1

Frank
Frank

Reputation: 66819

You asked about the efficient way, so I'll introduce data.table:

library(data.table)
setDT(DF1)
setDT(DF2)

# structure your data so ID attributes are only in an ID table
idDT = DF1[, .(ID, Name)]
DF1[, Name := NULL]

# stack data
DT = rbind(DF1, DF2)

# grab ID attributes if you really need them
DT[idDT, on="ID", Name := i.Name]

which gives

   ID Date Price Name
1:  1 2011  $100    A
2:  2 2012  $200    B
3:  3 2013  $300    C
4:  1 2012  $100    A
5:  1 2013  $200    A
6:  3 2014  $300    C

rbind for data.tables is pretty fast. I wouldn't really expect efficiency to be a big issue when just binding two tables, though.

Regarding spinning off the ID attribute, Name, it matches the recommendations of the dplyr package author, who refers to it as making data tidy.

Upvotes: 6

Related Questions