ThePlowKing
ThePlowKing

Reputation: 341

Merging two Data Frames in R (Data frames not merging properly)

I've having a bit of a problem using the merge function in R for two data frames. I have two large data frames which have columns and data in common with each other, and that's what I use to merge them.

For instance, data frame 1 has columns "Instrument" , "RecordDate", "HourMinuteSecond", "MilliSecond", .... (and several other additional columns) while data frame 2 has columns "Instrument" , "RecordDate", "HourMinuteSecond", "MilliSecond", .... (and several other different columns compared to data frame 2). I now use the merge function as follows:

DataFrame3 <- merge(DataFrame2, DataFrame1, by=c("Instrument", "RecordDate","HourMinuteSecond","MilliSecond"))

And this gives me a new data frame. Now, the problem I'm having is that some between the data frames was not matched at all. For example, I found one section of data which was not matched properly:

> DataFrame1[120486,]
       Instrument RecordDate HourMinuteSecond MilliSecond 
120486        DTE   6/4/2012         16:10:27          42     

 > DataFrame2[65,]
   Instrument RecordDate HourMinuteSecond MilliSecond 
65        DTE   6/4/2012         16:10:27          42  

(Note that there are other columns but I have omitted them). I now compare these sections as if they are vectors. Firstly, I use the identical function to compare each value in each vector, and this gives the following:

> identical(DataFrame1[120486,1] ,DataFrame2[65,1])
[1] FALSE
> identical(DataFrame1[120486,2] ,DataFrame2[65,2])
[1] TRUE
> identical(DataFrame1[120486,3] ,DataFrame2[65,3])
[1] FALSE
> identical(DataFrame1[120486,4] ,DataFrame2[65,4])
[1] TRUE

From the identical function it appears as if the values in the columns "Instruments" and "HourMintuteSecond" are different from each other. Can anyone tell me what could be causing this issue? Thanks in advance.

EDIT: Here is the dput output, hopefully this is what you were all referring to:

> dput(droplevels(DataFrame2[65,1:4]))
structure(list(Instrument = structure(1L, .Label = "DTE", class = "factor"), 
RecordDate = structure(1L, .Label = "6/4/2012", class = "factor"), 
HourMinuteSecond = structure(1L, .Label = "16:10:27", class = "factor"), 
MilliSecond = 42L), .Names = c("Instrument", "RecordDate", 
"HourMinuteSecond", "MilliSecond"), row.names = 65L, class = "data.frame")

> dput(droplevels(DataFrame1[120486,1:4]))
structure(list(Instrument = structure(1L, .Label = "DTE", class = "factor"), 
RecordDate = structure(1L, .Label = "6/4/2012", class = "factor"), 
HourMinuteSecond = structure(1L, .Label = "16:10:27", class = "factor"), 
MilliSecond = 42L), .Names = c("Instrument", "RecordDate", 
"HourMinuteSecond", "MilliSecond"), row.names = 120486L, class = "data.frame")

And here is the str() output:

> str(DataFrame1)
'data.frame':   317495 obs. of  9 variables:
 $ Instrument      : Factor w/ 4 levels "CDD","DTE","ERA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ RecordDate      : Factor w/ 30 levels "5/18/2012","5/21/2012",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ HourMinuteSecond: Factor w/ 21763 levels "10:02:02","10:02:03",..: 14 14 14 17 19 22 24 25 25 25 ...
 $ MilliSecond     : int  26 57 158 70 73 8 926 448 457 458 ...
 $ L1BidPrice      : num  6.91 6.91 6.91 6.91 6.91 6.91 6.9 6.9 6.89 6.89 ...
 $ L1BidVolume     : int  520 504 504 504 504 508 20 4 20 20 ...
 $ L1AskPrice      : num  6.92 6.92 6.92 6.92 6.92 6.92 6.91 6.91 6.9 6.9 ...
 $ L1AskVolume     : int  3917 3917 3915 3932 3915 3915 3407 3407 13 30 ...
 $ Midquote        : num  6.92 6.92 6.92 6.92 6.92 ...

> str(DataFrame2)
'data.frame':   577 obs. of  15 variables:
 $ Instrument       : Factor w/ 2 levels "DTE","ERA": 1 1 1 1 1 1 1 1 1 1 ...
 $ RecordDate       : Factor w/ 30 levels "5/18/2012","5/21/2012",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ HourMinuteSecond : Factor w/ 317 levels "10:02:10","10:02:21",..: 301 301 301 301 301 301 2 98 129 130 ...
 $ MilliSecond      : int  45 45 45 45 45 45 485 6 92 300 ...
 $ RecordType       : Factor w/ 1 level "TRADE": 1 1 1 1 1 1 1 1 1 1 ...
 $ Price            : num  0.195 0.195 0.195 0.195 0.195 0.195 0.2 0.19 0.19 0.185 ...
 $ Volume           : int  2686 6350 6350 6350 1620 3064 1 13986 25000 23092 ...
 $ UndisclosedVolume: Factor w/ 1 level "\\N": 1 1 1 1 1 1 1 1 1 1 ...
 $ DollarValue      : num  524 1238 1238 1238 316 ...
 $ Qualifiers       : Factor w/ 4 levels "\\N","AC","Bi",..: 2 2 2 2 2 2 4 4 3 4 ...
 $ BidID            : num  6.13e+18 6.13e+18 6.13e+18 6.13e+18 6.13e+18 ...
 $ AskID            : num  6.13e+18 6.13e+18 6.13e+18 6.13e+18 6.13e+18 ...
 $ BidOrAsk         : Factor w/ 1 level "\\N": 1 1 1 1 1 1 1 1 1 1 ...
 $ BuyerBrokerID    : int  229 229 229 229 229 229 236 129 229 112 ...
 $ SellerBrokerID   : int  297 210 210 210 110 157 229 229 299 229 ...

Upvotes: 0

Views: 2769

Answers (1)

Sathish
Sathish

Reputation: 12703

# load data table library used for large data sets
library('data.table')

# convert factors into character
col1 <- colnames(df1)[sapply(df1, is.factor)]  # get columns that are factors for df1
col2 <- colnames(df2)[sapply(df2, is.factor)]  # get columns that are factors for df2

for(col in col1){   # df1
  set(df1, , col, as.character( df1[[col]] ) )    # for more info on set() function, read ?`:=`
}

for(col in col2){   # df2
  set(df2, , col, as.character( df2[[col]] ) )
}

# join two data frames by the selected columns in 'on' argument
setDT(df1)[df2, on = c('Instrument', 'RecordDate', 'HourMinuteSecond','MilliSecond')]   # setDT converts data frame to data table by reference
#    Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote i.L1BidPrice i.L1BidVolume i.L1AskPrice i.L1AskVolume
# 1:        DTE   6/4/2012         16:10:27          42       6.91         520       6.92        3917     6.92            7             8            9            10
#    i.Midquote
# 1:         11

# merge function in data table is faster than the same function in base R function. You just convert data frame into data tables.
setDT(df1)
setDT(df2)
merge(df1, df2, by = c('Instrument', 'RecordDate', 'HourMinuteSecond','MilliSecond'))

Data:

df1 <- structure(list(Instrument = "DTE", RecordDate = "6/4/2012", HourMinuteSecond = "16:10:27", 
                      MilliSecond = 42L, L1BidPrice = 6.91, L1BidVolume = 520, 
                      L1AskPrice = 6.92, L1AskVolume = 3917, Midquote = 6.92), .Names = c("Instrument", 
                                                                                          "RecordDate", "HourMinuteSecond", "MilliSecond", "L1BidPrice", 
                                                                                          "L1BidVolume", "L1AskPrice", "L1AskVolume", "Midquote"), row.names = c(NA, -1L), class = "data.frame") 

df2 <- structure(list(Instrument = "DTE", RecordDate = "6/4/2012", HourMinuteSecond = "16:10:27", 
                      MilliSecond = 42L, L1BidPrice = 7, L1BidVolume = 8, L1AskPrice = 9, 
                      L1AskVolume = 10, Midquote = 11), .Names = c("Instrument", 
                                                                   "RecordDate", "HourMinuteSecond", "MilliSecond", "L1BidPrice", 
                                                                   "L1BidVolume", "L1AskPrice", "L1AskVolume", "Midquote"), row.names = 120486L, class = "data.frame")


df1
#    Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote
# 1:        DTE   6/4/2012         16:10:27          42       6.91         520       6.92        3917     6.92

df2
#        Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote
# 120486:        DTE   6/4/2012         16:10:27          42          7           8          9          10       11

Upvotes: 1

Related Questions