Reputation: 341
I've having a bit of a problem using the merge function in R for two data frames. I have two large data frames which have columns and data in common with each other, and that's what I use to merge them.
For instance, data frame 1 has columns "Instrument" , "RecordDate", "HourMinuteSecond", "MilliSecond", .... (and several other additional columns) while data frame 2 has columns "Instrument" , "RecordDate", "HourMinuteSecond", "MilliSecond", .... (and several other different columns compared to data frame 2). I now use the merge function as follows:
DataFrame3 <- merge(DataFrame2, DataFrame1, by=c("Instrument", "RecordDate","HourMinuteSecond","MilliSecond"))
And this gives me a new data frame. Now, the problem I'm having is that some between the data frames was not matched at all. For example, I found one section of data which was not matched properly:
> DataFrame1[120486,]
Instrument RecordDate HourMinuteSecond MilliSecond
120486 DTE 6/4/2012 16:10:27 42
> DataFrame2[65,]
Instrument RecordDate HourMinuteSecond MilliSecond
65 DTE 6/4/2012 16:10:27 42
(Note that there are other columns but I have omitted them). I now compare these sections as if they are vectors. Firstly, I use the identical function to compare each value in each vector, and this gives the following:
> identical(DataFrame1[120486,1] ,DataFrame2[65,1])
[1] FALSE
> identical(DataFrame1[120486,2] ,DataFrame2[65,2])
[1] TRUE
> identical(DataFrame1[120486,3] ,DataFrame2[65,3])
[1] FALSE
> identical(DataFrame1[120486,4] ,DataFrame2[65,4])
[1] TRUE
From the identical function it appears as if the values in the columns "Instruments" and "HourMintuteSecond" are different from each other. Can anyone tell me what could be causing this issue? Thanks in advance.
EDIT: Here is the dput output, hopefully this is what you were all referring to:
> dput(droplevels(DataFrame2[65,1:4]))
structure(list(Instrument = structure(1L, .Label = "DTE", class = "factor"),
RecordDate = structure(1L, .Label = "6/4/2012", class = "factor"),
HourMinuteSecond = structure(1L, .Label = "16:10:27", class = "factor"),
MilliSecond = 42L), .Names = c("Instrument", "RecordDate",
"HourMinuteSecond", "MilliSecond"), row.names = 65L, class = "data.frame")
> dput(droplevels(DataFrame1[120486,1:4]))
structure(list(Instrument = structure(1L, .Label = "DTE", class = "factor"),
RecordDate = structure(1L, .Label = "6/4/2012", class = "factor"),
HourMinuteSecond = structure(1L, .Label = "16:10:27", class = "factor"),
MilliSecond = 42L), .Names = c("Instrument", "RecordDate",
"HourMinuteSecond", "MilliSecond"), row.names = 120486L, class = "data.frame")
And here is the str() output:
> str(DataFrame1)
'data.frame': 317495 obs. of 9 variables:
$ Instrument : Factor w/ 4 levels "CDD","DTE","ERA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ RecordDate : Factor w/ 30 levels "5/18/2012","5/21/2012",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HourMinuteSecond: Factor w/ 21763 levels "10:02:02","10:02:03",..: 14 14 14 17 19 22 24 25 25 25 ...
$ MilliSecond : int 26 57 158 70 73 8 926 448 457 458 ...
$ L1BidPrice : num 6.91 6.91 6.91 6.91 6.91 6.91 6.9 6.9 6.89 6.89 ...
$ L1BidVolume : int 520 504 504 504 504 508 20 4 20 20 ...
$ L1AskPrice : num 6.92 6.92 6.92 6.92 6.92 6.92 6.91 6.91 6.9 6.9 ...
$ L1AskVolume : int 3917 3917 3915 3932 3915 3915 3407 3407 13 30 ...
$ Midquote : num 6.92 6.92 6.92 6.92 6.92 ...
> str(DataFrame2)
'data.frame': 577 obs. of 15 variables:
$ Instrument : Factor w/ 2 levels "DTE","ERA": 1 1 1 1 1 1 1 1 1 1 ...
$ RecordDate : Factor w/ 30 levels "5/18/2012","5/21/2012",..: 1 1 1 1 1 1 2 2 2 2 ...
$ HourMinuteSecond : Factor w/ 317 levels "10:02:10","10:02:21",..: 301 301 301 301 301 301 2 98 129 130 ...
$ MilliSecond : int 45 45 45 45 45 45 485 6 92 300 ...
$ RecordType : Factor w/ 1 level "TRADE": 1 1 1 1 1 1 1 1 1 1 ...
$ Price : num 0.195 0.195 0.195 0.195 0.195 0.195 0.2 0.19 0.19 0.185 ...
$ Volume : int 2686 6350 6350 6350 1620 3064 1 13986 25000 23092 ...
$ UndisclosedVolume: Factor w/ 1 level "\\N": 1 1 1 1 1 1 1 1 1 1 ...
$ DollarValue : num 524 1238 1238 1238 316 ...
$ Qualifiers : Factor w/ 4 levels "\\N","AC","Bi",..: 2 2 2 2 2 2 4 4 3 4 ...
$ BidID : num 6.13e+18 6.13e+18 6.13e+18 6.13e+18 6.13e+18 ...
$ AskID : num 6.13e+18 6.13e+18 6.13e+18 6.13e+18 6.13e+18 ...
$ BidOrAsk : Factor w/ 1 level "\\N": 1 1 1 1 1 1 1 1 1 1 ...
$ BuyerBrokerID : int 229 229 229 229 229 229 236 129 229 112 ...
$ SellerBrokerID : int 297 210 210 210 110 157 229 229 299 229 ...
Upvotes: 0
Views: 2769
Reputation: 12703
# load data table library used for large data sets
library('data.table')
# convert factors into character
col1 <- colnames(df1)[sapply(df1, is.factor)] # get columns that are factors for df1
col2 <- colnames(df2)[sapply(df2, is.factor)] # get columns that are factors for df2
for(col in col1){ # df1
set(df1, , col, as.character( df1[[col]] ) ) # for more info on set() function, read ?`:=`
}
for(col in col2){ # df2
set(df2, , col, as.character( df2[[col]] ) )
}
# join two data frames by the selected columns in 'on' argument
setDT(df1)[df2, on = c('Instrument', 'RecordDate', 'HourMinuteSecond','MilliSecond')] # setDT converts data frame to data table by reference
# Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote i.L1BidPrice i.L1BidVolume i.L1AskPrice i.L1AskVolume
# 1: DTE 6/4/2012 16:10:27 42 6.91 520 6.92 3917 6.92 7 8 9 10
# i.Midquote
# 1: 11
# merge function in data table is faster than the same function in base R function. You just convert data frame into data tables.
setDT(df1)
setDT(df2)
merge(df1, df2, by = c('Instrument', 'RecordDate', 'HourMinuteSecond','MilliSecond'))
Data:
df1 <- structure(list(Instrument = "DTE", RecordDate = "6/4/2012", HourMinuteSecond = "16:10:27",
MilliSecond = 42L, L1BidPrice = 6.91, L1BidVolume = 520,
L1AskPrice = 6.92, L1AskVolume = 3917, Midquote = 6.92), .Names = c("Instrument",
"RecordDate", "HourMinuteSecond", "MilliSecond", "L1BidPrice",
"L1BidVolume", "L1AskPrice", "L1AskVolume", "Midquote"), row.names = c(NA, -1L), class = "data.frame")
df2 <- structure(list(Instrument = "DTE", RecordDate = "6/4/2012", HourMinuteSecond = "16:10:27",
MilliSecond = 42L, L1BidPrice = 7, L1BidVolume = 8, L1AskPrice = 9,
L1AskVolume = 10, Midquote = 11), .Names = c("Instrument",
"RecordDate", "HourMinuteSecond", "MilliSecond", "L1BidPrice",
"L1BidVolume", "L1AskPrice", "L1AskVolume", "Midquote"), row.names = 120486L, class = "data.frame")
df1
# Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote
# 1: DTE 6/4/2012 16:10:27 42 6.91 520 6.92 3917 6.92
df2
# Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote
# 120486: DTE 6/4/2012 16:10:27 42 7 8 9 10 11
Upvotes: 1