Avi
Avi

Reputation: 2283

Effecient way for finding values of one dataframe in another

I have the following dataframe (RES1):

    "sequence" "support"
"1" "<{OV50}>"   0.286
"2" "<{OV148}>"  0.121

And additional dataframe (SRC2):

  "sequenceID" "transactionID" "eventID" "items"
"1" 42207993       1577          1        "OV50"
"2" 42207993       6048          2        "OV11"
"3" 42207993       1597          3        "OV148"
"4" 57237976       12423         1        "OV56"
"5" 57237976       12589         2        "OV148"

I would like to get the following output dataframe (OUT3):

  "sequenceID" "transactionID" "eventID" "items"  "Exist" "Co"
"1" 42207993       1577          1        "OV50"     1
"2" 42207993       6048          2        "OV11"     0
"3" 42207993       1597          3        "OV148"    1       0.67
"4" 57237976       12423         1        "OV56"     0
"5" 57237976       12589         2        "OV148"    1       0.5

For Each row in SRC2 the "Exist" column in OUT3 will be '0' if there is no value at all in RES1. For instance, OV11 doesn't appear in RES1 at all so its value is 0. At the last value of sequenceID - the amount of '1' values is divided by the number of the same sequenceID and is added to "Co" column. In line 3 the there were 3 lines of sequenceID = 42207993 number of '1' are 2 so 2/3 = 0.67. I would like to find the most efficient way to do it because each of the dataframes is very big dataframe.

Upvotes: 0

Views: 46

Answers (1)

akrun
akrun

Reputation: 886978

One option would be using data.table. We convert the 'data.frame' to data.table (setDT(SRC2)), remove the punctuation characters in 'sequence' column of 'RES1' using gsub, check whether it is present in the 'items', coerce the logical vector to binary by wrapping with + and assign (:=) the output to a new column 'Exist'. Grouped by 'sequenceID', we divided the sum of 'Exist' by the nrow (.N), round, convert to 'character' and assign it as 'Co'. Then, we get the row index (.I) of those elements that are not the last row per each 'sequenceID' and assign those to ''.

library(data.table)#v1.9.6+
setDT(SRC2)[, Exist := +(items %chin% gsub('[^[:alnum:]]+', 
           '', RES1$sequence))]
i1 <- SRC2[, Co:= as.character(round(sum(Exist)/.N, 2)) , 
            sequenceID][, .I[1:(.N-1)], sequenceID]$V1
SRC2[i1, Co:= '']
SRC2
#   sequenceID transactionID eventID items Exist   Co
#1:   42207993          1577       1  OV50     1     
#2:   42207993          6048       2  OV11     0     
#3:   42207993          1597       3 OV148     1 0.67
#4:   57237976         12423       1  OV56     0     
#5:   57237976         12589       2 OV148     1  0.5

data

SRC2 <- structure(list(sequenceID = c(42207993L, 42207993L, 
 42207993L, 
 57237976L, 57237976L), transactionID = c(1577L, 6048L, 1597L, 
 12423L, 12589L), eventID = c(1L, 2L, 3L, 1L, 2L),
 items = c("OV50", 
 "OV11", "OV148", "OV56", "OV148")), .Names = c("sequenceID", 
 "transactionID", "eventID", "items"), class = "data.frame", 
 row.names = c("1", "2", "3", "4", "5"))

RES1 <- structure(list(sequence = c("<{OV50}>", "<{OV148}>"),
 support = c(0.286,    
 0.121)), .Names = c("sequence", "support"),
 class = "data.frame", row.names = c("1", "2"))

Upvotes: 1

Related Questions