Reputation: 2283
I have the following dataframe (RES1):
"sequence" "support"
"1" "<{OV50}>" 0.286
"2" "<{OV148}>" 0.121
And additional dataframe (SRC2):
"sequenceID" "transactionID" "eventID" "items"
"1" 42207993 1577 1 "OV50"
"2" 42207993 6048 2 "OV11"
"3" 42207993 1597 3 "OV148"
"4" 57237976 12423 1 "OV56"
"5" 57237976 12589 2 "OV148"
I would like to get the following output dataframe (OUT3):
"sequenceID" "transactionID" "eventID" "items" "Exist" "Co"
"1" 42207993 1577 1 "OV50" 1
"2" 42207993 6048 2 "OV11" 0
"3" 42207993 1597 3 "OV148" 1 0.67
"4" 57237976 12423 1 "OV56" 0
"5" 57237976 12589 2 "OV148" 1 0.5
For Each row in SRC2 the "Exist" column in OUT3 will be '0' if there is no value at all in RES1. For instance, OV11 doesn't appear in RES1 at all so its value is 0. At the last value of sequenceID - the amount of '1' values is divided by the number of the same sequenceID and is added to "Co" column. In line 3 the there were 3 lines of sequenceID = 42207993 number of '1' are 2 so 2/3 = 0.67. I would like to find the most efficient way to do it because each of the dataframes is very big dataframe.
Upvotes: 0
Views: 46
Reputation: 886978
One option would be using data.table
. We convert the 'data.frame' to data.table
(setDT(SRC2)
), remove the punctuation characters in 'sequence' column of 'RES1' using gsub
, check whether it is present in the 'items', coerce the logical vector to binary by wrapping with +
and assign (:=
) the output to a new column 'Exist'. Grouped by 'sequenceID', we divided the sum
of 'Exist' by the nrow (.N
), round
, convert to 'character' and assign it as 'Co'. Then, we get the row index (.I
) of those elements that are not the last row per each 'sequenceID' and assign those to ''
.
library(data.table)#v1.9.6+
setDT(SRC2)[, Exist := +(items %chin% gsub('[^[:alnum:]]+',
'', RES1$sequence))]
i1 <- SRC2[, Co:= as.character(round(sum(Exist)/.N, 2)) ,
sequenceID][, .I[1:(.N-1)], sequenceID]$V1
SRC2[i1, Co:= '']
SRC2
# sequenceID transactionID eventID items Exist Co
#1: 42207993 1577 1 OV50 1
#2: 42207993 6048 2 OV11 0
#3: 42207993 1597 3 OV148 1 0.67
#4: 57237976 12423 1 OV56 0
#5: 57237976 12589 2 OV148 1 0.5
SRC2 <- structure(list(sequenceID = c(42207993L, 42207993L,
42207993L,
57237976L, 57237976L), transactionID = c(1577L, 6048L, 1597L,
12423L, 12589L), eventID = c(1L, 2L, 3L, 1L, 2L),
items = c("OV50",
"OV11", "OV148", "OV56", "OV148")), .Names = c("sequenceID",
"transactionID", "eventID", "items"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5"))
RES1 <- structure(list(sequence = c("<{OV50}>", "<{OV148}>"),
support = c(0.286,
0.121)), .Names = c("sequence", "support"),
class = "data.frame", row.names = c("1", "2"))
Upvotes: 1