Reputation: 4470
I have a data table, dt:
V1 V2 V3 PubMedCounts
1: 0000100005 100-00-5 CAS Number 6
2: 0000100005 1-Chloro-4-nitrobenzene DescriptorName 12
3: 0000100005 aahs DescriptorName 111
4: 0000100005 PNCB Synonym 35
I also have a data table, ew, which has only one column of words, like:
V1
1: aah
2: aahed
3: aahing
4: aahs
5: aardvark
From the dt data table, I need to remove all rows where V2 is 5 characters or shorter, or where V2 is present in the ew data table.
For example, from the dt table above I would remove the 3rd and 4th rows.
I would like to use an apply function to make this efficient, as it's a pretty big data set.
Upvotes: 0
Views: 123
Reputation: 34733
If I understand you correctly, I would do:
dt[!ew, on = c(V2 = "V1")][nchar(V2) > 5]
which gives:
V1 V2 V3 PubMedCounts
1: 100005 100-00-5 CAS_Number 6
2: 100005 1-Chloro-4-nitrobenzene DescriptorName 12
Applying the conditions in the other order might be faster:
dt[nchar(V2) > 5][!ew, on = c(V2 = "V1")]
This avoids matching rows in dt that would be deleted in the next step anyway.
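Whether the order matters in practice depends on the data, so it may be worth timing both variants. The following is only a rough sketch: big_dt, big_ew and the helper rand_words are hypothetical stand-ins built from random strings, not the real data set, and system.time gives a coarse single-run measurement.

```r
library(data.table)

set.seed(1)

# Hypothetical helper: n random lowercase strings of length 3 to 10
rand_words <- function(n) {
  vapply(
    sample(3:10, n, replace = TRUE),
    function(k) paste(sample(letters, k, replace = TRUE), collapse = ""),
    character(1)
  )
}

# Synthetic stand-ins for dt and ew (assumed sizes, not from the question)
big_dt <- data.table(V2 = rand_words(1e5))
big_ew <- data.table(V1 = unique(rand_words(1e4)))

# Anti-join first, then the length filter
t_join_first <- system.time(
  r1 <- big_dt[!big_ew, on = c(V2 = "V1")][nchar(V2) > 5]
)

# Length filter first, then the anti-join
t_filter_first <- system.time(
  r2 <- big_dt[nchar(V2) > 5][!big_ew, on = c(V2 = "V1")]
)

print(t_join_first)
print(t_filter_first)

# Both orders should keep exactly the same values
stopifnot(identical(r1$V2, r2$V2))
```

For more reliable numbers, repeating each expression several times on a sample of the real data would be a better guide than a single run.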
A third possibility is using:
dt[nchar(V2) > 5 & !( V2 %chin% ew$V1 )]
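To check that the three expressions agree, here is a minimal sketch on the sample data from this post. The data is rebuilt with data.table() rather than structure(), and the res_* names are only illustrative.

```r
library(data.table)

# Sample data, with values taken from the post
dt <- data.table(
  V1 = c(100005L, 100005L, 100005L, 100005L),
  V2 = c("100-00-5", "1-Chloro-4-nitrobenzene", "aahs", "PNCB"),
  V3 = c("CAS_Number", "DescriptorName", "DescriptorName", "Synonym"),
  PubMedCounts = c(6L, 12L, 111L, 35L)
)
ew <- data.table(V1 = c("aah", "aahed", "aahing", "aahs", "aardvark"))

# 1. Anti-join on ew, then keep V2 values longer than 5 characters
res_join_first <- dt[!ew, on = c(V2 = "V1")][nchar(V2) > 5]

# 2. Length filter first, then the anti-join
res_filter_first <- dt[nchar(V2) > 5][!ew, on = c(V2 = "V1")]

# 3. Both conditions in a single logical subset
res_chin <- dt[nchar(V2) > 5 & !(V2 %chin% ew$V1)]

# All three should keep the same V2 values, in the same order
stopifnot(
  identical(res_join_first$V2, res_filter_first$V2),
  identical(res_join_first$V2, res_chin$V2)
)
print(res_chin$V2)
```

The third form builds one logical vector and subsets once, which avoids creating an intermediate table between the two steps.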
Used data:
dt <- structure(list(V1 = c(100005L, 100005L, 100005L, 100005L), V2 = c("100-00-5",
"1-Chloro-4-nitrobenzene", "aahs", "PNCB"), V3 = c("CAS_Number",
"DescriptorName", "DescriptorName", "Synonym"), PubMedCounts = c(6L,
12L, 111L, 35L)), .Names = c("V1", "V2", "V3", "PubMedCounts"
), row.names = c(NA, -4L), class = c("data.table", "data.frame"))
ew <- structure(list(V1 = c("aah", "aahed", "aahing", "aahs", "aardvark")), .Names = "V1", row.names = c(NA, -5L), class = c("data.table", "data.frame"))
Upvotes: 2