Reputation: 703
Three columns of my data.frame contain subjects. I want to subset this data.frame for different subjects. E.g. if I want a data.frame for the subject "apple", a row should be selected if the word "apple" appears in any of the three columns.
doc <- c("blabla1", "blabla2", "blabla3", "blabla4")
subj.1 <- c("apple", "prune", "coconut", "berry")
subj.2 <- c("coconut", "apple", "cherry", "banana and prune")
subj.3 <- c("berry", "banana", "apple and berry", "pear")
subjects <- c("apple", "prune", "coconut", "berry", "cherry", "pear", "banana")
mydf <- data.frame(doc, subj.1, subj.2, subj.3, stringsAsFactors=FALSE)
mydf
#       doc  subj.1           subj.2          subj.3
# 1 blabla1   apple          coconut           berry
# 2 blabla2   prune            apple          banana
# 3 blabla3 coconut           cherry apple and berry
# 4 blabla4   berry banana and prune            pear
The output for the subject "apple" should look like this:
#       doc  subj.1  subj.2          subj.3
# 1 blabla1   apple coconut           berry
# 2 blabla2   prune   apple          banana
# 3 blabla3 coconut  cherry apple and berry
EDIT1: In addition, let's say I have about 200 different subjects and therefore want 200 different data.frames. How could I do that?
I tried a loop approach:
mylist <- vector('list', length(subjects))
for (i in seq_along(subjects)) {
  pattern <- subjects[i]
  # keep rows where the pattern matches in any of the three subject columns
  filter <- grepl(pattern, ignore.case=TRUE, mydf$subj.1) |
    grepl(pattern, ignore.case=TRUE, mydf$subj.2) |
    grepl(pattern, ignore.case=TRUE, mydf$subj.3)
  subDF <- mydf[filter,]
  mylist[[i]] <- subDF
}
but on my real data.frame (called panel, with columns like SUBJECT.1) I get this error:
Error in grepl(pattern, ignore.case = T, panel$SUBJECT.1) :
invalid regular expression 'C++ PROGRAMMING', reason 'Invalid use of repetition operators'
EDIT2: Oh, I see: in the original data.frame, one of the subjects is "C++ PROGRAMMING". Might the "++" cause the error?
Upvotes: 0
Views: 164
Reputation: 57220
You can use the grepl function:
pattern <- 'apple'
filter <- grepl(pattern, ignore.case=T, mydf$subj.1) |
grepl(pattern, ignore.case=T, mydf$subj.2) |
grepl(pattern, ignore.case=T, mydf$subj.3)
subDF <- mydf[filter,]
> subDF
      doc  subj.1  subj.2          subj.3
1 blabla1   apple coconut           berry
2 blabla2   prune   apple          banana
3 blabla3 coconut  cherry apple and berry
EDIT:
About your question on the for-loop: I don't see any problem with using it, and I doubt that an apply-family function would give much benefit in terms of execution time.
As for the error: the pattern string passed to grepl has to be a valid regular expression, but '+' is a special character (a repetition operator), so '++' is not allowed.
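To illustrate the point about '+', here is a minimal sketch. The vector x and the variable names are made up for the example, and wrapping each '+' in a character class is only one possible way to neutralise it (other regex metacharacters in your subjects would need similar treatment):

```r
pattern <- "C++ PROGRAMMING"
x <- c("C++ PROGRAMMING", "apple", "intro to c programming")  # made-up sample values

# Used as a regular expression, the pattern is invalid, so grepl() stops with an error.
res <- tryCatch(suppressWarnings(grepl(pattern, x)),
                error = function(e) "regex error")

# Replacing each literal "+" with the character class "[+]" gives a valid regex
# that matches a plus sign, so ignore.case can still be used.
escaped <- gsub("+", "[+]", pattern, fixed = TRUE)
matches <- grepl(escaped, x, ignore.case = TRUE)
```

Here escaped becomes "C[+][+] PROGRAMMING", which the regex engine accepts.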
Anyway, if you just want to check whether the subject string is contained in the column, you can disable the regular expression engine by setting the grepl argument fixed=TRUE (this means the pattern is a string to be matched as is).
The only drawback is that ignore.case cannot be used with fixed = TRUE; it is ignored with a warning.
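Putting this together for all subjects, something along these lines should work. It is only a sketch: has_subject and subj_cols are names invented here, and lower-casing both sides with tolower is one assumed workaround for the missing ignore.case, not the only one.

```r
doc    <- c("blabla1", "blabla2", "blabla3", "blabla4")
subj.1 <- c("apple", "prune", "coconut", "berry")
subj.2 <- c("coconut", "apple", "cherry", "banana and prune")
subj.3 <- c("berry", "banana", "apple and berry", "pear")
mydf <- data.frame(doc, subj.1, subj.2, subj.3, stringsAsFactors = FALSE)
subjects <- c("apple", "prune", "coconut", "berry", "cherry", "pear", "banana")

subj_cols <- c("subj.1", "subj.2", "subj.3")

# Hypothetical helper: TRUE for each row where `subject` occurs literally
# (compared in lower case) in at least one of the columns `cols`.
has_subject <- function(df, subject, cols) {
  hits <- lapply(df[cols], function(col) {
    grepl(tolower(subject), tolower(col), fixed = TRUE)
  })
  Reduce(`|`, hits)
}

# One data.frame per subject, stored in a named list.
mylist <- lapply(subjects, function(s) mydf[has_subject(mydf, s, subj_cols), ])
names(mylist) <- subjects

mylist[["apple"]]
```

Because the pattern is matched as a fixed string, a subject such as "C++ PROGRAMMING" no longer raises an error.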
Upvotes: 2