user3106040

Grep includes values that I have not specified

I'm trying to get a list of rows based on some column values, but when I grep for the values, the returned list of rows includes rows with values that I have not specified.

Specifically, I want the rows in which df$likert_classification is not NA and df$variable is 'picture_trained', 'unambig_picture_proportion' or 'ambig_picture_proportion'. But the rows returned by grep include 'criterion block' rows, which I haven't asked for.

url <- 'http://pastebin.com/raw.php?i=DbMasR8Y'
df <- read.csv(url, as.is=T)
picture_transfer_trials_likert <- grep('picture_trained|unambig_picture_proportion|ambig_picture_proportion', df$variable[which(!is.na(df$likert_classification))])
unique(df$variable[picture_transfer_trials_likert])

criterion block should not be included in the rows specified by picture_transfer_trials_likert, yet it is. What's going on here?

Upvotes: 0

Views: 60

Answers (2)

bgoldst

Reputation: 35314

The issue you ran into is that grep(), used this way, returns match indexes relative to the exact vector that grep() received. You didn't pass df$variable to grep(); you passed a subset of that vector, specifically df$variable[which(!is.na(df$likert_classification))]. If you're going to use the resulting indexes to index df$variable, you have to subset it in exactly the same way for the indexes to be applicable. So, for example, this works:

unique(df$variable[!is.na(df$likert_classification)][picture_transfer_trials_likert]);
## [1] "unambig_picture_proportion" "picture_trained"            "ambig_picture_proportion"

Also notice that I omitted the which() call, which is unnecessary here, since you can use logical vector indexing directly, as opposed to numeric vector indexing, which is what you'd be doing with the which() call.
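To make the index mismatch concrete, here is a minimal sketch on a small made-up data frame. The toy values below are an assumption for illustration, not the contents of the pastebin file. It shows that the grep() indexes are positions within the subset, and one way to map them back to row numbers of the full data frame:

```r
# Hypothetical toy data standing in for the pastebin CSV (not the real data).
df <- data.frame(
  variable = c("criterion block", "picture_trained", "criterion block",
               "ambig_picture_proportion", "unambig_picture_proportion"),
  likert_classification = c(1, NA, NA, 2, 3),
  stringsAsFactors = FALSE
)

keep <- !is.na(df$likert_classification)  # logical mask over all rows
sub  <- df$variable[keep]                 # only the values from rows 1, 4 and 5

idx <- grep("picture", sub)               # positions within sub, not within df

wrong <- df$variable[idx]                 # applies subset positions to the full vector
right <- sub[idx]                         # applies them to the vector grep() saw

rows <- which(keep)[idx]                  # maps subset positions back to df row numbers
```

On this toy data, wrong picks up a "criterion block" row, while right and df$variable[rows] contain only the picture variables. The which(keep)[idx] step is useful if you need the actual rows of df rather than just the matched strings.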

But it is also worth mentioning that grep() has another very useful mode: You can pass value=T to get the actual matches, rather than match indexes, and thus your requirement can be accomplished in one line:

unique(grep('^picture_trained$|^unambig_picture_proportion$|^ambig_picture_proportion$', df$variable[!is.na(df$likert_classification)], value=T ));
## [1] "unambig_picture_proportion" "picture_trained"            "ambig_picture_proportion"

Also notice that I included anchors in the regex, because it looks to me like you're trying to do an exact full-string match. But if that is really the case, then you shouldn't be using regular expressions at all; you should be doing string equality testing.
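As a rough sketch of what string equality testing could look like, %in% compares whole strings, so no anchors are needed. The data frame and the names targets and sel below are made up for illustration and are not taken from the original data:

```r
# Hypothetical toy data; "picture_trained_old" is an invented near-miss value.
df <- data.frame(
  variable = c("criterion block", "picture_trained", "picture_trained_old",
               "ambig_picture_proportion", "unambig_picture_proportion"),
  likert_classification = c(1, 2, 3, NA, 4),
  stringsAsFactors = FALSE
)

targets <- c("picture_trained", "unambig_picture_proportion",
             "ambig_picture_proportion")

# One logical mask over all rows of df: exact match on variable AND non-NA likert.
sel <- !is.na(df$likert_classification) & df$variable %in% targets

matched_rows <- which(sel)        # row numbers in df itself, no re-indexing needed
matched_vars <- unique(df$variable[sel])
```

Because sel has one element per row of df, it can index df directly (for example df[sel, ]), which sidesteps the subset-versus-full-vector mismatch entirely.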

And actually, further to the above, upon looking at your logic, it looks to me like the end result of your code is that you're taking three fixed strings and simply returning whichever of them can be found anywhere in df$variable in a row where df$likert_classification is not NA. This can actually be done with intersect(), passing the vector of fixed strings as the first argument, and the vector of unique values in df$variable excluding those in rows with an NA value in df$likert_classification as the second argument:

intersect(c('picture_trained','unambig_picture_proportion','ambig_picture_proportion'),unique(df$variable[!is.na(df$likert_classification)]));
## [1] "picture_trained"            "unambig_picture_proportion" "ambig_picture_proportion"

Upvotes: 3

jalapic

Reputation: 14202

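It has to do with how you're filtering the NA observations. As far as I can tell, comparing with != 'NA' gives NA (rather than FALSE) for the missing values, and indexing with NA returns an NA element instead of dropping it, so the subset stays the same length as df$variable and the grep indexes still line up with it. I found that this workaround works; I just changed the pattern to 'picture' as it was faster to type: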

picture_transfer_trials_likert <- grep('picture', df$variable[df$likert_classification!='NA'])
unique(df$variable[picture_transfer_trials_likert])

#[1] "unambig_picture_proportion" "picture_trained"            "ambig_picture_proportion"

This uses your original grep:

picture_transfer_trials_likert <- grep('picture_trained|unambig_picture_proportion|ambig_picture_proportion', df$variable[df$likert_classification!='NA'])
unique(df$variable[picture_transfer_trials_likert])

#[1] "unambig_picture_proportion" "picture_trained"            "ambig_picture_proportion"
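One possible explanation for why the indexes line up in this version can be sketched on made-up vectors (the values below are invented for illustration). Comparing a missing value with 'NA' yields NA, and an NA inside a logical index produces an NA element rather than removing it, so positions in the subset are the same as positions in the original vector:

```r
# Hypothetical stand-ins for df$variable and df$likert_classification.
variable <- c("criterion block", "picture_trained", "ambig_picture_proportion")
likert   <- c(NA, 2, 3)

mask <- likert != 'NA'        # NA where likert is missing, TRUE elsewhere
sub  <- variable[mask]        # the missing row becomes an NA element, not a gap

same_length <- length(sub) == length(variable)

idx <- grep('picture', sub)   # NA elements never match, so they are skipped
result <- variable[idx]       # positions are valid for the full vector here
```

Note that this relies on the comparison never being FALSE. If some rows actually contained the string 'NA', those would be dropped and the positions would shift again, so !is.na() combined with consistent indexing is probably the safer habit.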

Upvotes: 0
