Reputation: 38225
I am using the following code for finding number of occurrences of a word memory
in a file and I am getting the wrong result. Can you please help me to know what I am missing?
NOTE1: The question is looking for exact occurrence of word "memory"! NOTE2: What I have realized they are exactly looking for "memory" and even something like "memory," is not accepted! That was the part which has brought up the confusion I guess. I tried it for word "action" and the correct answer is 7! You can try as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
Upvotes: 5
Views: 1751
Reputation: 1401
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan
encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names
array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan
to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep
. It does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines as some people have suggested.
Upvotes: 7
Reputation: 7905
As pointed by @andrew, my previous answer would give wrong results if a word repeats on the same line. Based on other answers/comments, this one seems ok:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10
Upvotes: 3