Reputation: 1325
I am doing extensive work with a variety of word lists. Please consider the following question:
docText={"settlement", "new", "beginnings", "wildwood", "settlement", "book",
"excerpt", "agnes", "leffler", "perry", "my", "mother", "junetta",
"hally", "leffler", "brought", "my", "brother", "frank", "and", "me",
"to", "edmonton", "from", "monmouth", "illinois", "mrs", "matilda",
"groff", "accompanied", "us", "her", "husband", "joseph", "groff",
"my", "father", "george", "leffler", "and", "my", "uncle", "andrew",
"henderson", "were", "already", "in", "edmonton", "they", "came",
"in", "1910", "we", "arrived", "july", "1", "1911", "the", "sun",
"was", "shining", "when", "we", "arrived", "however", "it", "had",
"been", "raining", "for", "days", "and", "it", "was", "very",
"muddy", "especially", "around", "the", "cn", "train"}
searchWords={"the","for","my","and","me","and","we"}
Each of these lists is much longer in practice (say 250 words in the searchWords list, with docText being about 12,000 words).
Right now, I can figure out the frequency of a given word by doing something like:
docFrequency = Sort[Tally[docText], #1[[2]] > #2[[2]] &];
Flatten[Cases[docFrequency, {"settlement", _}]][[2]]
But where I am getting hung up is on generating specific lists: specifically, converting a list of words into a list of the frequencies with which those words appear. I've tried to do this with Do loops but have hit a wall.
I want to go through docText with searchWords and replace each element of docText with the frequency of its appearance. That is, since "settlement" appears twice, it would be replaced by 2 in the list, and since "my" appears 4 times, it would become 4. The list would then run 2, 1, 1, 1, 2, and so forth.
I suspect the answer lies somewhere in If[] and Map[]?
This all sounds weird, but I am trying to pre-process a batch of text for term-frequency analysis…
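To make this concrete, here is a naive sketch of what I am after (freq is just a throwaway helper name I made up, and this brute-force version would be far too slow on the real 12,000-word list):
(* replace each word in docText by its total count; memoizing freq
   avoids recounting the same word on every occurrence *)
ClearAll[freq];
freq[w_] := freq[w] = Count[docText, w];
freq /@ docText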
Addition for Clarity (I hope):
Here is a better example.
searchWords={"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "A", "about",
"above", "across", "after", "again", "against", "all", "almost",
"alone", "along", "already", "also", "although", "always", "among",
"an", "and", "another", "any", "anyone", "anything", "anywhere",
"are", "around", "as", "at", "b", "B", "back", "be", "became",
"because", "become", "becomes", "been", "before", "behind", "being",
"between", "both", "but", "by", "c", "C", "can", "cannot", "could",
"d", "D", "do", "done", "down", "during", "e", "E", "each", "either",
"enough", "even", "ever", "every", "everyone", "everything",
"everywhere", "f", "F", "few", "find", "first", "for", "four",
"from", "full", "further", "g", "G", "get", "give", "go", "h", "H",
"had", "has", "have", "he", "her", "here", "herself", "him",
"himself", "his", "how", "however", "i", "I", "if", "in", "interest",
"into", "is", "it", "its", "itself", "j", "J", "k", "K", "keep", "l",
"L", "last", "least", "less", "m", "M", "made", "many", "may", "me",
"might", "more", "most", "mostly", "much", "must", "my", "myself",
"n", "N", "never", "next", "no", "nobody", "noone", "not", "nothing",
"now", "nowhere", "o", "O", "of", "off", "often", "on", "once",
"one", "only", "or", "other", "others", "our", "out", "over", "p",
"P", "part", "per", "perhaps", "put", "q", "Q", "r", "R", "rather",
"s", "S", "same", "see", "seem", "seemed", "seeming", "seems",
"several", "she", "should", "show", "side", "since", "so", "some",
"someone", "something", "somewhere", "still", "such", "t", "T",
"take", "than", "that", "the", "their", "them", "then", "there",
"therefore", "these", "they", "this", "those", "though", "three",
"through", "thus", "to", "together", "too", "toward", "two", "u",
"U", "under", "until", "up", "upon", "us", "v", "V", "very", "w",
"W", "was", "we", "well", "were", "what", "when", "where", "whether",
"which", "while", "who", "whole", "whose", "why", "will", "with",
"within", "without", "would", "x", "X", "y", "Y", "yet", "you",
"your", "yours", "z", "Z"}
These are the automatically generated stopwords from WordData[]. I want to compare these words against docText: since "settlement" is NOT part of searchWords, it would appear as 0, but since "my" IS part of searchWords, it would show up as its count (so I could tell how many times each given word appears).
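In other words, something with the shape of this naive sketch (countIf is another made-up helper name, again far too slow at full scale, but it shows the intent):
(* 0 for words not in searchWords, otherwise the word's count in docText *)
ClearAll[countIf];
countIf[w_] := If[MemberQ[searchWords, w], Count[docText, w], 0];
countIf /@ docText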
I really do thank you for your help - I'm looking forward to taking some formal courses soon, as I'm bumping up against the edge of my ability to really explain what I want to do!
Upvotes: 3
Views: 371
Reputation: 24336
I set out to solve this in a different way from Szabolcs, but I ended up with something rather similar. Nevertheless, I think it is cleaner; on some data it is faster, on others slower.
docText /.
Dispatch[FilterRules[Rule @@@ Tally@docText, searchWords] ~Join~ {_String -> 0}]
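To unpack it a little: FilterRules keeps only the Tally rules whose left-hand sides appear in searchWords, and the catch-all _String -> 0 sends every other word to zero. A small illustration with made-up counts:
FilterRules[{"my" -> 4, "settlement" -> 2, "we" -> 2}, {"my", "we"}]
(* {"my" -> 4, "we" -> 2} *)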
Upvotes: 2
Reputation: 22579
@Szabolcs gave a fine solution, and I'd probably go the same route myself. Here is a slightly different solution, just for fun:
ClearAll[getFreqs];
getFreqs[docText_, searchWords_] :=
  Module[{dwords, dfreqs, inSearchWords, lset},
   SetAttributes[{lset, inSearchWords}, Listable];
   lset[args__] := Set[args]; (* a listable version of Set *)
   {dwords, dfreqs} = Transpose@Tally[docText]; (* unique words and their counts *)
   lset[inSearchWords[searchWords], True]; (* mark every search word in one call *)
   inSearchWords[_] = False; (* all other words are not search words *)
   dfreqs*Boole[inSearchWords[dwords]]] (* zero out counts of non-search words *)
This shows how the Listable attribute may be used to replace loops and even Map-ping. We have:
In[120]:= getFreqs[docText,searchWords]
Out[120]= {0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,3,1,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,2,
1,0,0,2,0,0,1,0,2,0,2,0,1,1,2,1,1,0,1,0,1,0,0,1,0,0}
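In case the trick is unfamiliar: a Listable function threads over list arguments automatically, which is what lets lset and inSearchWords above process whole lists in a single call. A tiny standalone illustration (f is a throwaway name):
ClearAll[f];
SetAttributes[f, Listable];
f[x_] := x^2;
f[{1, 2, 3}]
(* {1, 4, 9} *)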
Upvotes: 4
Reputation: 25703
We can replace everything that doesn't appear in searchWords by 0 in docText as follows:
preprocessedDocText =
Replace[docText,
Dispatch@Append[Thread[searchWords -> searchWords], _ -> 0], {1}]
Then we can replace the remaining words by their frequencies:
replaceTable = Dispatch[Rule @@@ Tally[docText]];
preprocessedDocText /. replaceTable
Dispatch preprocesses a list of rules (->) and speeds up replacement significantly in subsequent uses.
I have not benchmarked this on large data, but Dispatch
should provide a good speedup.
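For convenience, the two steps can be wrapped into a single function (termFreqs is just a name chosen for this sketch):
(* step 1 zeroes out words not in search; step 2 maps the rest to counts *)
ClearAll[termFreqs];
termFreqs[doc_, search_] :=
 Replace[doc, Dispatch@Append[Thread[search -> search], _ -> 0], {1}] /.
  Dispatch[Rule @@@ Tally[doc]];
termFreqs[docText, searchWords]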
Upvotes: 7