canadian_scholar

Reputation: 1325

Converting a list of words into a list of the frequency in which those words appear

I am doing extensive work with a variety of word lists.

Please consider the following example:

docText={"settlement", "new", "beginnings", "wildwood", "settlement", "book",
"excerpt", "agnes", "leffler", "perry", "my", "mother", "junetta", 
"hally", "leffler", "brought", "my", "brother", "frank", "and", "me", 
"to", "edmonton", "from", "monmouth", "illinois", "mrs", "matilda", 
"groff", "accompanied", "us", "her", "husband", "joseph", "groff", 
"my", "father", "george", "leffler", "and", "my", "uncle", "andrew", 
"henderson", "were", "already", "in", "edmonton", "they", "came", 
"in", "1910", "we", "arrived", "july", "1", "1911", "the", "sun", 
"was", "shining", "when", "we", "arrived", "however", "it", "had", 
"been", "raining", "for", "days", "and", "it", "was", "very", 
"muddy", "especially", "around", "the", "cn", "train"}

searchWords={"the","for","my","and","me","and","we"}

Each of these lists is much longer in practice (say, 250 words in searchWords, and about 12,000 words in docText).

Right now, I can find the frequency of a given word by doing something like:

docFrequency = Sort[Tally[docText], #1[[2]] > #2[[2]] &];  (* {word, count} pairs, most frequent first *)
Flatten[Cases[docFrequency, {"settlement", _}]][[2]]       (* pull out the count for one word *)

Where I am getting hung up is converting a list of words into a list of the frequencies with which those words appear. I've tried to do this with Do loops but have hit a wall.

I want to go through docText with searchWords and replace each element of docText with the frequency of its appearance. I.e., since "settlement" appears twice, it would be replaced by 2 in the list, and since "my" appears 4 times, it would become 4. The list would then start 2, 1, 1, 1, 2, and so forth.
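For concreteness, here is the kind of call I imagine (wordsToFreqs is just a hypothetical name for whatever the right construct turns out to be):

wordsToFreqs[{"settlement", "new", "beginnings", "wildwood", "settlement"}]
(* desired result: {2, 1, 1, 1, 2} *)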

I suspect the answer lies somewhere in If[] and Map[]?

This all sounds weird, but I am trying to pre-process a batch of documents for term-frequency analysis.


Addition for Clarity (I hope):

Here is a better example.

searchWords={"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "A", "about", 
"above", "across", "after", "again", "against", "all", "almost", 
"alone", "along", "already", "also", "although", "always", "among", 
"an", "and", "another", "any", "anyone", "anything", "anywhere", 
"are", "around", "as", "at", "b", "B", "back", "be", "became", 
"because", "become", "becomes", "been", "before", "behind", "being", 
"between", "both", "but", "by", "c", "C", "can", "cannot", "could", 
"d", "D", "do", "done", "down", "during", "e", "E", "each", "either", 
"enough", "even", "ever", "every", "everyone", "everything", 
"everywhere", "f", "F", "few", "find", "first", "for", "four", 
"from", "full", "further", "g", "G", "get", "give", "go", "h", "H", 
"had", "has", "have", "he", "her", "here", "herself", "him", 
"himself", "his", "how", "however", "i", "I", "if", "in", "interest", 
"into", "is", "it", "its", "itself", "j", "J", "k", "K", "keep", "l", 
"L", "last", "least", "less", "m", "M", "made", "many", "may", "me", 
"might", "more", "most", "mostly", "much", "must", "my", "myself", 
"n", "N", "never", "next", "no", "nobody", "noone", "not", "nothing", 
"now", "nowhere", "o", "O", "of", "off", "often", "on", "once", 
"one", "only", "or", "other", "others", "our", "out", "over", "p", 
"P", "part", "per", "perhaps", "put", "q", "Q", "r", "R", "rather", 
"s", "S", "same", "see", "seem", "seemed", "seeming", "seems", 
"several", "she", "should", "show", "side", "since", "so", "some", 
"someone", "something", "somewhere", "still", "such", "t", "T", 
"take", "than", "that", "the", "their", "them", "then", "there", 
"therefore", "these", "they", "this", "those", "though", "three", 
"through", "thus", "to", "together", "too", "toward", "two", "u", 
"U", "under", "until", "up", "upon", "us", "v", "V", "very", "w", 
"W", "was", "we", "well", "were", "what", "when", "where", "whether", 
"which", "while", "who", "whole", "whose", "why", "will", "with", 
"within", "without", "would", "x", "X", "y", "Y", "yet", "you", 
"your", "yours", "z", "Z"}

These are the automatically generated stopwords from WordData[]. I want to compare these words against docText. Since "settlement" is NOT in searchWords, it would appear as 0; but since "my" is in searchWords, it would show up as its count (so I could tell how many times the given word appears).

I really do thank you for your help - I'm looking forward to taking some formal courses soon, as I'm bumping up against the edge of my ability to explain what I want to do!

Upvotes: 3

Views: 371

Answers (3)

Mr.Wizard

Reputation: 24336

I set out to solve this in a different way from Szabolcs, but I ended up with something rather similar.

Nevertheless, I think it is cleaner. On some data it is faster, on others slower.

docText /. 
  Dispatch[FilterRules[Rule @@@ Tally@docText, searchWords] ~Join~ {_String -> 0}]
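To see what this does, it may help to evaluate the FilterRules step on its own; for the sample data above it should yield something like the following (modulo the exact ordering):

FilterRules[Rule @@@ Tally@docText, searchWords]
(* {"my" -> 4, "and" -> 3, "me" -> 1, "we" -> 2, "the" -> 2, "for" -> 1} *)

Appending the catch-all _String -> 0 then sends every word not covered by one of these rules to 0.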

Upvotes: 2

Leonid Shifrin

Reputation: 22579

@Szabolcs gave a fine solution, and I'd probably go the same route myself. Here is a slightly different solution, just for fun:

ClearAll[getFreqs];
getFreqs[docText_, searchWords_] :=
  Module[{dwords, dfreqs, inSearchWords, lset},
    SetAttributes[{lset, inSearchWords}, Listable];
    lset[args__] := Set[args];                    (* a listable version of Set *)
    {dwords, dfreqs} = Transpose@Tally[docText];  (* unique words and their counts *)
    lset[inSearchWords[searchWords], True];       (* mark all search words True in one threaded call *)
    inSearchWords[_] = False;                     (* every other word tests False *)
    dfreqs*Boole[inSearchWords[dwords]]]          (* zero out the counts of non-search words *)

This shows how the Listable attribute may be used to replace loops and even Map-ping. We have:

In[120]:= getFreqs[docText,searchWords]
Out[120]= {0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,3,1,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,2,
1,0,0,2,0,0,1,0,2,0,2,0,1,1,2,1,1,0,1,0,1,0,0,1,0,0}
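As a stripped-down illustration of the same mechanism (f is just a throwaway symbol), a Listable function threads over list arguments on its own, with no explicit Map:

ClearAll[f];
SetAttributes[f, Listable];
f[x_] := x^2;
f[{1, 2, 3}]
(* {1, 4, 9} *)

The listable lset and inSearchWords inside getFreqs exploit exactly this threading.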

Upvotes: 4

Szabolcs

Reputation: 25703

We can replace everything that doesn't appear in searchWords by 0 in docText as follows:

preprocessedDocText = 
   Replace[docText, 
     Dispatch@Append[Thread[searchWords -> searchWords], _ -> 0], {1}]
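For the short searchWords given in the question, the first few elements should come out like this (non-search words are already 0):

Take[preprocessedDocText, 12]
(* {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, "my", 0} *)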

Then we can replace the remaining words with their frequencies:

replaceTable = Dispatch[Rule @@@ Tally[docText]];

preprocessedDocText /. replaceTable

Dispatch preprocesses a list of rules (->) and speeds up replacement significantly in subsequent uses.

I have not benchmarked this on large data, but Dispatch should provide a good speedup.
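A rough way to check on data of the size mentioned in the question - the random 12,000-word document below is purely illustrative:

bigDoc = RandomChoice[docText, 12000];       (* fake 12,000-word document *)
rules = Rule @@@ Tally[bigDoc];
AbsoluteTiming[bigDoc /. rules;]             (* plain rule list *)
AbsoluteTiming[bigDoc /. Dispatch[rules];]   (* with a dispatch table *)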

Upvotes: 7
