Reputation: 4126
I have a semi-large website stored locally (ripped from the server using httrack). This particular website's directory structure has several folders/subfolders as well as a large number of HTML files. I would like to know if there are any tools (it really can be anything: scripts, C/C++ code, etc.) that would let me generate a single word-frequency table across all the HTML files. The trick here is that I am only interested in counting actual content words (i.e., not HTML markup, although that could be stripped out in a later step if necessary). Any suggestions are much appreciated!
Upvotes: 0
Views: 205
Reputation: 21
See the advanced version of Hermetic Word Frequency Counter at http://www.hermetic.ch/wfca/wfca.htm, which scans multiple files and strips out HTML tags. It's not free, but it does a good job of counting words in HTML files, and it even handles subfolders.
Upvotes: 2
Reputation: 113965
Once you strip out the HTML code, use collections.Counter:
>>> import collections
>>> sentence = "Hello world. How are you? Hello"
>>> counts = collections.Counter(sentence.split())
Note that this still counts punctuation as part of each token, so "Hello," and "Hello" are treated as two different words.
If you don't have a way of stripping out the HTML yet, look into lxml for that.
Hope this helps
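Putting the pieces together, here is a minimal sketch of the whole pipeline using only the standard library (the html.parser module stands in for lxml here, so no third-party install is needed). The directory path, the TextExtractor class, and the count_words function are illustrative names, not part of any existing tool; the regex lowercases and splits on non-letters so the punctuation problem noted above goes away.

```python
import collections
import re
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text content, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def count_words(root_dir):
    """Walk root_dir recursively and count content words in every .html file."""
    counts = collections.Counter()
    for path in Path(root_dir).rglob("*.html"):
        extractor = TextExtractor()
        extractor.feed(path.read_text(errors="ignore"))
        text = " ".join(extractor.parts)
        # Lowercase and split on non-letters so "Hello," and "hello" collapse
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


if __name__ == "__main__":
    for word, count in count_words(".").most_common(20):
        print(f"{count:6d}  {word}")
```

Counter.most_common() then gives you the frequency table sorted by count, which you can print or dump to CSV.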
Upvotes: 3