Everaldo Aguiar

Reputation: 4126

Word frequency counter for locally stored website

I have a semi-large website stored locally (ripped from the server using httrack). This particular website's directory structure has several folders/subfolders as well as a large number of HTML files. I would like to know if there are any tools (it really can be anything: scripts, C/C++ code, etc.) that would let me generate a single word frequency table across all the HTML files. The trick here is that I am only interested in counting actual content words (i.e., not HTML markup, although that could easily be removed later if need be). Any suggestions are much appreciated!

Upvotes: 0

Views: 205

Answers (2)

Eric B.

Reputation: 21

See the advanced version of Hermetic Word Frequency Counter at http://www.hermetic.ch/wfca/wfca.htm, which scans multiple files (including subfolders) and strips out HTML tags. It is not free, but it does a good job of counting words in HTML files.

Upvotes: 2

inspectorG4dget

Reputation: 113965

Once you strip out the HTML, use collections.Counter:

>>> import collections
>>> sentence = "Hello world. How are you? Hello"
>>> counts = collections.Counter(sentence.split())  # note: split() keeps punctuation, so "Hello," and "Hello" would count as two different words

If you don't have a way of stripping out the HTML, look into lxml to do so.
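For example, here is a minimal sketch that walks the mirrored folder tree, extracts the visible text of each page with lxml, and feeds everything into one Counter. The "mirrored_site" folder name and the word regex are assumptions you would adapt to your own rip:

import collections
import re
from pathlib import Path
from lxml import html

counts = collections.Counter()
for path in Path("mirrored_site").rglob("*.html"):        # hypothetical root of the httrack rip
    doc = html.parse(str(path))                           # parse one HTML file
    text = doc.getroot().text_content()                   # visible text only, tags stripped
    counts.update(re.findall(r"[a-z']+", text.lower()))   # crude word pattern; adjust as needed

for word, n in counts.most_common(20):                    # top 20 words across all files
    print(word, n)

Updating a single Counter file by file keeps memory usage proportional to the vocabulary rather than to the total amount of text.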

Hope this helps

Upvotes: 3
