Reputation: 4126
I have a semi-large website stored locally (ripped from the server using httrack). This particular website's directory structure has several folders/subfolders as well as a large number of HTML files. I would like to know if there are any tools (it really can be anything: scripts, C/C++ code, etc.) that would let me generate a single word-frequency table across all the HTML files. The trick here is that I am only interested in counting actual content words (i.e., not HTML markup, although that could be stripped out in a later step if necessary). Any suggestions are much appreciated!
Upvotes: 0
Views: 205
Reputation: 21
See the advanced version of Hermetic Word Frequency Counter at http://www.hermetic.ch/wfca/wfca.htm, which scans multiple files and strips out HTML tags. It's not free, but it does a good job of counting words in HTML files, and it even handles subfolders.
Upvotes: 2
Reputation: 113965
Once you strip out the HTML code, use collections.Counter:
>>> import collections
>>> sentence = "Hello world. How are you? Hello"
>>> counts = collections.Counter(sentence.split())
Note that this still counts punctuation as part of each token, so "Hello," and "Hello" are treated as two different words.
If you don't have a way of stripping out the HTML yet, look into lxml for that.
Hope this helps
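Putting the pieces together, here is a minimal sketch of the whole pipeline using only the standard library (the html.parser module stands in for lxml here, so no third-party install is needed). The directory path, the TextExtractor class, and the count_words function are illustrative names, not part of any existing tool; the regex lowercases and splits on non-letters so the punctuation problem noted above goes away.

```python
import collections
import re
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text content, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def count_words(root_dir):
    """Walk root_dir recursively and count content words in every .html file."""
    counts = collections.Counter()
    for path in Path(root_dir).rglob("*.html"):
        extractor = TextExtractor()
        extractor.feed(path.read_text(errors="ignore"))
        text = " ".join(extractor.parts)
        # Lowercase and split on non-letters so "Hello," and "hello" collapse
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


if __name__ == "__main__":
    for word, count in count_words(".").most_common(20):
        print(f"{count:6d}  {word}")
```

Counter.most_common() then gives you the frequency table sorted by count, which you can print or dump to CSV.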
Upvotes: 3