Reputation: 15
I have a set of 10-15 HTML documents to which I need to apply the LDA algorithm in gensim. I am stuck on creating the corpus, as I don't understand how to build one from a collection of HTML documents. The example on the site only shows how to create a corpus from a compressed Wikipedia dump (.xml.bz2).
Can anyone guide me on how to apply LDA to a set of HTML documents? Thanks in advance.
Upvotes: 1
Views: 317
Reputation: 4276
Check out HTML processing libraries, like lxml or BeautifulSoup.
For higher level processing (removal of boilerplate, extracting plain text from HTML), have a look at e.g. Honza Pomikalek's jusText package.
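As a rough illustration of the text-extraction step, here is a minimal sketch that uses only Python's standard-library `html.parser` rather than lxml, BeautifulSoup, or jusText (a swap made purely to keep the example dependency-free). The names `TextExtractor` and `html_to_text` are hypothetical, and unlike jusText this does no boilerplate removal; it only drops `<script>` and `<style>` contents.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For real pages with navigation bars, footers, and ads, a dedicated tool such as jusText will likely give cleaner input for topic modelling than this sketch.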
Once you have plain text documents, you can proceed as per gensim's tutorials.
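For what it's worth, the step from plain text to an LDA model might look something like the sketch below. It assumes gensim is installed and uses its `corpora.Dictionary`, `doc2bow`, and `models.LdaModel`; the helper names `tokenize` and `train_lda`, the tiny stopword set, and the parameter values (`num_topics=2`, `passes=10`) are illustrative assumptions, not recommendations.

```python
import re

# Tiny illustrative stopword list (an assumption); use a fuller one in practice.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens longer than 2 chars, drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def train_lda(texts, num_topics=2, passes=10):
    """Build a bag-of-words corpus from plain-text documents and fit LDA."""
    # Imported here so tokenize() stays usable without gensim installed.
    from gensim import corpora, models

    tokenized = [tokenize(t) for t in texts]
    dictionary = corpora.Dictionary(tokenized)               # token -> integer id
    corpus = [dictionary.doc2bow(doc) for doc in tokenized]  # (id, count) pairs
    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=num_topics, passes=passes)
    return lda, dictionary, corpus
```

With a list `docs` of plain-text strings (one per HTML file), calling `lda, dictionary, corpus = train_lda(docs)` and then `lda.print_topics(num_words=5)` would list the top words per topic. With only 10-15 documents, the topics are likely to be noisy, so a small `num_topics` is probably the safer starting point.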
Upvotes: 1