Reputation: 99
My task is to parse an HTML page (in Cyrillic) and extract certain words from it. Here is the page I have to parse: http://www.toponymic-dictionary.in.ua/. So far I have only managed to fetch the page:
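from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from lxml.html import fromstring

url = 'http://www.toponymic-dictionary.in.ua/'
# Download the raw bytes of the page
content = urlopen(url).read()
# Parse the bytes into an lxml element tree
doc = fromstring(content)
# Rewrite relative links so they are resolved against the page URL
doc.make_links_absolute(url)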
The HTML is too complicated for me to work out the right XPath expressions, so I don't know how to proceed with the parsing.
Upvotes: 1
Views: 227
Reputation: 2867
Have a look at this library: BeautifulSoup
and its documentation.
It should be a good fit for what you need.
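As a rough sketch of how that could look: the snippet below runs BeautifulSoup over a small inline HTML sample rather than the real page, because the actual markup of the site is not shown here. The sample markup, the assumption that headwords sit in `<b>` tags, and the helper names `extract_headwords` and `extract_cyrillic_words` are all hypothetical, so the selectors will need adjusting once you have inspected the page source.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup will differ.
SAMPLE_HTML = """
<html><body>
  <div class="entry"><b>Київ</b> - столиця України</div>
  <div class="entry"><b>Львів</b> - місто на заході</div>
  <script>var x = "ignored";</script>
</body></html>
"""

# Russian and Ukrainian Cyrillic letters (assumed to be enough for this page).
CYRILLIC_WORD = re.compile(r"[А-Яа-яЁёІіЇїЄєҐґ]+")


def extract_headwords(html):
    """Return the text of every <b> tag (assumed here to mark headwords)."""
    soup = BeautifulSoup(html, "html.parser")
    return [b.get_text(strip=True) for b in soup.find_all("b")]


def extract_cyrillic_words(html):
    """Return all Cyrillic words found in the visible text of the page."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements so their contents are not treated as text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    return CYRILLIC_WORD.findall(text)


if __name__ == "__main__":
    print(extract_headwords(SAMPLE_HTML))
    print(extract_cyrillic_words(SAMPLE_HTML))
```

For the real page you can pass the downloaded bytes straight to `BeautifulSoup(content, "html.parser")`; it will try to detect the encoding itself, which usually helps with Cyrillic text.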
Cheers!
Upvotes: 1