Reputation: 99
I've written a simple parser in Python for this website. Below is part of my code.
My question is: how can I get the text not only from p[1] but also from the rest (p[2], p[3], ...)?
text1 = xmldata.xpath('//p[@class="MsoNormal"][1]//text()')
a=''
for i in text1:
    a=a+i.encode('cp1251')
print a
Upvotes: 1
Views: 351
Reputation: 295815
Simply remove the [1] predicate to stop filtering. xpath() then returns a list with the text nodes of every matching paragraph, which you can pass to ''.join() to concatenate (or '\n'.join() if you want a newline between each string).
text_sections = xmldata.xpath('//p[@class="MsoNormal"]//text()')
print u'\n'.join(text_sections).encode('cp1251')
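To see the effect of the [1] predicate in isolation, here is a minimal sketch that runs against a made-up HTML snippet instead of the real page. The markup and the variable names are assumptions for illustration only, and it is written to run on Python 3 (the u'' prefixes are harmless there).

```python
import lxml.html

# Hypothetical markup standing in for the real page.
html = (u'<div>'
        u'<p class="MsoNormal">first</p>'
        u'<p class="MsoNormal">second</p>'
        u'<p class="MsoNormal">third</p>'
        u'</div>')
root = lxml.html.fromstring(html)

# With [1]: only the first matching p under each parent is kept.
first_only = root.xpath('//p[@class="MsoNormal"][1]//text()')

# Without the predicate: text nodes from every matching p.
all_text = root.xpath('//p[@class="MsoNormal"]//text()')

print(u'\n'.join(all_text))
```

Both calls return a list; the only difference is how many paragraphs contribute text nodes to it.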
Upvotes: 2
Reputation: 20748
You can use the lxml.html.parse() function, which accepts file-like objects such as the one urllib.urlopen() returns. See the lxml documentation on that.
Then, as @CharlesDuffy suggests, you can use u'\n'.join() to concatenate all text nodes within the p elements you select, separated by newlines.
Also, I would suggest working with unicode strings throughout, until you need to print or write to a file.
import urllib
import lxml.html
page = urllib.urlopen('http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=1&Itemid=2')
# use "page" as a file-like object
xmldata = lxml.html.parse(page).getroot()
ptexts = xmldata.xpath('//p[@class="MsoNormal"]//text()')
joined_text = u'\n'.join(ptexts)
print joined_text.encode('cp1251')
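The "unicode until the last moment" advice can be sketched without any network access. The sample strings below are made up and simply stand in for the text nodes xpath() would return. The snippet is written to run on Python 3 and uses only the standard library.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample data standing in for the xpath() result.
pieces = [u'Київ', u'Львів', u'Одеса']

# Keep everything as text while processing...
joined_text = u'\n'.join(pieces)

# ...and encode exactly once, at the output boundary.
encoded = joined_text.encode('cp1251')

# cp1251 is a single-byte encoding, so these characters round-trip cleanly.
round_trip = encoded.decode('cp1251')
print(round_trip == joined_text)
```

Encoding once at the end keeps the intermediate steps free of byte/text mixing, which is where most encoding errors in scrapers tend to come from.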
Upvotes: 1
Reputation: 1492
Without knowing more of the background, I can only suggest this:
texts = []
index = 0
while True:
    index += 1
    # xpath() returns an empty list (it does not raise) when nothing matches
    temp = xmldata.xpath('//p[@class="MsoNormal"][%i]//text()' % index)
    if not temp:
        break
    texts.append(temp)
After this block runs, texts holds one list per index, each of the same form as your text1.
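If the goal is one list of strings per paragraph, an alternative sketch is to select the p elements first and then query each one with a relative path. The HTML below is made up for illustration, and this is only one possible way to do it, written to run on Python 3. Note that xpath() returns an empty list, rather than raising an exception, when an index matches nothing.

```python
import lxml.html

# Hypothetical markup standing in for the real page.
html = (u'<div>'
        u'<p class="MsoNormal">one <b>bold</b></p>'
        u'<p class="MsoNormal">two</p>'
        u'<p class="Other">skip me</p>'
        u'</div>')
root = lxml.html.fromstring(html)

# Select the paragraph elements themselves, not their text nodes.
paragraphs = root.xpath('//p[@class="MsoNormal"]')

# './/text()' is relative to each p, so every paragraph gets its own list.
texts = [p.xpath('.//text()') for p in paragraphs]

# An out-of-range position yields an empty list, not an exception.
missing = root.xpath('//p[@class="MsoNormal"][99]//text()')
```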
Upvotes: 0