Khrystyna Pyurkovska

Reputation: 99

XPath for parsing in Python

I've written a simple parser in Python for this website. Below is part of my code.
My questions are:

  1. How can I extract not only p[1] but also the rest: p[2], p[3], ...?
  2. How can I separate them from each other?

text1 = xmldata.xpath('//p[@class="MsoNormal"][1]//text()')
a = ''
for i in text1:
    a = a + i.encode('cp1251')
print a

Upvotes: 1

Views: 351

Answers (3)

Charles Duffy

Reputation: 295815

Simply remove the [1] to stop filtering, and your return value will be a list, which you can pass to ''.join() to concatenate (or '\n'.join() if you want newlines between each string).

text_sections = xmldata.xpath('//p[@class="MsoNormal"]//text()')
print u'\n'.join(text_sections).encode('cp1251')
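
Joining every text node this way also merges fragments that belong to different paragraphs. To keep each paragraph separate (the second question), one option is to select the p elements first and join the text inside each one. Below is a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages or network access, SAMPLE is made-up markup standing in for the real page, and paragraph_texts is a hypothetical helper name. With lxml the same loop would iterate over xmldata.xpath('//p[@class="MsoNormal"]') and join p.xpath('.//text()') for each element.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup standing in for the real page (an assumption).
SAMPLE = (
    '<html><body>'
    '<p class="MsoNormal">First <b>bold</b> paragraph</p>'
    '<p class="other">skipped</p>'
    '<p class="MsoNormal">Second paragraph</p>'
    '</body></html>'
)


def paragraph_texts(root, css_class="MsoNormal"):
    """Return one string per matching <p>, keeping paragraphs separate."""
    texts = []
    for p in root.iterfind(".//p[@class='%s']" % css_class):
        # itertext() yields every text fragment inside this <p>, in order
        texts.append(u"".join(p.itertext()))
    return texts


root = ET.fromstring(SAMPLE)
paragraphs = paragraph_texts(root)
```

Each entry of paragraphs is then the text of one paragraph, so the entries can be printed one by one in a for loop or joined with a separator of your choice.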

Upvotes: 2

paul trmbrth

Reputation: 20748

You can use the lxml.html.parse() function, which accepts file-like objects such as the one urllib.urlopen() returns. See the lxml documentation on that.

Then, as @CharlesDuffy suggests, you can use u'\n'.join() to concatenate all text nodes within the p elements you select, separated by newlines (\n).

Also, I would suggest working with unicode strings throughout and encoding only when you need to print or write to a file.

import urllib
import lxml.html

page = urllib.urlopen('http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=1&Itemid=2')

# use "page" as a file-like object
xmldata = lxml.html.parse(page).getroot()

ptexts = xmldata.xpath('//p[@class="MsoNormal"]//text()')
joined_text = u'\n'.join(ptexts)

print joined_text.encode('cp1251')
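
For the "write to file" case, a minimal sketch of the same idea: the text stays unicode and the encoding is applied only where the file is opened. The helper name write_text, the output file name, and the sample string are assumptions made for illustration; io.open behaves the same way on Python 2 and Python 3.

```python
import io
import os
import tempfile


def write_text(path, text, encoding="cp1251"):
    """Encode only at the output boundary; text stays unicode until here."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Sample unicode text standing in for the joined paragraphs (an assumption).
joined_text = u"\u041a\u0438\u0457\u0432\n\u041b\u044c\u0432\u0456\u0432"

# Write into a fresh temporary directory so nothing existing is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "paragraphs.txt")
write_text(out_path, joined_text)

# Read the raw bytes back to check what actually landed on disk.
with io.open(out_path, "rb") as handle:
    raw = handle.read()
```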

Upvotes: 1

sukhmel

Reputation: 1492

Without knowing any background, I can only suggest this:

texts = []
index = 0
while True:
    index += 1
    temp = xmldata.xpath('//p[@class="MsoNormal"][%i]//text()' % index)
    # xpath() returns an empty list rather than raising when nothing matches
    if not temp:
        break
    texts.append(temp)

After this block runs, texts holds one list per index, each containing the same kind of elements as your text1.

Upvotes: 0
