Reputation: 107
I'm new to lxml parsing and can't get past a simple parsing issue. There is a line in my XML that looks like:
The IgM BCR is essential for survival of peripheral B cells [<xref ref-type="bibr" rid="CR34">34</xref>]. In the absence of BTK B cell...
So, when I execute the following code:
from lxml import etree
import lxml.html

e = open('somexml.xml', encoding='utf8')
tree = etree.parse(e)
titles = tree.xpath('/pmc-articleset/article/front/article-meta/title-group/article-title')
for node in titles:
    text = tree.xpath('/pmc-articleset/article/body/sec/p')
    for node in text:
        content = str(node.text).encode("utf-8")
        s = str(' '.join(lxml.html.fromstring(content).xpath("//text()")).encode('latin1'))
        print (s)
the result looks like:
The IgM BCR is essential for survival of peripheral B cells ['
Even if I just print node.text without any "join" commands, the result looks similar.
How can I skip the square brackets part and receive the full string? Any help will be appreciated!
Upvotes: 2
Views: 553
Reputation: 50957
The string "]. In the absence of BTK B cell..." is the value of the tail property of the <xref> element. See http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html.
There is nothing special about the square brackets; they are just characters.
With itertext() you can get the text content of an element and its descendants. The tail content of the descendants is included by default. See http://lxml.de/api/lxml.etree._Element-class.html#itertext.
Small demo:
from lxml import etree
xml = "<p>TEXT <xref>34</xref>TAIL</p>"
p = etree.fromstring(xml)
print(p.text)
print(''.join(p.itertext()))
print(p.text + p.find("xref").tail)
Output:
TEXT
TEXT 34TAIL
TEXT TAIL
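Applied to the sentence from the question, the same idea might look like the sketch below. The XML string is a hand-built stand-in for one <p> element of the real file, and the variable names are only illustrative.

```python
from lxml import etree

# Stand-in for one <p> element from the question's file (an assumption,
# not the actual document).
xml = (
    '<p>The IgM BCR is essential for survival of peripheral B cells '
    '[<xref ref-type="bibr" rid="CR34">34</xref>]. '
    'In the absence of BTK B cell...</p>'
)
p = etree.fromstring(xml)
xref = p.find("xref")

# p.text stops at the first child element, which is why the question's
# output ends with "[".
before = p.text

# The rest of the sentence is stored as the tail of <xref>.
after = xref.tail

# itertext() walks text and tail values in document order.
full = "".join(p.itertext())

# Skipping xref.text drops the citation number but keeps the brackets.
without_number = before + after

print(full)
print(without_number)
```

Here before ends with "[" and after starts with "]", so the brackets are ordinary characters split across the text and tail properties.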
Upvotes: 3
Reputation: 4132
Try something along these lines:
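from lxml import etree

e = open('somexml.xml', encoding='utf8')
tree = etree.parse(e)
titles = tree.xpath('/pmc-articleset/article/front/article-meta/title-group/article-title')
for title in titles:
    # The path is absolute, so it is evaluated from the document root.
    ps = title.xpath('/pmc-articleset/article/body/sec/p')
    for p in ps:
        # itertext() includes the text after nested elements such as <xref>.
        text = ''.join(p.itertext())
        print(text)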
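For trying this without the original file, here is a self-contained sketch. SAMPLE is a made-up document that only mimics the element path from the question, and paragraph_texts is a hypothetical helper name, not part of lxml.

```python
import io
from lxml import etree

# Made-up document that mimics the structure in the question (an assumption).
SAMPLE = b"""<pmc-articleset>
  <article>
    <body>
      <sec>
        <p>First [<xref ref-type="bibr" rid="CR1">1</xref>]. Second sentence.</p>
        <p>No citations here.</p>
      </sec>
    </body>
  </article>
</pmc-articleset>"""


def paragraph_texts(source):
    """Return the full text of every body paragraph, nested elements included."""
    tree = etree.parse(source)
    paragraphs = tree.xpath('/pmc-articleset/article/body/sec/p')
    return [''.join(p.itertext()) for p in paragraphs]


texts = paragraph_texts(io.BytesIO(SAMPLE))
for t in texts:
    print(t)
```

With a real file, passing the file name or an open file object to paragraph_texts should work the same way, since etree.parse accepts both.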
Upvotes: 0