String breaks on square bracket when parsed with lxml

Question

I'm new to lxml parsing and can't get past a simple parsing issue. There is a line in my XML that looks like:

The IgM BCR is essential for survival of peripheral B cells [34]. In the absence of BTK B cell...

So, when I execute the following code:

e = open('somexml.xml', encoding='utf8')

tree = etree.parse(e)

titles = tree.xpath('/pmc-articleset/article/front/article-meta/title-group/article-title')

for node in titles:
    text = tree.xpath('/pmc-articleset/article/body/sec/p')

    for node in text:
        content = str(node.text).encode("utf-8")
        s = str(' '.join(lxml.html.fromstring(content).xpath("//text()")).encode('latin1'))
        print (s)

the result looks like:

The IgM BCR is essential for survival of peripheral B cells ['

Even if I just print node.text without any "join" calls, the result looks similar.

How can I get past the square bracket and receive the full string? Any help will be appreciated!

Josh Voigts · Accepted Answer

Try something along these lines:

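from lxml import etree

e = open('somexml.xml', encoding='utf8')
tree = etree.parse(e)

titles = tree.xpath('/pmc-articleset/article/front/article-meta/title-group/article-title')

for title in titles:
    # This XPath is absolute, so it selects the same <p> elements for every title
    ps = title.xpath('/pmc-articleset/article/body/sec/p')

    for p in ps:
        # itertext() yields the element's own text plus the text and tails of all descendants
        text = ''.join(p.itertext())
        print(text)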
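
Why this helps: `.text` on an element holds only the text that comes before its first child element. The question shows only the rendered sentence, so the following is an assumption, but the citation number is most likely wrapped in a child element, in which case `.text` ends right after the opening bracket. Below is a minimal sketch of that behaviour. The `<xref>` tag name and the sample fragment are hypothetical, and the fallback import is there only so the snippet also runs without lxml installed (the standard library's ElementTree has the same `.text` and `itertext()` semantics):

```python
try:
    from lxml import etree
except ImportError:
    # Fallback for illustration only; same .text / itertext() behaviour
    import xml.etree.ElementTree as etree

# Hypothetical fragment: the <xref> wrapper around the citation number is an
# assumption, it is not shown in the question.
xml = (
    "<p>The IgM BCR is essential for survival of peripheral B cells "
    "[<xref>34</xref>]. In the absence of BTK B cell...</p>"
)
p = etree.fromstring(xml)

# .text stops at the first child element, so it ends with the opening bracket
text_only = p.text

# itertext() walks the element, its children, and their tails in document order
full_text = "".join(p.itertext())

print(repr(text_only))
print(repr(full_text))
```

If the real markup nests the citation differently, the same idea should still apply: anything after the first child lives in that child's `.tail`, which `itertext()` picks up and `.text` does not.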
