jason
jason

Reputation: 2106

remove newline and whitespace parse XML with python Xpath

Here is the xml file http://www.diveintopython3.net/examples/feed.xml

My code is enter image description here

My result is enter image description here

My questions are

  1. how to remove the \n and the following white space in the text

  2. how to get the node whose text is "dive into mark", how to search the text syntax

Upvotes: 3

Views: 4256

Answers (1)

Padraic Cunningham
Padraic Cunningham

Reputation: 180391

Just call normalize-space(.) on each node.

import lxml.etree as et

xml = et.parse("feed.xml")
ns = {"ns": 'http://www.w3.org/2005/Atom'}
for n in xml.xpath("//ns:category", namespaces=ns):
    t  = n.xpath("./../ns:summary", namespaces=ns)[0]
    print(t.xpath("normalize-space(.)"))

Output:

Putting an entire chapter on one page sounds bloated, but consider this — my longest chapter so far would be 75 printed pages, and it loads in under 5 seconds… On dialup.
Putting an entire chapter on one page sounds bloated, but consider this — my longest chapter so far would be 75 printed pages, and it loads in under 5 seconds… On dialup.
Putting an entire chapter on one page sounds bloated, but consider this — my longest chapter so far would be 75 printed pages, and it loads in under 5 seconds… On dialup.
The accessibility orthodoxy does not permit people to question the value of features that are rarely useful and rarely used.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.

All your newlines have been removed and multiple spaces replaced with a single space.

Part two of your question is asking for the title tag as that is the only tag with the text you are looking for, but to specifically find the title with that exact text, that is simply:

xml.xpath("//ns:title[text()='dive into mark']", namespaces=ns)

If you wanted any node that contained the text, you would just replace ns:title with a wildcard:

xml.xpath("//*[text()='dive into mark']", namespaces=ns)

Upvotes: 2

Related Questions