Reputation: 808
I'm using lxml XPath to parse the following XML file:
<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>
      https://www.reuters.com/article/us-campbellsoup-thirdpoint/campbell-soup-nears-deal-with-third-point-to-end-board-challenge-sources-idUSKCN1NU11I
    </loc>
    <image:image>
      <image:loc>
        https://www.reuters.com/resources/r/?m=02&d=20181126&t=2&i=1328589868&w=&fh=&fw=&ll=460&pl=300&r=LYNXNPEEAO0WM
      </image:loc>
    </image:image>
    <news:news>
      <news:publication>
        <news:name>Reuters</news:name>
        <news:language>eng</news:language>
      </news:publication>
      <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
      <news:title>
        Campbell Soup nears deal with Third Point to end board challenge: sources
      </news:title>
      <news:keywords>Headlines,Business, Industry</news:keywords>
      <news:stock_tickers>NYSE:CPB</news:stock_tickers>
    </news:news>
  </url>
</urlset>
Python code sample:
import lxml.etree
import lxml.html
import requests

def main():
    r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
    namespace = "http://www.google.com/schemas/sitemap-news/0.9"
    root = lxml.etree.fromstring(r.content)

    records = root.xpath('//news:title', namespaces={"news": "http://www.google.com/schemas/sitemap-news/0.9"})
    for record in records:
        print(record.text)

    records = root.xpath('//sitemap:loc', namespaces={"sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"})
    for record in records:
        print(record.text)

if __name__ == "__main__":
    main()
Currently, I'm using XPath to get all URLs and titles, but this is not what I want because I don't know which URL belongs to which title. My question is how to get each <url>, then loop over each <url> as an item to get the corresponding <loc>, <news:keywords>, etc. Thanks!
Edit: Expected output
foreach <url>
    get <loc>
    get <news:publication_date>
    get <news:title>
Upvotes: 2
Views: 2083
Reputation: 479
The answer is
from datetime import datetime
from html import unescape

from lxml import etree
import requests

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = etree.fromstring(r.content)

# Map prefixes to the namespace URIs declared in the sitemap
ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

# Every lookup inside the loop is relative to the current <url>,
# so loc, title and date always belong to the same entry
for url in root.iterfind("sitemap:url", namespaces=ns):
    loc = url.findtext("sitemap:loc", namespaces=ns)
    print(loc)
    title = unescape(url.findtext("news:news/news:title", namespaces=ns))
    print(title)
    date = unescape(url.findtext("news:news/news:publication_date", namespaces=ns))
    # The format string expects a literal +00:00 (UTC) offset
    date = datetime.strptime(date, '%Y-%m-%dT%H:%M:%S+00:00')
    print(date)
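To try the loop without network access, here is a minimal sketch that runs the same per-<url> extraction against an inline sample modelled on the sitemap in the question. The `SAMPLE` string and the `extract_records` helper are illustrative names, not part of the answer above. The sketch also makes two assumptions of its own: a missing element should produce `None` rather than an exception, and `datetime.fromisoformat` (Python 3.7+) is an acceptable replacement for the hard-coded `+00:00` format string.

```python
from datetime import datetime
from html import unescape

from lxml import etree

NS = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
}

# Hypothetical two-entry sample shaped like the sitemap in the question.
# The second entry has no <news:news> block at all.
SAMPLE = b"""<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc> https://example.com/a </loc>
    <news:news>
      <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
      <news:title> First &amp;amp; foremost </news:title>
    </news:news>
  </url>
  <url>
    <loc>https://example.com/b</loc>
  </url>
</urlset>"""


def extract_records(root):
    """Return one dict per <url>, keeping loc, title and date together."""
    records = []
    for url in root.iterfind("sitemap:url", namespaces=NS):
        loc = url.findtext("sitemap:loc", namespaces=NS)
        title = url.findtext("news:news/news:title", namespaces=NS)
        date = url.findtext("news:news/news:publication_date", namespaces=NS)
        records.append({
            # findtext returns None for a missing element, so guard each field
            "loc": loc.strip() if loc else None,
            "title": unescape(title).strip() if title else None,
            "date": datetime.fromisoformat(date.strip()) if date else None,
        })
    return records


records = extract_records(etree.fromstring(SAMPLE))
for record in records:
    print(record)
```

Unlike `strptime` with a literal offset, `fromisoformat` returns a timezone-aware `datetime`, which may or may not be what downstream code expects.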
The rules of thumb are:
1. Try not to use xpath. Use find, findall, or iterfind instead. xpath implements a more complex algorithm than find, findall, or iterfind, so it takes more time and resources.
2. Use iterfind instead of findall. iterfind yields the items one at a time instead of building the whole list first, so it uses less memory.
3. Use findtext if all you need is the text.

A more general rule is to read the official documentation.
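As a rough illustration of how the four methods named in these rules differ in what they return, here is a small sketch on an inline document. The sample XML, the `s` prefix, and the variable names are made up for the example.

```python
from lxml import etree

NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample: three <url> entries, the last one without a <loc>.
root = etree.fromstring(
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/1</loc></url>"
    b"<url><loc>https://example.com/2</loc></url>"
    b"<url/>"
    b"</urlset>"
)

# findall builds the complete list of matches up front.
all_urls = root.findall("s:url", namespaces=NS)

# iterfind returns a lazy iterator; matches are produced on demand.
url_iter = root.iterfind("s:url", namespaces=NS)
first_url = next(url_iter)

# find returns the first matching element, or None when nothing matches.
missing = all_urls[2].find("s:loc", namespaces=NS)

# findtext returns the text of the first match directly; it returns None
# for a missing element unless a default is supplied.
first_loc = first_url.findtext("s:loc", namespaces=NS)
fallback = all_urls[2].findtext("s:loc", default="", namespaces=NS)

print(len(all_urls), first_loc, missing, repr(fallback))
```

Because `find` can return `None`, chaining `.text` onto it fails for missing elements, which is one more reason to prefer `findtext` when only the text is needed.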
First, let's create three for-loop functions and compare them.
def for1():
    for url in root.iterfind("sitemap:url", namespaces=ns):
        pass

def for2():
    for url in root.findall("sitemap:url", namespaces=ns):
        pass

def for3():
    for url in root.xpath("sitemap:url", namespaces=ns):
        pass
| function | time |
|---|---|
| root.iterfind | 70.5 µs ± 543 ns |
| root.findall | 72.3 µs ± 839 ns |
| root.xpath | 84.8 µs ± 567 ns |
We can see that iterfind is the fastest, as expected, although its margin over findall is small (70.5 µs vs. 72.3 µs).
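The timings above look like the output of IPython's `%timeit`, though that is an assumption. Outside IPython, a comparison of this kind could be sketched with the standard library's `timeit` module. The 100-entry inline document and the `best_time_us` helper are hypothetical stand-ins, so the absolute numbers will not match the table.

```python
import timeit

from lxml import etree

NS = {"sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical stand-in for the downloaded sitemap: 100 empty <url> entries.
root = etree.fromstring(
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    + b"<url/>" * 100
    + b"</urlset>"
)


def for1():
    for url in root.iterfind("sitemap:url", namespaces=NS):
        pass


def for2():
    for url in root.findall("sitemap:url", namespaces=NS):
        pass


def for3():
    for url in root.xpath("sitemap:url", namespaces=NS):
        pass


def best_time_us(func, number=200, repeat=5):
    """Best average time per call, in microseconds, over several repeats."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1e6


timings = {f.__name__: best_time_us(f) for f in (for1, for2, for3)}
for name, micros in timings.items():
    print(f"{name}: {micros:.1f} µs per loop")
```

Taking the minimum over several repeats reduces the influence of background load; results still vary by machine, lxml version, and document size.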
Next, let's check the statements inside the for loop.
| statement | time |
|---|---|
| url.xpath('string(news:news/news:title)', namespaces=ns) | 15.7 µs ± 112 ns |
| url_item.xpath('news:news/news:title', namespaces=ns)[0].text | 14.4 µs ± 53.7 ns |
| url_item.find('news:news/news:title', namespaces=ns).text | 3.74 µs ± 60 ns |
| url_item.findtext('news:news/news:title', namespaces=ns) | 3.71 µs ± 40.3 ns |
From the above table, we can see that find and findtext are about 4 times faster than xpath. findtext is marginally faster than find, although the difference (3.71 µs vs. 3.74 µs) is within the measurement error.
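Speed aside, it can help to confirm that the four statements in the table return the same text when the element exists, and to see how they differ when it does not. The following is a minimal sketch on a made-up single-entry document; `results` and `empty` are illustrative names.

```python
from lxml import etree

NS = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
}

# Hypothetical single-entry sample.
root = etree.fromstring(
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
    b' xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">'
    b"<url><news:news><news:title>Sample title</news:title></news:news></url>"
    b"</urlset>"
)
url = root.find("sitemap:url", namespaces=NS)
path = "news:news/news:title"

# The four variants from the table, applied to the same <url> element.
results = [
    url.xpath(f"string({path})", namespaces=NS),
    url.xpath(path, namespaces=NS)[0].text,
    url.find(path, namespaces=NS).text,
    url.findtext(path, namespaces=NS),
]
print(results)

# The variants behave differently when the element is missing.
empty = etree.fromstring(b"<url/>")
print(repr(empty.xpath(f"string({path})", namespaces=NS)))  # empty string
print(empty.findtext(path, namespaces=NS))                  # None
```

For a missing element, the `[0].text` and `.find(...).text` variants raise `IndexError` and `AttributeError` respectively, so they need a guard that `findtext` and `string(...)` do not.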
This answer takes only 3.41 ms ± 53 µs, compared to 8.33 ms ± 52.4 µs for Tomalak's answer.
Upvotes: 0
Reputation: 338108
Use relative XPath to get from each title to its associated URL:
import lxml.etree
import requests

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = lxml.etree.fromstring(r.content)

for title in root.xpath('//news:title', namespaces=ns):
    print(title.text)
    # Walk up from the title to its enclosing <url>, then down to its <loc>
    loc = title.xpath('ancestor::sitemap:url/sitemap:loc', namespaces=ns)
    print(loc[0].text)
Exercise: Rewrite this to get from the URL to the associated title instead.
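One possible way to approach the exercise, sketched against an inline sample rather than the live sitemap: select each <url> first, then evaluate relative XPath expressions from it. The `SAMPLE` document and the `pairs` list are made up for illustration, and `normalize-space()` is used here only as one convenient way to trim the surrounding whitespace.

```python
from lxml import etree

NS = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
}

# Hypothetical two-entry sample shaped like the sitemap in the question.
SAMPLE = b"""<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>
      https://example.com/a
    </loc>
    <news:news><news:title> Title A </news:title></news:news>
  </url>
  <url>
    <loc>https://example.com/b</loc>
    <news:news><news:title>Title B</news:title></news:news>
  </url>
</urlset>"""

root = etree.fromstring(SAMPLE)

pairs = []
for url in root.xpath("//sitemap:url", namespaces=NS):
    # Paths without a leading slash are evaluated relative to this <url>,
    # so each title is looked up inside the entry that owns the loc.
    loc = url.xpath("normalize-space(sitemap:loc)", namespaces=NS)
    title = url.xpath("normalize-space(news:news/news:title)", namespaces=NS)
    pairs.append((loc, title))
    print(loc, "->", title)
```

Going from the container down to its children avoids the `ancestor::` step and keeps every field of one entry inside a single loop iteration.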
Note: The titles (and potentially the URLs as well) seem to be HTML-escaped. Use the unescape() function from the html module (from html import unescape) to unescape them.
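A short sketch of what that looks like in practice. The `raw_title` string is a made-up example of text that still contains HTML entities after the XML parser has done its own decoding; it is not taken from the real feed.

```python
from html import unescape

# Hypothetical title text as it might come out of the XML parser, with
# HTML entities still present.
raw_title = "Q&amp;A: What&#39;s next for &quot;Big Soup&quot;?"
clean_title = unescape(raw_title)
print(clean_title)

# Text without entities passes through unchanged, so calling unescape
# on already-clean titles does no harm.
unchanged = unescape("Plain headline")
print(unchanged)
```

`unescape` expects a string, so a `None` result from a missing element needs to be handled before calling it.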
Upvotes: 2