Yashit Garg
Yashit Garg

Reputation: 13

Extracting links from XML file using python

I have a sitemap XML file and I want to run a script which extracts all the urls and print it. I have tried re.findall(r'(https?://\S+)', url)

but this prints the closing tags like : "https://www.tutorialspoint.com/python/python_reg_expressions.htm /liv"

I don't want to print the suffix ' /liv ' how do I implement this using regex ?

Upvotes: 1

Views: 279

Answers (1)

wizzwizz4
wizzwizz4

Reputation: 6426

Are all of the URLs wrapped in quotation marks or surrounded by spaces? If so, you could do something like:

re.findall(r'(?P<quote>.)(https?://\S+?)(?P=quote)', url)

If you're getting the string representation of everything matched, instead of just the second group, you'll have to trim it with ...[1:-1].

Upvotes: 1

Related Questions