Reputation: 13
I have a sitemap XML file and I want to run a script which extracts all the urls and print it. I have tried re.findall(r'(https?://\S+)', url)
but this prints the closing tags like : "https://www.tutorialspoint.com/python/python_reg_expressions.htm /liv"
I don't want to print the suffix ' /liv ' how do I implement this using regex ?
Upvotes: 1
Views: 279
Reputation: 6426
Are all of the URLs wrapped in quotation marks or surrounded by spaces? If so, you could do something like:
re.findall(r'(?P<quote>.)(https?://\S+?)(?P=quote)', url)
If you're getting the string representation of everything matched, instead of just the second group, you'll have to trim it with ...[1:-1]
.
Upvotes: 1