I tried using regex, but after reading around I got directed to Beautiful Soup...
I've kinda figured out how to get URLs from HTML tags with Beautiful Soup, but how would I grab URLs from both the HTML tags (href=*) and the body text of the page?
Also, for the ones in tags, how do I specify that I only want URLs starting with http:// or https://?
Thanks in advance!
Upvotes: 1
Views: 346
Reputation: 80
First, take a look at parsing-html-in-python-lxml-or-beautifulsoup. After reading it I never looked at Beautiful Soup, I guess because I find lxml so easy. I'm sure there are other ways to do what you asked, perhaps easier ones, but I'll show what I use.
In lxml you can use XPath, which is a bit like regex for XML/HTML. The code below finds all "a" tags that have an "href" attribute and prints every link that starts with http. This should help you get started on your parsing.
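from lxml import etree

# Parse the file with lxml's lenient HTML parser
tree = etree.parse("my.html", etree.HTMLParser())
root = tree.getroot()
# Every <a> element that carries an href attribute
links = root.findall('.//a[@href]')
for link in links:
    href = link.get("href")
    # Keep only absolute http:// and https:// links
    if href.startswith(("http://", "https://")):
        print(href)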
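The lxml snippet only covers links inside tags. For the other half of the question (URLs sitting in the body text), here is a rough sketch that uses only the standard library's html.parser and re instead of lxml, so it runs without extra installs. The names `URL_RE`, `UrlCollector` and `extract_urls` are made up for this example, and the regex is a deliberately loose assumption about what counts as a URL, not a full URL grammar.

```python
import re
from html.parser import HTMLParser

# Assumed, simplified pattern: http:// or https:// followed by anything
# up to whitespace, quotes or angle brackets.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


class UrlCollector(HTMLParser):
    """Collect http(s) URLs from href attributes and from text nodes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.urls.append(value)

    def handle_data(self, data):
        # Text between tags: keep whatever looks like a URL.
        self.urls.extend(URL_RE.findall(data))


def extract_urls(html):
    """Return URLs in document order, from both tags and text."""
    parser = UrlCollector()
    parser.feed(html)
    parser.close()
    return parser.urls
```

Call it as `extract_urls(open("my.html").read())`. Note that the loose regex will also swallow trailing punctuation such as a full stop right after a URL in a sentence, and that text inside script or style elements gets scanned too, so you may want to post-filter the results.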
Upvotes: 1