user2483201

Reputation:

Grab url from href and text

I tried using regex, but after reading around I was directed to Beautiful Soup.

I've more or less figured out how to get URLs from HTML tags with Beautiful Soup, but how would I grab URLs from both the HTML tags (href=*) and the body text of the page?

Also, for the URLs in tags, how do I specify that I only want ones starting with http:// or https://?

Thanks in advance!

Upvotes: 1

Views: 346

Answers (1)

fortran

Reputation: 80

First, look at parsing-html-in-python-lxml-or-beautifulsoup. After reading it I never tried Beautiful Soup, probably because I find lxml so easy. There are surely other ways to do what you asked, and some may be easier, but I'll show what I use.

In lxml you can use XPath, which is like regex for XML/HTML. The code below finds all "a" tags that have an "href" attribute and prints every link that starts with http:// or https://. This should help you get started on your parsing.

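from lxml import etree

# Parse the file with lxml's forgiving HTML parser.
tree = etree.parse("my.html", etree.HTMLParser())
root = tree.getroot()
# Find every <a> element, at any depth, that has an href attribute.
links = root.findall('.//a[@href]')
for link in links:
    href = link.get("href")
    # Keep only absolute http:// and https:// URLs.
    if href.startswith(("http://", "https://")):
        print(href)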
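
For the other half of the question (URLs that appear in the body text as well as in href attributes), here is a rough sketch. To keep it runnable without installing anything, it uses the standard library's html.parser rather than lxml, so it is a swapped-in technique and not the lxml approach above. The names UrlCollector, collect_urls and URL_RE are made up for illustration, and the regex is a deliberately simple assumption about what counts as a URL (it will also pick up trailing punctuation such as a full stop directly after a link).

```python
import re
from html.parser import HTMLParser

# Assumption: a URL is "http://" or "https://" followed by everything up to
# the next whitespace, quote, or angle bracket. Adjust to your needs.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


class UrlCollector(HTMLParser):
    """Collects http(s) URLs from href attributes and from text nodes."""

    def __init__(self):
        super().__init__()
        self.href_urls = []
        self.text_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.href_urls.append(value)

    def handle_data(self, data):
        # Called for text between tags; pull out anything URL-shaped.
        self.text_urls.extend(URL_RE.findall(data))


def collect_urls(html):
    """Return (urls found in href attributes, urls found in text)."""
    parser = UrlCollector()
    parser.feed(html)
    parser.close()
    return parser.href_urls, parser.text_urls


sample = (
    '<html><body><a href="http://example.com/a">first</a> '
    '<a href="/relative">skip</a> '
    '<p>See https://example.org/b for details.</p></body></html>'
)
href_urls, text_urls = collect_urls(sample)
print(href_urls)
print(text_urls)
```

The relative link is dropped by the startswith check, which is the same filter used on the href values in the lxml code. Keeping the two lists separate makes it easy to merge or de-duplicate them afterwards, for example with set(href_urls) | set(text_urls).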

Upvotes: 1
