Reputation: 118271
I am using BS4 for web page scraping and have the HTML below:
<a style="display:inline; position:relative;" href="
/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz">
Screenshot.docx </a>
How can I get the value of the href using BS4? I couldn't get it to work. Can you help?
Thanks,
Upvotes: 1
Views: 990
Reputation: 80346
Doesn't this do the trick?
for a in soup.find_all('a', href=True):
    print(a['href'])
If you need to, you can pass attrs to find_all (the element in the question is an a tag, so match on that):
soup.find_all("a", {"style": "display:inline; position:relative;"})
To strip whitespace and make the link absolute, where url is the address of the page you scraped:
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
urljoin(url, a['href'].strip())
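Putting the pieces above together, a minimal end-to-end sketch might look like the following. It assumes Python 3 with the bs4 package installed and uses the standard library's html.parser backend; base_url is a hypothetical page address chosen only for illustration, not something given in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The anchor from the question; note the newline right after href=".
html = '''<a style="display:inline; position:relative;" href="
/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz">
Screenshot.docx </a>'''

# Hypothetical page URL (an assumption); use the URL the HTML was fetched from.
base_url = "http://example.com/aems/index.do"

soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a", href=True):
    # strip() drops the leading newline, urljoin resolves against the page URL.
    links.append(urljoin(base_url, a["href"].strip()))

for link in links:
    print(link)
```

Because the href starts with a slash, urljoin keeps only the scheme and host of base_url and replaces the path.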
Upvotes: 3
Reputation: 7357
for a in soup.find_all('a', {"style": "display:inline; position:relative;"}, href=True):
    href = a['href'].strip()
    href = "http://example.com" + href
    print(href)
'http://example.com/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz'
The string method strip() is very helpful here, because the href value in your HTML starts with a newline. :)
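To see why the strip() call matters, here is a small standard-library sketch. The raw_href value is the attribute from the question, shortened for readability (the cs parameter is left out), and http://example.com is the same placeholder host used above.

```python
# Raw attribute value as it appears in the question's HTML (shortened):
# a newline sits between the opening quote and the path.
raw_href = "\n/aems/file/filegetrevision.do?fileEntityId=8120070"

# Without strip(), the newline ends up in the middle of the URL.
broken = "http://example.com" + raw_href

# strip() with no arguments removes leading and trailing whitespace,
# including newlines, so the concatenation yields a clean URL.
clean = "http://example.com" + raw_href.strip()

print(repr(broken))
print(repr(clean))
```

Printing with repr() makes the stray newline visible as \n instead of an actual line break.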
Upvotes: 1