Arup Rakshit

Reputation: 118271

Couldn't extract the href values from the <a> tags using BS4

I am using BS4 to scrape a web page and have the HTML below:

<a style="display:inline; position:relative;" href="

                                      /aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz">
                                Screenshot.docx                      </a>

How can I get the value of the href using BS4? I couldn't work it out. Can you help?

Thanks,

Upvotes: 1

Views: 990

Answers (2)

root

Reputation: 80346

Doesn't this do the trick?

# href=True keeps only the <a> tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    print(a['href'])

If you need to, you can also filter on attributes in find_all:

soup.find_all("a", {"style": "display:inline; position:relative;"})

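To strip whitespace and make the link absolute:

# Python 3; in Python 2 this was urlparse.urljoin
from urllib.parse import urljoin
# url is the address of the page being scraped
urljoin(url, a['href'].strip())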
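
As a minimal sketch of that strip-and-join step on its own, assuming Python 3 (where urljoin lives in urllib.parse). The page_url below is a made-up placeholder, since the question doesn't say where the page came from, and raw_href is the href value from the question with its leading whitespace shortened:

```python
from urllib.parse import urljoin

# Hypothetical URL of the page being scraped (not given in the question).
page_url = "http://example.com/some/page.html"

# The href value from the question, with its leading newlines and spaces.
raw_href = "\n\n    /aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz"

# Strip the surrounding whitespace first, then resolve against the page URL.
absolute = urljoin(page_url, raw_href.strip())
print(absolute)
# http://example.com/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz

# A path-relative href resolves against the page's directory instead.
sibling = urljoin(page_url, "other.html")
print(sibling)
# http://example.com/some/other.html
```

The advantage over plain string concatenation is that urljoin copes with both root-relative hrefs (starting with /) and path-relative ones.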

Upvotes: 3

Fredrick Brennan

Reputation: 7357

for a in soup.find_all('a', {"style": "display:inline; position:relative;"}, href=True):
    # Strip the surrounding whitespace, then prepend the site root
    href = a['href'].strip()
    href = "http://example.com" + href
    print(href)

'http://example.com/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz'

The built-in string method strip() is very helpful here. :)
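
To see why strip() matters, here is a small self-contained sketch that parses the markup from the question and compares the raw attribute value with the stripped one. It assumes bs4 is installed and uses the standard library's "html.parser" backend; other parsers may behave slightly differently.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; the href value begins with blank lines and spaces.
html = """<a style="display:inline; position:relative;" href="

                                      /aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz">
                                Screenshot.docx                      </a>"""

soup = BeautifulSoup(html, "html.parser")

# href=True skips any <a> tag that has no href attribute.
link = soup.find("a", href=True)

raw_href = link["href"]        # attribute value as it appears in the markup
clean_href = raw_href.strip()  # leading/trailing whitespace removed

print(repr(raw_href))
print(clean_href)
```

The repr() of the raw value shows the whitespace that comes along with the attribute, while the stripped value starts directly at /aems/.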

Upvotes: 1
