Extract Link URL After Specified Element with Python and Beautifulsoup4

Question

I'm trying to extract a link from a page with Python and the BeautifulSoup library, but I'm stuck. The link is on the following page, in the sidebar area, directly underneath the h4 subtitle "Original Source":

http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php

I've managed to isolate the link (mostly), but I'm unsure how to narrow my targeting further to actually extract the URL. Here's my code so far:

import requests
from bs4 import BeautifulSoup

url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')

source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]

print(source_url)

I am currently getting the full HTML of the last element I've isolated, when I'm only trying to get the link's URL. Of note, this is the only link on the page I'm trying to get.

dot.Py · Accepted Answer

You almost got it!!

SOLUTION 1:

You just have to read the .text attribute of the tag you've assigned to source_url. It returns the link's visible text, which on this page happens to be the URL itself.

So instead of:

print(source_url)

You should use:

print(source_url.text)

Output:

http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
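
A small sketch of why .text only works here by coincidence. The markup below is made up for illustration (it is not the EurekAlert page), and html.parser is used so no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link's text is its URL, the second's is a label.
html = """
<div class="widget-content">
  <a href="http://example.com/a">http://example.com/a</a>
  <a href="http://example.com/b">Read more</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
url_as_text, label_as_text = soup.find_all("a")

print(url_as_text.text)           # visible text, which here equals the URL
print(label_as_text.text)         # visible text is a label, not the URL
print(label_as_text.get("href"))  # the href attribute holds the actual target
```

If the site ever changes the link text to a label, .text would stop returning the URL, while the href attribute would still hold it.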

SOLUTION 2:

Call source_url.get('href') to get the href attribute of the tag you selected with findAll. This returns the link target regardless of what the link text says.

print(source_url.get('href'))

Output:

http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
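
Since the question is about the link that follows a specific h4, another option is to anchor on the heading itself instead of taking the last link in the widget. This is only a sketch: the HTML is a simplified guess at the sidebar structure, and link_after_heading is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Simplified, assumed sidebar markup (not copied from the real page).
html = """
<section class="widget hidden-print">
  <div class="widget-content">
    <h4>Media Contact</h4>
    <p><a href="mailto:someone@example.com">someone@example.com</a></p>
    <h4>Original Source</h4>
    <p><a href="http://example.com/original-story">http://example.com/original-story</a></p>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")


def link_after_heading(soup, heading_text):
    """Return the href of the first link after the h4 containing heading_text."""
    # Match the h4 whose text contains the wanted subtitle.
    heading = soup.find("h4", string=lambda s: s and heading_text in s)
    if heading is None:
        return None
    # find_next walks forward in document order from the heading.
    link = heading.find_next("a", href=True)
    return link["href"] if link else None


print(link_after_heading(soup, "Original Source"))
```

Anchoring on the heading text should keep working if more links are later added to the widget, whereas findAll('a')[-1] depends on the target being the last link.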
