alphazwest

Reputation: 4450

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with Python and the BeautifulSoup library, but I'm stuck. The link is on the following page, in the sidebar area, directly underneath the h4 subtitle "Original Source":

http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php

I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:

import requests
from bs4 import BeautifulSoup

url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')

source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]

print(source_url)

I'm currently getting the full HTML of the last element I've isolated, when all I want is the link URL itself. Of note, this is the only link on the page I'm trying to get.

Upvotes: 1

Views: 1643

Answers (2)

dot.Py

Reputation: 5157

You almost got it!!


SOLUTION 1:

You just have to read the .text attribute of the tag you've assigned to source_url. This returns the link's visible text, which on this page happens to be the URL itself.

So instead of:

print(source_url)

You should use:

print(source_url.text)

Output:

http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense


SOLUTION 2:

You can call source_url.get('href') to get only the value of the href attribute of the tag you selected with findAll. This works regardless of what the link's visible text is.

print(source_url.get('href'))

Output:

http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
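The two solutions only print the same thing here because the link's visible text is also its URL. As a minimal sketch of the difference, the snippet below parses a small stand-in document (the markup, example.com URL, and link text are made up for illustration and do not reflect the real page) where the text and the href differ. It uses the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page's HTML may differ.
html = """
<section class="widget hidden-print">
  <div class="widget-content">
    <h4>Original Source</h4>
    <a href="http://example.com/full-article">Read the article</a>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# Same selection chain as in the question, applied to the stand-in markup.
link = (
    soup.find("section", class_="widget hidden-print")
    .find("div", class_="widget-content")
    .find_all("a")[-1]
)

link_text = link.text         # visible text between <a> and </a>
link_href = link.get("href")  # value of the href attribute, or None if absent

print(link_text)
print(link_href)
```

If you need the URL, .get('href') is the safer choice, since it does not depend on how the link is labelled.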

Upvotes: 0

Guillaume Thomas

Reputation: 2310

You're looking for the link target, which is stored in the tag's href HTML attribute. source_url is a bs4.element.Tag, which has a get method for reading attributes:

source_url.get('href')
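Since the question asks for the link directly underneath a particular h4, another option is to locate the heading by its text and take the first link that follows it, rather than relying on the link being last in its container. This is only a sketch: the helper name href_after_heading and the sample markup are hypothetical, and the real page's structure may need a different match.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page's HTML may differ.
html = """
<div class="widget-content">
  <a href="http://example.com/other">Unrelated link</a>
  <h4>Original Source</h4>
  <p><a href="http://example.com/source">http://example.com/source</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def href_after_heading(soup, heading_text):
    """Return the href of the first <a> after the <h4> containing heading_text."""
    # Match an <h4> whose single text child contains the given string.
    heading = soup.find("h4", string=lambda s: s and heading_text in s)
    if heading is None:
        return None
    # find_next searches forward in document order from the heading.
    link = heading.find_next("a")
    return link.get("href") if link else None


source_href = href_after_heading(soup, "Original Source")
missing_href = href_after_heading(soup, "No Such Heading")

print(source_href)
print(missing_href)
```

Returning None when the heading or link is missing avoids an AttributeError if the page layout changes.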

Upvotes: 1
