Data1234

Reputation: 97

Link extraction from website

I am trying to extract some data from WebMD, but when I run my code I keep getting "None" for every link. Any idea what I am doing wrong? The number of values printed matches the number of links, but I do not get the links themselves.

import bs4 as bs
import urllib.request
import pandas as pd


source = urllib.request.urlopen('https://messageboards.webmd.com/').read()

soup = bs.BeautifulSoup(source,'lxml')

for url in soup.find_all('div',class_="link"):
    print (url.get('href'))

Upvotes: 1

Views: 71

Answers (2)

evsheino

Reputation: 2277

soup.find_all('div',class_="link") returns all div elements with the class link. A div has no href attribute of its own, so calling get('href') on it returns None. These elements wrap the a elements that hold the href attributes, so you need to read the href from the a child instead:

for div in soup.find_all('div',class_="link"):
    print (div.a.get('href'))
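
If some div.link elements might not contain an a child, div.a is None and the loop above would raise an AttributeError. Here is a minimal sketch of a guarded version. It runs against a small inline HTML sample rather than the live page (the sample markup is an assumption, not WebMD's actual source) and uses the built-in 'html.parser' instead of 'lxml' so that it needs no extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second div deliberately has no <a> child.
html = """
<div class="link"><a href="https://messageboards.webmd.com/family-pregnancy/f/relationships/">Relationships</a></div>
<div class="link">No anchor here</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for div in soup.find_all("div", class_="link"):
    # div.get("href") would be None here: the href lives on the <a>, not the <div>.
    anchor = div.a  # first <a> descendant, or None if there is none
    if anchor is not None and anchor.get("href"):
        links.append(anchor.get("href"))

print(links)
```

The second div is skipped instead of crashing the loop, so only the one real link is collected.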

Upvotes: 0

brianpck

Reputation: 8254

Your url element is actually a div tag, not an a:

>>> x = soup.find_all('div', class_="link")
>>> x[0]
<div class="link"><a href="https://messageboards.webmd.com/family-pregnancy/f/relationships/">Relationships</a></div>

You need to select the child before getting the href attribute:

>>> x[0].a.get('href')
'https://messageboards.webmd.com/family-pregnancy/f/relationships/'

Just modify your for loop as follows:

for url in soup.find_all('div',class_="link"):
    print (url.a.get('href'))
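
As an alternative, a CSS selector can pick the a elements directly, which avoids stepping through the div first. This is only a sketch on a hypothetical inline sample (the example.com URL and the third div are made up for illustration), using BeautifulSoup's select method and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the first URL comes from the page above.
html = """
<div class="link"><a href="https://messageboards.webmd.com/family-pregnancy/f/relationships/">Relationships</a></div>
<div class="link"><a href="https://example.com/forum/">Example forum</a></div>
<div class="link"><a>Anchor without href</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.link > a[href]" matches <a> tags that are direct children of a
# div with class "link" and that actually carry an href attribute.
hrefs = [a["href"] for a in soup.select("div.link > a[href]")]

for href in hrefs:
    print(href)
```

Because the selector requires the href attribute, anchors without one are filtered out before the loop runs.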

Upvotes: 1
