VillaRava

Reputation: 167

Follow links with BeautifulSoup4

I'm using Python to extract links from a page:

for link in soup.find_all('a', href=True):
    if 'http' in link['href']:
        links.append(link['href'])

How do I construct something that opens each link and extracts the text from, say, the "p" tags on the linked pages?

Upvotes: 0

Views: 1178

Answers (2)

Dušan Maďar

Reputation: 9909

You can use requests to fetch the HTML for the collected links and then parse each page with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

links = []

# get links
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        links.append(link['href'])

# visit links and print paragraph text
for link in links:
    response = requests.get(link)

    soup = BeautifulSoup(response.content, 'html.parser')

    for p in soup.find_all('p'):
        print(p.text)

Or, without iterating over the links twice:

import requests
from bs4 import BeautifulSoup

# get links, visit each one and print its paragraph text
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        response = requests.get(link['href'])

        page_soup = BeautifulSoup(response.content, 'html.parser')

        for p in page_soup.find_all('p'):
            print(p.text)
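As a side note, real-world links can be dead, slow, or return error pages, and a single unhandled exception would stop the loops above. A hedged sketch of the same idea with a timeout and basic error handling (the function names are my own, not from the answer):

```python
import requests
from bs4 import BeautifulSoup

def paragraph_texts(html):
    """Return the text of every <p> tag in an HTML document."""
    soup = BeautifulSoup(html, 'html.parser')
    return [p.get_text(strip=True) for p in soup.find_all('p')]

def fetch_paragraphs(url):
    """Fetch url and return its paragraph texts, or [] if the request fails."""
    try:
        response = requests.get(url, timeout=10)  # don't let one dead link hang the loop
        response.raise_for_status()               # treat 4xx/5xx responses as failures
    except requests.RequestException:
        return []
    return paragraph_texts(response.content)
```

You could then call fetch_paragraphs(link) for each collected link, so one bad URL is skipped instead of crashing the whole run.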

Upvotes: 1

bmcculley

Reputation: 2088

You could change the way you get the original links, maybe something like:

links = [a['href'] for a in soup.find_all('a', href=True) if 'http' in a['href']]

for link in links:
    # code to fetch the current link and create soup of its html
    for a in soup.find_all('a', href=True):
        if 'http' in a['href']:
            links.append(a['href'])

It would then continue on to the newly added links until completion.
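One caveat with appending to the list you are looping over: if pages link back to each other, the same URL gets visited again and again. A hedged sketch of the same crawling idea with an explicit queue, a visited set, and a page limit (all names and the limit are my own choices, not from the answer):

```python
from collections import deque

import requests
from bs4 import BeautifulSoup

def absolute_links(html):
    """Return the absolute (http...) hrefs found in an HTML document."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith('http')]

def crawl_paragraphs(start_urls, max_pages=50):
    """Visit each queued link once, collect <p> text, and queue new links."""
    queue = deque(start_urls)
    visited = set()
    paragraphs = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip dead links instead of crashing
        soup = BeautifulSoup(response.content, 'html.parser')
        paragraphs.extend(p.get_text(strip=True) for p in soup.find_all('p'))
        queue.extend(absolute_links(response.content))
    return paragraphs
```

The visited set guarantees the crawl terminates even when pages link to each other, and max_pages caps the total work.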

Upvotes: 0
