Reputation: 167
I'm using Python to extract links from a page:
for link in soup.find_all('a', href=True):
    if 'http' in link['href']:
        links.append(link['href'])
How do I construct something that opens each link and extracts the text from, say, the "p" tags on the linked pages?
Upvotes: 0
Views: 1178
Reputation: 9909
You can use requests to get the HTML for the collected links and then parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

# get links (assumes `soup` is the already parsed starting page)
links = []
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        links.append(link['href'])

# visit the links and print the text of their p tags
for link in links:
    response = requests.get(link)
    page_soup = BeautifulSoup(response.content, 'html.parser')
    for p in page_soup.find_all('p'):
        print(p.text)
Or, without two iterations over the links:
import requests
from bs4 import BeautifulSoup

# get links and visit each one as soon as it is found
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        response = requests.get(link['href'])
        # use a separate name to avoid shadowing the starting page's soup
        page_soup = BeautifulSoup(response.content, 'html.parser')
        for p in page_soup.find_all('p'):
            print(p.text)
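In practice some of the collected links may be dead or slow, so a hardened variation of the visiting loop can help; this is a minimal sketch, assuming the links list from the first variant (the 10-second timeout is an arbitrary choice):

import requests
from bs4 import BeautifulSoup

for link in links:
    try:
        # fail fast on unreachable hosts and skip error responses
        response = requests.get(link, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print('skipping {}: {}'.format(link, e))
        continue
    page_soup = BeautifulSoup(response.content, 'html.parser')
    for p in page_soup.find_all('p'):
        print(p.text)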
Upvotes: 1
Reputation: 2088
You could change the way you get the original links, maybe something like:
links = [a['href'] for a in soup.find_all('a', href=True) if 'http' in a['href']]
for link in links:
    # code to create soup of the current link's html, e.g. with requests
    page_soup = BeautifulSoup(requests.get(link).content, 'html.parser')
    # append links not collected yet; the loop reaches them later
    for a in page_soup.find_all('a', href=True):
        if 'http' in a['href'] and a['href'] not in links:
            links.append(a['href'])
The loop would then continue on to the newly added links until no new ones turn up.
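Appending to a list while a for loop iterates over it works here because Python's for loop walks the list by index, so items appended during iteration are reached in later passes. A toy sketch of that growing-list pattern:

# the list grows while the loop runs; iteration ends once nothing new is added
items = [1, 2]
for x in items:
    if x < 5:
        items.append(x + 2)
print(items)  # [1, 2, 3, 4, 5, 6]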
Upvotes: 0