Reputation: 525
I searched for a keyword (cybersecurity) on a newspaper website and the results show around 10 articles. I want my code to grab the link and go to that link and get the whole article and repeat this to all the 10 articles in the page. (I don't want the summary, I want the whole article)
import urllib.request
import ssl
import time
from bs4 import BeautifulSoup
ssl._create_default_https_context = ssl._create_unverified_context
pages = [1]
for page in pages:
data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
soup = BeautifulSoup(data, 'html.parser')
for article in soup.find_all('div', class_="content_col"):
link = article.p.find('a')
print(link.attrs['href'])
for link in links:
headline = link.h1.find('div', class_= "padding_block")
headline = headline.text
print(headline)
content = link.p.find_all('div', class_= "entry")
content = content.text
print(content)
print()
time.sleep(3)
This is not working.
date = link.li.find('time', class_= "post_time")
Showing error :
AttributeError: 'NoneType' object has no attribute 'find'
This code is working and grabbing all the articles links. I want to include code that will add headline and content from every article link.
import urllib.request
import ssl
import time
from bs4 import BeautifulSoup
ssl._create_default_https_context = ssl._create_unverified_context
pages = [1]
for page in pages:
data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
soup = BeautifulSoup(data, 'html.parser')
for article in soup.find_all('div', class_="content_col"):
link = article.p.find('a')
print(link.attrs['href'])
print()
time.sleep(3)
Upvotes: 1
Views: 85
Reputation: 22440
Try the following script. It will fetch you all the titles along with their content. Put the highest number of pages you wanna go across.
import requests
from bs4 import BeautifulSoup
url = 'https://www.japantimes.co.jp/tag/cybersecurity/page/{}'
pages = 4
for page in range(1,pages+1):
res = requests.get(url.format(page))
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select(".content_col header p > a"):
resp = requests.get(item.get("href"))
sauce = BeautifulSoup(resp.text,"lxml")
title = sauce.select_one("header h1").text
content = [elem.text for elem in sauce.select("#jtarticle p")]
print(f'{title}\n{content}\n')
Upvotes: 2