Piyush Ghasiya

Reputation: 525

I want the links and all the content from each link

I searched for a keyword (cybersecurity) on a newspaper website, and the results show about 10 articles. I want my code to grab each link, follow it, fetch the whole article, and repeat this for all 10 articles on the page. (I don't want the summary; I want the whole article.)

import urllib.request
import ssl
import time
from bs4 import BeautifulSoup

ssl._create_default_https_context = ssl._create_unverified_context
pages = [1]
for page in pages:
    data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
    soup = BeautifulSoup(data, 'html.parser')

    for article in soup.find_all('div', class_="content_col"):
        link = article.p.find('a')
        print(link.attrs['href'])

        for link in links:  # NameError: 'links' is never defined
            # 'link' is still a tag from the results page; the article
            # page itself is never fetched here
            headline = link.h1.find('div', class_="padding_block")
            headline = headline.text
            print(headline)
            content = link.p.find_all('div', class_="entry")
            content = content.text  # find_all() returns a list, which has no .text
            print(content)

            print()

        time.sleep(3)

This is not working. For example, when I try to extract the date with

date = link.li.find('time', class_="post_time")

it shows this error:

AttributeError: 'NoneType' object has no attribute 'find'
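The error means link.li evaluated to None (there is no <li> descendant at that point in the tree), so calling .find on it fails. BeautifulSoup's find() returns None whenever nothing matches, so it helps to guard before navigating further. A minimal sketch of that pattern, using made-up HTML rather than the actual Japan Times markup:

from bs4 import BeautifulSoup

html = '<article><time class="post_time">2019-10-29</time></article>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None when no tag matches, so check before using the result
time_tag = soup.find('time', class_='post_time')
if time_tag is not None:
    print(time_tag.text)
else:
    print('no post_time tag on this page')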

The code below works and grabs all the article links. I want to add code that also extracts the headline and content from every article link.

import urllib.request
import ssl
import time
from bs4 import BeautifulSoup

# Skip certificate verification so urlopen works despite SSL errors
ssl._create_default_https_context = ssl._create_unverified_context
pages = [1]
for page in pages:
    data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
    soup = BeautifulSoup(data, 'html.parser')

    # Each search result is a div.content_col; the first <a> inside its
    # first <p> is the article link
    for article in soup.find_all('div', class_="content_col"):
        link = article.p.find('a')
        print(link.attrs['href'])
        print()
        time.sleep(3)

Upvotes: 1

Views: 85

Answers (1)

SIM

Reputation: 22440

Try the following script. It will fetch all of the titles along with their content. Set pages to the highest page number you want to crawl.

import requests
from bs4 import BeautifulSoup

url = 'https://www.japantimes.co.jp/tag/cybersecurity/page/{}'

pages = 4  # highest page number to crawl

for page in range(1, pages + 1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    # Each search result's article link matches .content_col header p > a
    for item in soup.select(".content_col header p > a"):
        # Follow the link and parse the full article page
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        title = sauce.select_one("header h1").text
        content = [elem.text for elem in sauce.select("#jtarticle p")]
        print(f'{title}\n{content}\n')
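If you want to keep the articles rather than just print them, the same loop can write one row per article to a CSV file. This is a sketch building directly on the script above; the articles.csv filename and the joining of paragraphs into a single string are my choices, not something the site or the original script dictates:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.japantimes.co.jp/tag/cybersecurity/page/{}'
pages = 4

# articles.csv is an arbitrary output name; one row per scraped article
with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'content'])
    for page in range(1, pages + 1):
        res = requests.get(url.format(page))
        soup = BeautifulSoup(res.text, "lxml")
        for item in soup.select(".content_col header p > a"):
            resp = requests.get(item.get("href"))
            sauce = BeautifulSoup(resp.text, "lxml")
            title = sauce.select_one("header h1").text
            # Join the article paragraphs into one string for the CSV cell
            content = " ".join(elem.text for elem in sauce.select("#jtarticle p"))
            writer.writerow([title, content])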

Upvotes: 2
