Reputation: 704
I am trying to scrape this websites: voxnews.info
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
web='https://voxnews.info'
def main(req, num, web):
    r = req.get(web+"/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h1.a.get_text(strip=True), x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
            for x in soup.select("div.site-content")]
    return goal

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 2)]  # need to scrape all the webpages in the website
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Category", "Content"])
        print(df)
But the code has two problems:
If you could have a look at the code and tell me how to improve it in order to fix these two issues, it would be awesome.
Upvotes: 0
Views: 92
Reputation: 99
In order to fetch results from all pages, not just one or a hardcoded ten, the best solution is to use an infinite while loop and test for something (a button or element) that will cause it to exit. This is better than a hardcoded for loop, since the while loop will go through all pages, no matter how many there are, until a certain condition is fulfilled. In our case, that condition is the presence of a "next" button on the page (the .next CSS selector):
if soup.select_one(".next"):
    page_num += 1
else:
    break
You can also add a limit on the number of pages; when it is reached, the loop will also stop:

limit = 20  # paginate through 20 pages

if page_num == limit:
    break
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

data = []
page_num = 1
limit = 20  # page limit

while True:
    html = requests.get(f"https://voxnews.info/page/{page_num}", headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    print(f"Extracting page: {page_num}")
    print("-" * 10)

    for result in soup.select(".entry-header"):
        title = result.select_one(".entry-title a").text
        category = result.select_one(".entry-meta:nth-child(1)").text.strip()
        date = result.select_one(".entry-date").text

        data.append({
            "title": title,
            "category": category,
            "date": date
        })

    # Condition for exiting the loop when the specified number of pages is reached.
    if page_num == limit:
        break

    if soup.select_one(".next"):
        page_num += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Italia invasa dai figli degli immigrati: “Italiani pezzi di merda” – VIDEO",
    "category": "BREAKING NEWS, INVASIONE, MILANO, VIDEO",
    "date": "Novembre 23, 2022"
  },
  {
    "title": "Soumahoro accusato di avere fatto sparire altri 200mila euro – VIDEO",
    "category": "BREAKING NEWS, POLITICA, VIDEO",
    "date": "Novembre 23, 2022"
  },
  {
    "title": "Città invase da immigrati: “Qui comandiamo noi” – VIDEO",
    "category": "BREAKING NEWS, INVASIONE, VENEZIA, VIDEO",
    "date": "Novembre 23, 2022"
  },
  # ...
]
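If you then want the same tabular output as in the question, the collected data (a list of dicts) can be loaded straight into a pandas DataFrame. A minimal sketch, not part of the original code above, assuming data was built as shown in the while loop:

import pandas as pd

# `data` is the list of dicts built in the while loop above
df = pd.DataFrame(data, columns=["date", "title", "category"])
print(df)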
There's a "13 ways to scrape any public data from any website" blog post if you want to know more about website scraping.
Upvotes: 0
Reputation: 1432
Some minor changes.
First, it isn't necessary to use requests.Session() for these requests - you aren't trying to save data (such as cookies) between requests.
I also made a minor change to how you had your with statement - I don't know if it's more correct or just how I do it, but you don't need all of the code to run while the executor is still open.
I gave you two options for parsing the date: either as it's written on the site (a string in Italian), or as a datetime object.
I didn't see any "p" tag within the articles, so I removed that part from the code. It seems that in order to get the "content" of the articles, you would have to navigate to and scrape each one individually (see the sketch after the code below).
In your original code, you weren't getting every article on the page, just the first one: there is only one "div.site-content" tag per page, but multiple "article" tags. That's what that change addresses.
And finally, I prefer find over select, but that's just a style choice. This worked for me for the first three pages; I didn't try the entire site. Be careful when you do run this - 78 blocks of 30 requests might get you blocked...
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import datetime

def main(num, web):
    r = requests.get(web+"/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    html = soup.find("div", class_="site-content")
    articles = html.find_all("article")

    # Date as a string, in Italian
    goal = [(x.time.get_text(), x.h1.a.get_text(strip=True), x.find("span", class_="cat-links").get_text(strip=True)) for x in articles]

    # OR as a datetime object
    goal = [(datetime.datetime.strptime(x.time["datetime"], "%Y-%m-%dT%H:%M:%S%z"), x.h1.a.get_text(strip=True), x.find("span", class_="cat-links").get_text(strip=True)) for x in articles]

    return goal

web = 'https://voxnews.info'
r = requests.get(web)
soup = BeautifulSoup(r.text, "html.parser")
last_page = soup.find_all("a", class_="page-numbers")[1].get_text()
last_int = int(last_page.replace(".", ""))

### BE CAREFUL HERE WITH TESTING, DON'T USE ALL 2,320 PAGES ###
with ThreadPoolExecutor(max_workers=30) as executor:
    fs = [executor.submit(main, num, web) for num in range(1, last_int)]

allin = []
for f in fs:
    allin.extend(f.result())

df = pd.DataFrame.from_records(
    allin, columns=["Date", "Title", "Category"])
print(df)
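As noted above, the listing pages don't contain the article bodies, so getting the "Content" column from the original question would mean requesting each article page separately. Below is a minimal, hedged sketch of that idea; the fetch_content helper and the "entry-content" class are assumptions about the site's WordPress markup rather than something taken from the code above, and the small delay is only there to reduce the chance of getting blocked:

import time
import requests
from bs4 import BeautifulSoup

def fetch_content(url, delay=1.0):
    # Hypothetical helper: fetch one article page and return its paragraph text.
    # Assumes the post body sits in a div with class "entry-content";
    # adjust the selector if the theme uses different markup.
    time.sleep(delay)  # simple throttle so the site isn't hammered
    r = requests.get(url, timeout=30)
    soup = BeautifulSoup(r.text, "html.parser")
    body = soup.find("div", class_="entry-content")
    if body is None:
        return ""
    return " ".join(p.get_text(strip=True) for p in body.find_all("p"))

# Example usage with the article tags collected in main():
# for x in articles:
#     content = fetch_content(x.h1.a["href"])

Scraping every article this way multiplies the number of requests, so keep the worker count and page range small while testing.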
Upvotes: 1