Reputation: 704
I am trying to scrape this websites: voxnews.info
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
web='https://voxnews.info'
def main(req, num, web):
    r = req.get(web+"/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h1.a.get_text(strip=True), x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
            for x in soup.select("div.site-content")]
    return goal

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 2)]  # need to scrape all the webpages in the website
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Category", "Content"])
        print(df)
But the code has two problems:
If you could have a look at the code and tell me how to improve it in order to fix these two issues, it would be awesome.
Upvotes: 0
Views: 92
Reputation: 99
In order to fetch results from all pages, not just one or a hardcoded ten, the best solution is to use an infinite while loop and test for something (a button or element) that will cause it to exit. This is better than a hardcoded for loop, since the while loop will go through all pages, no matter how many there are, until a certain condition is fulfilled. In our case, that condition is the presence of a "next" button on the page (the .next CSS selector):
if soup.select_one(".next"):
    page_num += 1
else:
    break
You can also add a limit on the number of pages; when it is reached, the loop will also stop:

limit = 20  # paginate through 20 pages

if page_num == limit:
    break
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

data = []
page_num = 1
limit = 20  # page limit

while True:
    html = requests.get(f"https://voxnews.info/page/{page_num}", headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    print(f"Extracting page: {page_num}")
    print("-" * 10)

    for result in soup.select(".entry-header"):
        title = result.select_one(".entry-title a").text
        category = result.select_one(".entry-meta:nth-child(1)").text.strip()
        date = result.select_one(".entry-date").text

        data.append({
            "title": title,
            "category": category,
            "date": date
        })

    # Condition for exiting the loop when the specified number of pages is reached.
    if page_num == limit:
        break

    if soup.select_one(".next"):
        page_num += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Italia invasa dai figli degli immigrati: “Italiani pezzi di merda” – VIDEO",
    "category": "BREAKING NEWS, INVASIONE, MILANO, VIDEO",
    "date": "Novembre 23, 2022"
  },
  {
    "title": "Soumahoro accusato di avere fatto sparire altri 200mila euro – VIDEO",
    "category": "BREAKING NEWS, POLITICA, VIDEO",
    "date": "Novembre 23, 2022"
  },
  {
    "title": "Città invase da immigrati: “Qui comandiamo noi” – VIDEO",
    "category": "BREAKING NEWS, INVASIONE, VENEZIA, VIDEO",
    "date": "Novembre 23, 2022"
  },
  # ...
]
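If you then want the same tabular output as in the question, the collected data (a list of dicts) can be loaded straight into a pandas DataFrame. A minimal sketch, not part of the original code above, assuming data was built as shown in the while loop:

import pandas as pd

# `data` is the list of dicts built in the while loop above
df = pd.DataFrame(data, columns=["date", "title", "category"])
print(df)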
There's a "13 ways to scrape any public data from any website" blog post if you want to know more about website scraping.
Upvotes: 0
Reputation: 1432
Some minor changes.
First, it isn't necessary to use requests.Session() for these requests - you aren't trying to save data (such as cookies) between requests.
I also made a minor change to how you had your with statement - I don't know if it's more correct or just how I do it, but you don't need all of the code to run while the executor is still open.
I gave you two options for parsing the date: either as it's written on the site (a string in Italian), or as a datetime object.
I didn't see any "p" tag within the articles, so I removed that part from the code. It seems that in order to get the "content" of the articles, you would have to navigate to and scrape each one individually (see the sketch after the code below).
In your original code, you weren't getting every article on the page, just the first one: there is only one "div.site-content" tag per page, but multiple "article" tags. That's what that change addresses.
And finally, I prefer find over select, but that's just a style choice. This worked for me for the first three pages; I didn't try the entire site. Be careful when you do run this - 78 blocks of 30 requests might get you blocked...
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import datetime

def main(num, web):
    r = requests.get(web+"/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    html = soup.find("div", class_="site-content")
    articles = html.find_all("article")

    # Date as a string, in Italian
    goal = [(x.time.get_text(), x.h1.a.get_text(strip=True), x.find("span", class_="cat-links").get_text(strip=True)) for x in articles]

    # OR as a datetime object
    goal = [(datetime.datetime.strptime(x.time["datetime"], "%Y-%m-%dT%H:%M:%S%z"), x.h1.a.get_text(strip=True), x.find("span", class_="cat-links").get_text(strip=True)) for x in articles]

    return goal

web = 'https://voxnews.info'
r = requests.get(web)
soup = BeautifulSoup(r.text, "html.parser")
last_page = soup.find_all("a", class_="page-numbers")[1].get_text()
last_int = int(last_page.replace(".", ""))

### BE CAREFUL HERE WITH TESTING, DON'T USE ALL 2,320 PAGES ###
with ThreadPoolExecutor(max_workers=30) as executor:
    fs = [executor.submit(main, num, web) for num in range(1, last_int)]

allin = []
for f in fs:
    allin.extend(f.result())

df = pd.DataFrame.from_records(
    allin, columns=["Date", "Title", "Category"])
print(df)
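As noted above, the listing pages don't contain the article bodies, so getting the "Content" column from the original question would mean requesting each article page separately. Below is a minimal, hedged sketch of that idea; the fetch_content helper and the "entry-content" class are assumptions about the site's WordPress markup rather than something taken from the code above, and the small delay is only there to reduce the chance of getting blocked:

import time
import requests
from bs4 import BeautifulSoup

def fetch_content(url, delay=1.0):
    # Hypothetical helper: fetch one article page and return its paragraph text.
    # Assumes the post body sits in a div with class "entry-content";
    # adjust the selector if the theme uses different markup.
    time.sleep(delay)  # simple throttle so the site isn't hammered
    r = requests.get(url, timeout=30)
    soup = BeautifulSoup(r.text, "html.parser")
    body = soup.find("div", class_="entry-content")
    if body is None:
        return ""
    return " ".join(p.get_text(strip=True) for p in body.find_all("p"))

# Example usage with the article tags collected in main():
# for x in articles:
#     content = fetch_content(x.h1.a["href"])

Scraping every article this way multiplies the number of requests, so keep the worker count and page range small while testing.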
Upvotes: 1