LdM

Reputation: 704

Problems with tags not correctly implemented

I am not familiar with scraping techniques, but I need to get the authors, titles, and dates from a website. I tried to write some code following tutorials and previous questions on Stack Overflow, but I still have difficulties selecting tags. I did as follows:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat
import re
import pandas as pd


def main(req, num):
    # req.get("https://www.politifact.com/factchecks/list/?page=".format(num)+"&ruling=false")
    r = req.get("https://www.politifact.com/factchecks/list/?ruling=false/page={}/".format(num))
    
    soup = BeautifulSoup(r.content, 'html.parser')
    for x in soup.findAll('main', {'class',"global-content"}):
        print('\n',x.select_one("div > a:m-statemet__author").text, sep='\n') #heading ok
        print('\n',x.select_one("div > a:m-m-statement__desc").text, sep='\n') 
        print('\n',x.select_one("div > a:m-statement__quote").text, sep='\n') 
        print('\n', x.select_one('div > footer:m-statement__footer').text, sep='\n') 
    return [(x.select_one("div > a:m-statemet__author").text, x.select_one("div > a:m-m-statement__desc").text, x.select_one("div > a:m-statement__quote").text, x.select_one('div > footer:m-statement__footer').text) for x in soup.findAll('main', {'class',"global-content"})]

with ThreadPoolExecutor(max_workers=50) as executor:
    with requests.Session() as req:
        fs = executor.map(main, repeat(req), range(1, 100)) # I would like to scrape all the pages, if possible
        final = []
        [final.extend(f) for f in fs]
        df = pd.DataFrame.from_records(
            final, columns=["Author", "Date", "Title","Fact_Date"])

Getting the error:

NotImplementedError: ':m-statemet__author' pseudo-class is not implemented at this time

I understand that the problem is in the tag selection. I would appreciate it if you could point me to the right tags.
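
If I read the error correctly, BeautifulSoup's selector engine treats everything after a colon as a pseudo-class (like :hover), so div > a:m-statemet__author is parsed as an unknown pseudo-class rather than a class. As far as I understand, a class is matched with a dot instead, roughly like this (the class name here is just a placeholder, since I don't know the right ones):

x.select_one("div > a.some-class")  # "." matches a class; ":" would start a pseudo-class

But I still don't know which tags and classes this page actually uses.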

Upvotes: 0

Views: 76

Answers (2)

Arthur Pereira

Reputation: 1559

I'll try to explain myself, so feel free to ask if I'm not clear.

First, you are passing the wrong URL in this code sample. I reworked the code with the correct one you provided in the comments.

Then, once you have the page source, you have to select the tags to scrape. I did this by getting all the containers with find_all. From there I can loop through every item and look for the information we need: in this case the type, date, title, and author. We append this information to a list of rows and go to the next item.

At the end we build one df with all the information.

Here is the code:

from bs4 import BeautifulSoup
import requests
import time
import pandas as pd

page_start = 1
n_pages = 10
seconds_pause = 2  # small pause between requests to be polite to the server

rows = []

for page_number in range(page_start, n_pages + 1):
    try:
        page_url = 'https://www.politifact.com/factchecks/list/?page={}&ruling=false'.format(page_number)
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # every fact-check on the page is one <li class="o-listicle__item">
        containers = soup.find_all("li", class_="o-listicle__item")

        for item in containers:
            rows.append({'Type': item.find("a", class_="m-statement__name").text.strip(),
                         'Date': item.find("div", class_="m-statement__desc").text.strip(),
                         'Title': item.find("div", class_="m-statement__quote").text.strip(),
                         'Author': item.find("footer", class_="m-statement__footer").text.strip()})

        time.sleep(seconds_pause)
    except Exception as e:
        print('\nfailed to scrape page {}\n{}\n'.format(page_number, e))

df = pd.DataFrame(rows, columns=['Type', 'Date', 'Title', 'Author'])
print(df)
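
Note that I build the DataFrame once at the end instead of appending inside the loop: DataFrame.append copies the whole frame on every call and was removed in pandas 2.0. If you want to keep the results, writing them to disk is one extra line (the file name is just an example):

df.to_csv("politifact_factchecks.csv", index=False)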

Upvotes: 1

Samsul Islam

Reputation: 2609

I was trying to solve it. Here is my code.

import requests as req
from bs4 import BeautifulSoup

r = req.get("https://www.politifact.com/factchecks/list/?page={}&ruling=false".format(2))

soup = BeautifulSoup(r.content, 'html.parser')
# the whole list of fact-checks sits inside <article class="o-listicle__content">
for x in soup.find_all('article', class_="o-listicle__content"):
    authors = x.select("footer.m-statement__footer")
    for author in authors:
        print(author.text)
    titles = x.select("div.m-statement__quote")
    for title in titles:
        print(title.text)
    dates = x.select("div.m-statement__desc")
    for date in dates:
        print(date.text[:27])  # keep only the leading part of the description
    myresults = [(a.text.strip(), t.text.strip(), d.text.strip())
                 for a, t, d in zip(authors, titles, dates)]
    print(myresults)
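
If you still want the threaded version from your question, the same class names plug straight in. Here is a minimal sketch, assuming the corrected ?page={}&ruling=false URL, that every item has all four fields, and that 99 pages exist (adjust the range and max_workers to whatever the site actually has and tolerates):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat
import pandas as pd

def scrape_page(session, num):
    # the page number belongs in the query string, not in the path
    r = session.get("https://www.politifact.com/factchecks/list/?page={}&ruling=false".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    # one <li class="o-listicle__item"> per fact-check
    return [(item.find("a", class_="m-statement__name").text.strip(),
             item.find("div", class_="m-statement__desc").text.strip(),
             item.find("div", class_="m-statement__quote").text.strip(),
             item.find("footer", class_="m-statement__footer").text.strip())
            for item in soup.find_all("li", class_="o-listicle__item")]

with requests.Session() as session:
    with ThreadPoolExecutor(max_workers=10) as executor:
        final = []
        for page_rows in executor.map(scrape_page, repeat(session), range(1, 100)):
            final.extend(page_rows)

df = pd.DataFrame.from_records(final, columns=["Author", "Date", "Title", "Fact_Date"])
print(df)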

Upvotes: 1
