Cleaning up a scraped HTML List

Question

I'm trying to extract names from a wiki page. Using BeautifulSoup I am able to get a very dirty list (including lots of extraneous items) that I want to clean up, however my attempt to 'sanitise' the list leaves it unchanged.

#1).
#Retreive the page
import requests
from bs4 import BeautifulSoup
weapons_url = 'https://escapefromtarkov.gamepedia.com/Weapons'
weapons_page = requests.get(weapons_url)
weapons_soup = BeautifulSoup(weapons_page.content, 'html.parser')

#2).    
#Attain the data I need, plus lot of unhelpful data   
flithy_scraped_weapon_names = weapons_soup.find_all('td', href="", title="")

#3a).
#Identify keywords that reoccur in unhelpful:extraneous list items
dirt = ["mm", "predecessor", "File", "image"]
#3b). - Fails
#Remove extraneous data containing above-defined keywords
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

#4).
#Check data
print(weapon_names_sanitised)
#Returns  a list identical to flithy_scraped_weapon_names

Grismar · Accepted Answer

The problem is in this section:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

It should instead be:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in str(s) for xs in dirt)]

The reason is that flithy_scraped_weapon_names contains Tag objects, which will be cast to a string when printed, but need to be explicitly cast to a string for xs in str(s) to work as expected.

Cleaning up a scraped HTML List

Answers (1)

Related Questions