Reputation: 533
I'm trying to extract names from a wiki page. Using BeautifulSoup, I can get a very dirty list (including lots of extraneous items) that I want to clean up. However, my attempt to 'sanitise' the list leaves it unchanged.
#1).
#Retrieve the page
import requests
from bs4 import BeautifulSoup
weapons_url = 'https://escapefromtarkov.gamepedia.com/Weapons'
weapons_page = requests.get(weapons_url)
weapons_soup = BeautifulSoup(weapons_page.content, 'html.parser')
#2).
#Get the data I need, plus a lot of unhelpful data
flithy_scraped_weapon_names = weapons_soup.find_all('td', href="", title="")
#3a).
#Identify keywords that recur in unhelpful/extraneous list items
dirt = ["mm", "predecessor", "File", "image"]
#3b). - Fails
#Remove extraneous data containing above-defined keywords
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
if not any(xs in s for xs in dirt)]
#4).
#Check data
print(weapon_names_sanitised)
#Returns a list identical to flithy_scraped_weapon_names
Upvotes: 0
Views: 63
Reputation: 31319
The problem is in this section:
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
if not any(xs in s for xs in dirt)]
It should instead be:
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
if not any(xs in str(s) for xs in dirt)]
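The reason is that flithy_scraped_weapon_names contains Tag objects, not strings. A Tag is converted to a string automatically when printed, but xs in s does not search the tag's markup: on a Tag, the in operator checks membership among the tag's child nodes, so a keyword such as "mm" is never found. Each Tag needs to be converted explicitly with str(s) for the substring test xs in str(s) to work as expected.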
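To see the difference without fetching the wiki page, here is a minimal sketch. FakeTag is a hypothetical stand-in I made up for illustration, not the real bs4 Tag class; it only assumes that in looks at child nodes while str() renders the markup. The drop_dirty helper and the sample cell contents are likewise invented for the example.

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 Tag (an assumption, not the real class)."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    def __contains__(self, item):
        # Membership is tested against the direct children only,
        # so a substring of a child's text never matches.
        return item in self.children

    def __str__(self):
        # Render the element as markup, the way printing a tag would.
        inner = "".join(str(child) for child in self.children)
        return f"<{self.name}>{inner}</{self.name}>"


def drop_dirty(items, dirt):
    """Keep only items whose string form contains none of the keywords."""
    return [item for item in items
            if not any(word in str(item) for word in dirt)]


# Made-up sample cells standing in for the scraped <td> elements.
tags = [
    FakeTag("td", ["AK-74N"]),
    FakeTag("td", ["5.45x39mm"]),
    FakeTag("td", ["File:ak.png"]),
]
dirt = ["mm", "predecessor", "File", "image"]

# Original approach: `word in tag` compares against children, nothing is removed.
unchanged = [t for t in tags if not any(word in t for word in dirt)]

# Fixed approach: test against the string form of each tag.
cleaned = drop_dirty(tags, dirt)

print(len(unchanged), len(cleaned))
print([str(t) for t in cleaned])
```

With the real page, the same drop_dirty call should work on the list returned by find_all, since it only relies on str() of each element.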
Upvotes: 1