Reputation: 53
I have been trying to parse out text that isn't wrapped in any tags of its own. I wanted to build a little scraping tool for myself to help find good DND games to play on Roll20 (my final goal was to take this data and attach it to a table, with a link for each game).
The URL I am parsing info from is here: Roll20 Link
I had an idea to try to parse out the text and then put each new line into a list of its own and grab the elements needed. I wanted to grab the info on the game, current players, and current open slots. Here is the code I have done so far. Any suggestions on what I might need to do to scrape this particular data?
Here is my code:
import requests
from bs4 import BeautifulSoup
import time
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
url = r'https://app.roll20.net/lfg/search//?page=0&days=thursday,friday&dayhours=1652932800,1653019200&frequency=onceweekly,biweekly,monthly&timeofday=&timeofday_seconds=&language=English&avpref=Any&gametype=Any&newplayer=false&yesmaturecontent=false&nopaytoplay=false&playingstructured=dnd_next&sortby=relevance&for_event=&roll20con='
r = requests.get(url, headers = headers)
soup = BeautifulSoup(r.text, 'html.parser')
time.sleep(2)
games = soup.find_all('tr', {'class': 'lfglisting'})
game_urls = []
for item in games:
    # item_title = item.find('a', {'class': 'lfglistingname'}).text
    # item_url = 'https://app.roll20.net' + item.find('a', {'class': 'lfglistingname'})['href']
    current_players = item.get_text("\n", strip=True)
    print(current_players)
    # try:
    #     item_game = item.find('strong', {'class': 'label label-success'}).text
    # except:
    #     item_game = 'Role-Playing Game'
    # try:
    #     item_pay = item.find('strong', {'class': 'label label-danger'}).text
    # except:
    #     item_pay = 'Free to Play'
    # try:
    #     item_welcome = item.find('strong', {'class': 'label label-info'}).text
    # except:
    #     item_welcome = 'Experts Only'
    # print(f"Game: {item_title}. URL: {item_url}. Notes on Game: {item_game}, {item_pay}, {item_welcome}")
    # game_urls.append(item_url)
# print(game_urls)
Upvotes: 0
Views: 58
Reputation: 1598
I started off by looking at the source code of the page and searching for a known string (like part of a game description).
It seems every description is inside a <td class='gminfo'>, but its parent element, the <tr>, is more interesting, as it contains all the desired data.
Notice that all of these <tr> tags have something in common: the data-listingid attribute.
So let's get all of those.
for x in soup.select('tr[data-listingid]'):
    print(x.text.strip())
Then we start parsing, with regex.
import re

def print_data(dct):
    # Print each field as "Name ----- value", padded to a fixed width.
    for item, amount in dct.items():
        print(f"{item} {'-'*(30 - len(item))} {amount}")

soup = BeautifulSoup(r.text, 'html.parser')
listings = soup.select('tr[data-listingid]')
listings_count = len(listings)
print(f"Expecting {listings_count} listings")
parsed_listings = []
for listing in listings:
    text = listing.text.strip()
    try:
        # Raw strings keep "\n" and "\(" as regex escapes rather than string escapes.
        name = re.search(r"\n{6}(.*)", text).group(1)
        info = re.search(r"\n{3} (.*)", text).group(1) + "..."
        current_players = re.search(r"(.*) Current Players", text).group(1)
        open_slots = re.search(r"\((.*) Open Slots", text).group(1)
    except AttributeError:
        # re.search returned None: the listing doesn't fit the patterns, so skip it.
        continue
    game = {"Name": name, "Info": info, "Current_Players": current_players, "Open_Slots": open_slots}
    parsed_listings.append(game)
    print_data(game)
    print("\n=======\n")
print(f"parsed {len(parsed_listings)} of {listings_count} total")
Gives:
Expecting 30 listings
Name -------------------------- Curse of Strahd - Grim Hollow/High RP
Info -------------------------- Take this opportunity to play the most popular D&D module ever made with an expert DM who cares about your backstory and wants to...
Current_Players --------------- 1
Open_Slots -------------------- 5
=======
Name -------------------------- The Dragon of Icespire Peak (Monday)
Info -------------------------- Dragon of Icespire Peak is the introductory adventure for the 5th Edition Starter Set, designed for PC levels 1 – 6. It is a...
Current_Players --------------- 1
Open_Slots -------------------- 6
=======
Name -------------------------- Necropolis
Info -------------------------- What ancient horrors lie slumbering in a newly discovered tomb deep in Egypt's Valley of the Kings? Are you allowing local superstitions and the...
Current_Players --------------- 1
Open_Slots -------------------- 4
=======
Name -------------------------- Weekly One-shots (Monday)
Info -------------------------- My car for my primary means of income (Uber) has died and I'm **urgently** trying to raise funds to replace it. If you'd like...
Current_Players --------------- 1
Open_Slots -------------------- 7
=======
Name -------------------------- dragonball z
Info -------------------------- hello all those to whom love dragonball z! i have never DM before but i am willing to give it a chance. im trying...
Current_Players --------------- 1
Open_Slots -------------------- 3
=======
Name -------------------------- Weekly One-shots (Monday)
Info -------------------------- My car for my primary means of income (Uber) has died and I'm **urgently** trying to raise funds to replace it. If you'd like...
Current_Players --------------- 1
Open_Slots -------------------- 7
=======
Name -------------------------- Larula's Tomb
Info -------------------------- 3 Hour, Level 3 One Shot. Gritty, old school feel. Death possible. Backup characters provided. Roll 3d6 straight for stats. Roll for HP. The...
Current_Players --------------- 1
Open_Slots -------------------- 6
=======
Name -------------------------- Vast Stories of Erstonia
Info -------------------------- Vast Stories of Erstonia is a D&D 5e group devoted to playing a series of oneshots provided by the DM. The adventures will be...
Current_Players --------------- 1
Open_Slots -------------------- 4
=======
Name -------------------------- Beasts of Fortune 2
Info -------------------------- The Beasts of Fortune seeks adventures seeking fame, fortune, honor, or just a reason to smack some heads, come one come all to join...
Current_Players --------------- 1
Open_Slots -------------------- 20
=======
...
parsed 22 of 30 total
This is by no means a perfect solution, and the parsing isn't perfect at all, but it should get you going.
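If you want the two counts as numbers, a slightly stricter variant of those patterns might look like the sketch below. It swaps .* for \d+ so only digits are captured, and returns None for a missing field instead of raising. The sample line and the extract_counts helper are made up for illustration; the real listing text may be laid out differently.

```python
import re

# Hypothetical sample of the text a listing row might contain;
# the real Roll20 text may differ.
sample = "1 Current Players (5 Open Slots)"

# \d+ captures digits only, unlike the broader (.*) used above.
PLAYERS_RE = re.compile(r"(\d+) Current Players")
SLOTS_RE = re.compile(r"\((\d+) Open Slots")

def extract_counts(text):
    """Return (current_players, open_slots) as ints, with None for a missing field."""
    players = PLAYERS_RE.search(text)
    slots = SLOTS_RE.search(text)
    return (
        int(players.group(1)) if players else None,
        int(slots.group(1)) if slots else None,
    )

print(extract_counts(sample))            # (1, 5)
print(extract_counts("no counts here"))  # (None, None)
```

Getting None back for a field also makes it easier to see which pattern failed on the listings that don't parse.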
Of course, run this over each page number you want (the /?page=0 in the URL).
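One way to build the URL for each page is to rewrite the page query parameter, roughly as sketched here. The with_page helper is a hypothetical name, the base URL is shortened from the one in the question just to keep the example readable, and range(3) is an arbitrary page count, since I don't know how many pages the search returns.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page(url, page):
    """Return url with its 'page' query parameter set to the given page number."""
    parts = urlsplit(url)
    # Keep every other parameter (including empty ones) and drop the old page value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    query.insert(0, ("page", str(page)))
    # safe="," leaves comma-separated values such as "thursday,friday" unescaped.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))

# Shortened version of the search URL, for illustration only.
base = "https://app.roll20.net/lfg/search//?page=0&days=thursday,friday&language=English"

for n in range(3):  # arbitrary number of pages
    print(with_page(base, n))
```

Each URL can then be passed to requests.get with the same headers as before, ideally with a short pause between requests.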
If you want the full description of a listing, you're going to have to GET it, specifically by following the Read More <a> tag.
But then you can't rely on listing.text, as it strips the tags (and the link with them) away.
Also, this isn't legal advice or anything, but I wouldn't be surprised if this is against their site policy, so be wary.
Upvotes: 1