ASolidBPlus

Reputation: 11

Python - Changing BeautifulSoup URL's displayed content?

I started programming about a week ago and am working on a Python scraper that uses BeautifulSoup to collect wrestling metadata from https://cagematch.net.

Here is my code:

from BeautifulSoup import BeautifulSoup
import urllib2

link = "https://www.cagematch.net/?id=8&nr=12&page=4"
print link
url = urllib2.urlopen(link) #Cagematch URL for PWG Events
content = url.read()
soup = BeautifulSoup(content)

events = soup.findAll("tr", { "class" : "TRow" }) #Captures all event classes into a list, each event on site is separated by '<tr class="TRow">'

for i in events[1:12]: #For each event, only searches over a year's scope
  data = i.findAll("td", { "class" : "TCol TColSeparator"}) #Captures each cell of an event into a list, cells are marked by '<td class="TCol TColSeparator">'
  date = data[0].text #Grabs Date of show, date of show is always first value of "data" list
  show = data[1].text #Grabs name of show, name of show is always second value of "data" list
  status = data[2].text #Grabs event type, if "Event (Card)" show hasn't occurred, if "Event" show has occurred.

  print date, show, status

  if status == "Event": #If event has occurred, get card data
    print "Event already taken place"
    link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
    print content

So the idea is:

  1. Search all event listings from the original link and get each show's date, name, and status (whether it has already taken place).
  2. If the event has taken place, go to its card page and get the card content (not completed yet; for now it just prints the card page).

Step 1 works perfectly: it goes to the site and gets what it needs. Step 2 does not.

I reassign my "link" variable inside the if statement, and it does change to the correct link. However, when I print content again, I still get the original page from when I first assigned link.

If I re-declare all the variables it works, but surely there's another way to do this?

Upvotes: 1

Views: 81

Answers (1)

alecxe

Reputation: 473903

Reassigning the link variable does not change the page content: content still holds the data that was downloaded from the original URL. You have to request and download the page from the new link:

link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
url = urllib2.urlopen(link) 
content = url.read()
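
Since the same download step is needed for every link, it could be wrapped in a small helper. The following is only a sketch, assuming Python 3 (where urllib2 became urllib.request); the names fetch and opener are hypothetical and not part of the original code. The opener parameter exists only so the helper can be tried without network access.

```python
from urllib.request import urlopen


def fetch(link, opener=urlopen):
    """Download `link` and return the raw page body as bytes.

    `opener` is a hypothetical hook (defaults to urlopen) so the
    function can be exercised with a stand-in instead of the network.
    """
    response = opener(link)
    try:
        return response.read()
    finally:
        # Release the connection even if read() raises.
        response.close()
```

Each call returns the body of whatever link it was given, so inside the if statement you would call content = fetch(link) again after building the new link, instead of reusing the earlier content.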

Some other notes:

  • you are using a very outdated BeautifulSoup version 3. Update to BeautifulSoup 4:

    pip install beautifulsoup4 --upgrade
    

    and change your import to:

    from bs4 import BeautifulSoup
    
  • you can improve performance by switching to requests and reusing the same session for multiple requests to the same domain

  • it is recommended to use urljoin() to concatenate parts of a URL
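
The last two notes can be sketched together. This is a minimal illustration, assuming Python 3 (urljoin lives in urllib.parse there); BASE_URL, card_url and fetch_page are hypothetical names, and the session argument is assumed to be a requests.Session() or any object with a compatible get() method.

```python
from urllib.parse import urljoin

# Assumed site root; the hrefs in the event table are resolved against it.
BASE_URL = "https://www.cagematch.net/"


def card_url(href, base=BASE_URL):
    """Resolve an href from the event table into an absolute URL.

    urljoin copes with hrefs that start with "?" or "/" or that are
    already absolute, which plain string concatenation does not.
    """
    return urljoin(base, href)


def fetch_page(session, link, timeout=10):
    """Download `link` through a shared session and return its text.

    `session` is expected to behave like requests.Session(), so the
    underlying connection can be reused across requests to the same host.
    """
    response = session.get(link, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.text
```

With requests installed, usage would look roughly like session = requests.Session() created once before the loop, then content = fetch_page(session, card_url(href)) for each event link.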

Upvotes: 2
