Reputation: 11
I'm new to programming as of a week or so ago, and I'm working on a scraper in Python to get wrestling metadata from https://cagematch.net using BeautifulSoup.
Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2

link = "https://www.cagematch.net/?id=8&nr=12&page=4"  # Cagematch URL for PWG Events
print link
url = urllib2.urlopen(link)
content = url.read()
soup = BeautifulSoup(content)
events = soup.findAll("tr", {"class": "TRow"})  # Captures all event rows into a list; each event on the site is a '<tr class="TRow">'
for i in events[1:12]:  # For each event; only searches over a year's scope
    data = i.findAll("td", {"class": "TCol TColSeparator"})  # Captures each cell of an event into a list item, separated by '<td class="TCol TColSeparator">'
    date = data[0].text  # Date of show is always the first value of the "data" list
    show = data[1].text  # Name of show is always the second value of the "data" list
    status = data[2].text  # Event type: "Event (Card)" means the show hasn't occurred; "Event" means it has
    print date, show, status
    if status == "Event":  # If the event has occurred, get card data
        print "Event already taken place"
        link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
        print content
So the idea is: (1) scrape the date, name, and status of each event, and (2) if an event has already taken place, follow its link and print that page's content.
Part 1 works perfectly; it goes to the site just fine and gets what it needs. Part 2 does not.
I re-assign my "link" variable in the if statement, and it changes to the correct link. However, when I print content again, it still shows the original page from when I first assigned link.
If I re-assign all the variables it works, but surely there's another way to do this?
Upvotes: 1
Views: 81
Reputation: 473903
Redefining the link variable does not, by itself, change the page content - you have to request and download the page from the new link:
link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
url = urllib2.urlopen(link)
content = url.read()
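In other words, content is just a string snapshot of the first download; rebinding link never touches it. A minimal sketch of the refetch pattern (Python 3 standard library; the names here are illustrative, not from the question's code):

```python
import urllib.request

def fetch(url):
    # Each call performs a fresh request and returns the body as text.
    resp = urllib.request.urlopen(url)
    try:
        return resp.read().decode("utf-8", errors="replace")
    finally:
        resp.close()

# content = fetch(link)   # snapshot of the first page
# link = new_link         # rebinding the name does NOT refresh `content`
# content = fetch(link)   # must fetch again to get the new page's content
```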
Some other notes:

- You are using the very outdated BeautifulSoup version 3. Upgrade to BeautifulSoup 4:

  pip install beautifulsoup4 --upgrade

  and change your import to:

  from bs4 import BeautifulSoup

- You may improve performance by switching to requests and reusing the same session for multiple requests to the same domain.

- It is recommended to use urljoin() to concatenate parts of a URL.
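Putting the notes together, here is a Python 3 sketch of the same scraper. The TRow/TCol class names and the cell indices are carried over from the question's code; they are assumptions about cagematch.net's current markup, not verified against the live site:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def parse_events(html):
    """Extract (date, show, status, card_href) tuples from an event-list page."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for row in soup.find_all("tr", class_="TRow"):
        cells = row.find_all("td", class_="TCol")
        if len(cells) < 5:
            continue  # skip header or malformed rows
        link_tag = cells[4].find("a", href=True)  # card link assumed in the 5th cell, as in the question
        href = link_tag["href"] if link_tag else None
        events.append((cells[0].get_text(), cells[1].get_text(),
                       cells[2].get_text(), href))
    return events

def scrape(list_url):
    # One Session reuses the underlying connection across requests to the same host.
    with requests.Session() as session:
        response = session.get(list_url)
        response.raise_for_status()
        for date, show, status, href in parse_events(response.text):
            print(date, show, status)
            if status == "Event" and href:
                # urljoin resolves relative hrefs like "?id=1&nr=2" against the page URL
                card = session.get(urljoin(list_url, href))
                card.raise_for_status()
                print(card.text)  # the new page's content, fetched explicitly
```

Note that the card page is downloaded again inside the loop each time - that explicit re-request is exactly what the question's original code was missing.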
Upvotes: 2