Reputation: 13
I'm writing a script using Beautiful Soup that scrapes the start and end dates of every Wimbledon tennis tournament from 1999-2003 from their respective Wikipedia pages.
I want to get the date range as a list of selectable objects, and I have written a script that achieves this:
from bs4 import BeautifulSoup
import requests
import re
import os
year = 1999
url = "https://en.wikipedia.org/wiki/{tourneyYear}_Wimbledon_Championships"
range = [1, 2, 3, 4, 5,]
for number in range:
    response = requests.get(url.format(tourneyYear=year))
    text = response.text
    soup = BeautifulSoup(text, "html.parser")
    overviewTable = soup.find('table', attrs ={'class':"infobox vevent"})
    date = overviewTable.find('th', attrs={"scope":"row"}).parent
    specialResult = date.find('td')
    for sentence in specialResult:
        words = sentence.split()
        print(words)
    year += 1
The loop iterates through the webpages ('year' increases by 1 each time and slots into the URL structure that I've defined - this part works perfectly, by the way) and it is supposed to print the list at the end.
For the first two iterations of the loop (for the 1999 and 2000 Wimbledon pages), the list prints just fine. But on the third iteration it returns the following error:
Traceback (most recent call last):
  File "XYZ", line 21, in <module>
    words = sentence.split()
TypeError: 'NoneType' object is not callable
The HTML structure of the relevant part of each webpage is identical (as far as I can tell), and the loop only fails for the 2001 iteration (I know this because if I set the loop to iterate for any five-year range that doesn't include 2001, it works just fine).
Is there an error in my code, or is that particular webpage different in some way that I haven't noticed? I've been racking my brains for days on this one, to no avail!
Upvotes: 0
Views: 2739
Reputation: 17761
TL;DR: You need to remove the for-loop and use get_text() in order to obtain the text of the element and then split() it:
date = overviewTable.find('th', attrs={"scope":"row"}).parent
words = date.find('td').get_text().split()
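Here is a rough sketch of how the fix could slot into your loop. The helper name date_words, the SAMPLE_HTML snippet and the removal of the footnote marker are my own additions for illustration, not something your script already has. The sample is only modelled on the row you are scraping, so the real pages may differ.

```python
from bs4 import BeautifulSoup

URL_TEMPLATE = "https://en.wikipedia.org/wiki/{tourneyYear}_Wimbledon_Championships"

# Hypothetical stand-in for the relevant part of the 2001 page.
SAMPLE_HTML = (
    '<table class="infobox vevent"><tr><th scope="row">Date</th>'
    '<td>25 June \u2013 9 July'
    '<sup class="reference"><a href="#cite_note-info-1">[1]</a></sup>'
    '</td></tr></table>'
)


def date_words(html):
    """Return the words of the first infobox row's data cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "infobox vevent"})
    row = table.find("th", attrs={"scope": "row"}).parent
    cell = row.find("td")
    # Remove footnote markers such as <sup>[1]</sup>, otherwise
    # get_text() would glue "[1]" onto the last word.
    for sup in cell.find_all("sup"):
        sup.decompose()
    return cell.get_text().split()


# Build the five URLs directly instead of keeping a manual counter.
urls = [URL_TEMPLATE.format(tourneyYear=year) for year in range(1999, 2004)]

print(date_words(SAMPLE_HTML))
```

In the real script you would pass requests.get(u).text for each u in urls to date_words. Iterating over range(1999, 2004) also avoids reusing the name range for your own list.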
Explanation:
find() does not return a list of strings, it returns a single Tag object. Therefore, what you have in your specialResult is a Tag object.
When you iterate over a Tag object, you can get two types of items: strings (for text) and other Tag objects (for inner elements). Your code is failing because specialResult does not contain just text, but also a sub-element:
[u'25 June \u2013 9 July', <sup class="reference" id="cite_ref-info_1-0"><a href="#cite_note-info-1">[1]</a></sup>]
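The sup element here is not a string but a Tag object, and a Tag has no split() method. Beautiful Soup treats an unknown attribute on a Tag as a search for a child tag with that name, so sentence.split evaluates to None, and calling None is what raises the "'NoneType' object is not callable" error.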
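If you want to see this for yourself without hitting Wikipedia, something like the following should reproduce it. The html string is a made-up snippet modelled on the output above, and filtering with isinstance is just one possible way to keep your original loop; treat it as a sketch rather than the definitive fix.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical snippet modelled on the cell contents shown above.
html = (
    '<table><tr><th scope="row">Date</th>'
    '<td>25 June \u2013 9 July'
    '<sup class="reference" id="cite_ref-info_1-0">'
    '<a href="#cite_note-info-1">[1]</a></sup></td></tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("th", attrs={"scope": "row"}).parent.find("td")

# Iterating over a Tag yields its direct children: text nodes and nested tags.
kinds = [type(child).__name__ for child in cell]
print(kinds)

# On a Tag, .split is looked up as a child tag called "split", which does not exist.
missing = cell.find("sup").split
print(missing)

# Keeping the loop but skipping anything that is not a text node.
words = []
for child in cell:
    if isinstance(child, NavigableString):
        words.extend(child.split())
print(words)
```

The isinstance check works because NavigableString is a subclass of str, so split() behaves exactly as it does on an ordinary string.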
Upvotes: 2