Reputation: 98
import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text)
tables = soup.find_all("table")
I'm trying to get a list of song names from the "List of singles" table on Taylor Swift's discography page.
The table has no unique class or id. The only unique thing I can think of is the caption tag around "List of singles...":
List of singles as main artist, with selected chart positions, sales figures and certifications
I tried:
table = soup.find_all("caption")
but it returns nothing. I'm assuming that caption is not a recognized tag in bs4?
Upvotes: 3
Views: 1906
Reputation: 88118
Here is a complete example that solves the "Taylor Swift problem". First, look for the caption that contains the text "List of singles" and move to its parent object. Next, iterate over the items that contain the text you are looking for:
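# Stop at the first caption that mentions the singles table
for caption in soup.findAll("caption"):
    if "List of singles" in caption.text:
        break
# The caption's parent is the table itself
table = caption.parent
for item in table.findAll("th", {"scope": "row"}):
    print(item.text)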
This gives:
"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
...
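The caption-to-parent lookup above can also be tried offline on a small inline document. This is only a minimal sketch: the HTML string and the helper name titles_under_caption are made up for illustration and merely mimic the structure described in the question, and "html.parser" is chosen so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two captioned tables, no class or id.
HTML = """
<table>
  <caption>List of studio albums</caption>
  <tr><th scope="row">Fearless</th><td>2008</td></tr>
</table>
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"Tim McGraw"</th><td>2006</td></tr>
  <tr><th scope="row">"Our Song"</th><td>2007</td></tr>
</table>
"""


def titles_under_caption(soup, needle):
    """Return row-header texts from the first table whose caption contains needle."""
    for caption in soup.find_all("caption"):
        if needle in caption.get_text():
            table = caption.find_parent("table")
            return [th.get_text(strip=True)
                    for th in table.find_all("th", scope="row")]
    return []  # no caption matched


soup = BeautifulSoup(HTML, "html.parser")
singles = titles_under_caption(soup, "List of singles")
print(singles)
```

Unlike the loop with break, this helper returns an empty list when no caption matches, instead of silently continuing with the last caption it saw.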
Upvotes: 1
Reputation: 473813
It actually has nothing to do with findAll() versus find_all(). findAll() was used in BeautifulSoup3 and was kept in BeautifulSoup4 for compatibility reasons. To quote bs4's source code:
def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    generator = self.descendants
    if not recursive:
        generator = self.children
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
findAll = find_all       # BS3
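A quick way to convince yourself of this, and to confirm that caption is found like any other tag, is to run both spellings on the same document and compare the results. This is a sketch under assumptions: the two-table HTML string is invented for the check, and the warnings filter is only there because newer bs4 releases may emit a DeprecationWarning for the old name.

```python
import warnings

from bs4 import BeautifulSoup

# Invented two-table document, just enough to have captions to search for.
soup = BeautifulSoup(
    "<table><caption>A</caption></table>"
    "<table><caption>B</caption></table>",
    "html.parser",
)

with warnings.catch_warnings():
    # Silence a possible deprecation warning for the BS3-style name.
    warnings.simplefilter("ignore")
    old_style = soup.findAll("caption")

new_style = soup.find_all("caption")

# Both spellings should find the same captions, in the same order.
same = [c.get_text() for c in old_style] == [c.get_text() for c in new_style]
print(same)
```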
There is also a nicer way to get the list of singles: rely on the span element with id="Singles", which marks the start of the Singles section. Use find_next_sibling() to get the first table after the span tag's parent, then get all th elements with scope="row":
from bs4 import BeautifulSoup
import requests

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.content)

# The span sits inside the section heading; the table is a later sibling of that heading
table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)
Prints:
"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"
Upvotes: 3
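The span-based lookup from the answer above can be checked without a network request. The following is a minimal sketch, assuming a made-up HTML fragment that imitates a heading containing a span with id="Singles", followed by a paragraph and then the table; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: section headings wrap a span carrying the section id.
HTML = """
<h2><span id="Albums">Albums</span></h2>
<table><tr><th scope="row">Red</th></tr></table>
<h2><span id="Singles">Singles</span></h2>
<p>Some introductory text.</p>
<table>
  <tr><th scope="col">Title</th></tr>
  <tr><th scope="row">"Mine"</th></tr>
  <tr><th scope="row">"Mean"</th></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Step from the span up to its heading, then forward to the next sibling table.
heading = soup.find("span", id="Singles").parent
table = heading.find_next_sibling("table")

titles = [th.get_text(strip=True) for th in table.find_all("th", scope="row")]
print(titles)
```

find_next_sibling("table") skips the intervening paragraph and whitespace, so it still lands on the singles table rather than on the albums table that precedes the heading.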