UdaraW

Reputation: 98

Obtaining column from wikipedia table using beautifulsoup

import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text, 'html.parser')
tables = soup.find_all("table")

I'm trying to get a list of song names from the "List of singles" table on Taylor Swift's discography page.

The table has no unique class or id. The only unique thing I can think of is the caption tag around "List of singles..."

List of singles as main artist, with selected chart positions, sales figures and certifications

I tried:

table = soup.find_all("caption")

but it returns nothing. I'm assuming that caption is not a recognized tag in bs4?

Upvotes: 3

Views: 1906

Answers (2)

Hooked

Reputation: 88118

Here is a complete example that solves the "Taylor Swift problem". First, look for the caption that contains the text "List of singles" and move to its parent object. Next, iterate over the items that contain the text you are looking for:

for caption in soup.findAll("caption"):
    if "List of singles" in caption.text:
        break

# The caption is a direct child of the table it labels.
table = caption.parent
for item in table.findAll("th", {"scope": "row"}):
    print(item.text)

This gives:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
...
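
The same caption-based lookup can be tried offline. The sketch below is not the code above verbatim: the HTML is a made-up stand-in for the Wikipedia page, and rows_under_caption is a hypothetical helper name. It also returns an empty list when no caption matches, instead of silently falling through to the last caption as the for/break loop does.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables, only one has the caption we want.
SAMPLE_HTML = """
<table>
  <caption>List of studio albums</caption>
  <tr><th scope="row">Fearless</th><td>2008</td></tr>
</table>
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"Tim McGraw"</th><td>2006</td></tr>
  <tr><th scope="row">"Our Song"</th><td>2007</td></tr>
</table>
"""


def rows_under_caption(html, caption_text):
    """Return the text of every <th scope="row"> in the first table
    whose <caption> contains caption_text, or [] if none matches."""
    soup = BeautifulSoup(html, "html.parser")
    for caption in soup.find_all("caption"):
        if caption_text in caption.get_text():
            table = caption.find_parent("table")
            return [th.get_text(strip=True)
                    for th in table.find_all("th", scope="row")]
    return []


titles = rows_under_caption(SAMPLE_HTML, "List of singles")
print(titles)
```

On this sample only the two rows of the singles table are returned; the header cells (scope="col") and the albums table are skipped.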

Upvotes: 1

alecxe

Reputation: 473813

It actually has nothing to do with findAll() versus find_all(). findAll() was used in BeautifulSoup3 and was kept in BeautifulSoup4 for compatibility reasons. Quoting the bs4 source code:

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    generator = self.descendants
    if not recursive:
        generator = self.children
    return self._find_all(name, attrs, text, limit, generator, **kwargs)

findAll = find_all       # BS3
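
A quick way to check this for yourself is to run both spellings against the same document. This is a minimal sketch on a made-up two-table snippet; it assumes a bs4 release that still ships the findAll alias, and it silences warnings because newer releases may flag the old name as deprecated.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical snippet, just enough markup to have two captions.
html = ("<table><caption>Albums</caption></table>"
        "<table><caption>Singles</caption></table>")
soup = BeautifulSoup(html, "html.parser")

with warnings.catch_warnings():
    # Newer bs4 releases may emit a DeprecationWarning for findAll.
    warnings.simplefilter("ignore")
    old_style = [c.get_text() for c in soup.findAll("caption")]

new_style = [c.get_text() for c in soup.find_all("caption")]
print(old_style == new_style)
```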

There is also a nicer way to get the list of singles: rely on the span element with id="Singles" that marks the start of the Singles section. Then use find_next_sibling() to get the first table after the span tag's parent, and collect all th elements with scope="row":

from bs4 import BeautifulSoup
import requests


source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.content, 'html.parser')

table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)

Prints:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"
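
In the output above, credits such as "(featuring The Civil Wars)" land on their own line because the th cell contains a line break. If one string per single is preferred, get_text() accepts a separator and a strip flag. The sketch below shows this on a made-up fragment that only imitates the page layout (a heading span, an intervening paragraph, then the table); the markup is an assumption, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page layout described above.
SAMPLE_HTML = """
<h3><span id="Albums">Albums</span></h3>
<table><tr><th scope="row">Red</th></tr></table>
<h3><span id="Singles">Singles</span></h3>
<p>Some intro text.</p>
<table>
  <tr><th scope="row">"Safe &amp; Sound"<br><span>(featuring The Civil Wars)</span></th></tr>
  <tr><th scope="row">"Eyes Open"</th></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The span sits inside the heading; the table is a later sibling of that heading.
heading = soup.find("span", id="Singles").parent
table = heading.find_next_sibling("table")

# Join the text pieces of each cell with a space and trim surrounding whitespace.
singles = [th.get_text(" ", strip=True)
           for th in table.find_all("th", scope="row")]
print(singles)
```

Note that find_next_sibling('table') skips the paragraph between the heading and the table, and never looks backwards at the albums table.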

Upvotes: 3
