Luis Miguel

Reputation: 5127

Parsing data using BeautifulSoup

I am learning BS4, and I am trying to scrape several tables, lists, etc. from popular sites to familiarize myself with the syntax. I am having a hard time getting a list in the right format. This is the code:

from bs4 import BeautifulSoup
import urllib2

headers = {
  'Connection': 'keep-alive',
  'Cache-Control': 'no-cache',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
  'Pragma': 'no-cache',
  'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
  'Accept-Language': 'en-US,en;q=0.8'
}

url = 'https://www.yahoo.com'

req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
terms = soup.find('ol').get_text()
print terms

Which prints the following list:

1Amanda Knox2Meagan Good3Dog the Bounty Hunter4Adrienne Bailon5Powerball winner6Gillian Anderson7Catherine Zeta-Jones8Mickey Rourke9Halle Berry10Lake Tahoe hotels

The terms are separated only by their rank numbers, which means extra work to parse the output so that it reads like "Amanda Knox", "Meagan Good", etc.

Since I am not very familiar with BS4, is there a way to get just the terms that follow the "title=" attribute when I define terms?

Upvotes: 2

Views: 427

Answers (1)

alecxe

Reputation: 473763

This happens because the ol tag contains several nested elements, and get_text() concatenates the text of every descendant tag, by default with no separator between the pieces.
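To see the effect in isolation, here is a minimal sketch. The markup is made up for illustration (the real page's structure is only assumed to be similar), and the variable names are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a rank number and a link inside each list item.
html = (
    '<ol>'
    '<li><span>1</span><a href="#">Amanda Knox</a></li>'
    '<li><span>2</span><a href="#">Meagan Good</a></li>'
    '</ol>'
)

soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

# Default get_text(): all descendant strings are glued together.
joined = ol.get_text()
print(joined)

# Passing a separator shows where each piece of text comes from.
separated = ol.get_text(separator="|")
print(separated)

# Asking each link for its own text skips the rank numbers entirely.
terms = [a.get_text() for a in ol.find_all("a")]
print(terms)
```

The separator only makes the boundaries visible; the rank numbers are still part of the result, so selecting the links themselves is usually the cleaner route.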

Instead, you can use a CSS selector to get the actual terms:

for li in soup.select('ol.trendingnow_trend-list > li > a'):
    print(li.get_text())

Prints:

Hope Solo
Christy Mack
Dog the Bounty Hunter
Adrienne Bailon
Powerball winner
Catherine Zeta-Jones
Mickey Rourke
Valerie Velardi
Halle Berry
Lake Tahoe hotels

The CSS selector ol.trendingnow_trend-list > li > a matches every a tag that is a direct child of an li tag, which in turn is a direct child of the ol tag with the class trendingnow_trend-list.
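As a rough illustration of what the > combinator does, the sketch below runs the same selector against made-up markup. Only the class name comes from the selector above; the second list, the nested link, and the title attribute are assumptions added for the demonstration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name is taken from the real selector.
html = (
    '<div>'
    '<ol class="trendingnow_trend-list">'
    '<li><span>1</span><a href="#" title="Hope Solo">Hope Solo</a></li>'
    '<li><span>2</span><p><a href="#">Nested link</a></p></li>'
    '</ol>'
    '<ol class="other"><li><a href="#">Unrelated</a></li></ol>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")

# '>' requires a direct parent/child relationship at every step,
# so the link wrapped in <p> and the link in the other list are skipped.
direct = [a.get_text() for a in soup.select("ol.trendingnow_trend-list > li > a")]
print(direct)

# A plain space matches descendants at any depth inside the list.
anywhere = [a.get_text() for a in soup.select("ol.trendingnow_trend-list a")]
print(anywhere)

# If the links carry a title attribute (assumed here), it can be read with .get().
titles = [a.get("title") for a in soup.select("ol.trendingnow_trend-list > li > a")]
print(titles)
```

Using .get("title") rather than a["title"] returns None instead of raising KeyError when a link has no such attribute.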

FYI, this retrieves the list of Trending Now terms from the block at the top right of the page.

Upvotes: 4
