I am learning BS4, and I am trying to scrape several tables, lists, etc. from popular sites to familiarize myself with the syntax. I am having a hard time getting a list in the right format. This is the code:
from bs4 import BeautifulSoup
import urllib2
import requests
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Pragma': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.8'
}
url = 'https://www.yahoo.com'
req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
terms = soup.find('ol').get_text()
print terms
Which prints the following list:
1Amanda Knox2Meagan Good3Dog the Bounty Hunter4Adrienne Bailon5Powerball winner6Gillian Anderson7Catherine Zeta-Jones8Mickey Rourke9Halle Berry10Lake Tahoe hotels
The terms are separated by numbers, which adds an extra step: I have to parse the output so it reads like "Amanda Knox", "Meagan Good", etc.
Since I am not very familiar with BS4, is there a way to get the terms from the "title=" attribute inside my definition of terms?
Reputation: 473763
This is because there are multiple elements inside the ol tag, and get_text() joins the text of every tag inside it, including the rank numbers.
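A minimal sketch of this behaviour, written for Python 3 and using made-up markup rather than Yahoo's actual page (the span/a layout inside each li is an assumption for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each list item holds a rank number and a link.
html = """
<ol>
  <li><span>1</span><a href="#">Amanda Knox</a></li>
  <li><span>2</span><a href="#">Meagan Good</a></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

# get_text() concatenates every text node under <ol>, numbers included.
joined = ol.get_text(strip=True)
print(joined)  # 1Amanda Knox2Meagan Good

# stripped_strings yields the same pieces one at a time, which makes it
# easier to see where the numbers come from.
pieces = list(ol.stripped_strings)
print(pieces)  # ['1', 'Amanda Knox', '2', 'Meagan Good']
```

Calling get_text() on the container gives one flat string; to keep the terms apart, the text has to be read from the individual elements that hold them.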
Instead, you can use a CSS selector to get the actual terms:
for li in soup.select('ol.trendingnow_trend-list > li > a'):
    print li.get_text()
Prints:
Hope Solo
Christy Mack
Dog the Bounty Hunter
Adrienne Bailon
Powerball winner
Catherine Zeta-Jones
Mickey Rourke
Valerie Velardi
Halle Berry
Lake Tahoe hotels
The ol.trendingnow_trend-list > li > a CSS selector matches every a tag that is a direct child of an li tag that is itself a direct child of the ol tag with the trendingnow_trend-list class.
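To see the effect of the > child combinator in isolation, here is a small Python 3 sketch against made-up markup. Only the class name comes from the selector above; the rest of the HTML, including the nested div and the second list, is hypothetical and added just to show what does not match:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name is taken from the selector above.
html = """
<ol class="trendingnow_trend-list">
  <li><span>1</span><a href="#">Amanda Knox</a></li>
  <li><span>2</span><a href="#">Meagan Good</a></li>
  <li><div><a href="#">nested link</a></div></li>
</ol>
<ol class="other-list">
  <li><a href="#">Unrelated</a></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" is the child combinator: only <a> tags that are direct children of an
# <li>, which is itself a direct child of the matching <ol>, are selected.
terms = [a.get_text() for a in soup.select("ol.trendingnow_trend-list > li > a")]
print(terms)  # ['Amanda Knox', 'Meagan Good']
```

The link wrapped in a div is skipped because it is not a direct child of its li, and the link in the second list is skipped because its ol lacks the class.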
FYI, this gets the list of Trending Now terms from the block at the top right of the page.