alphamonkey

Reputation: 249

Python BeautifulSoup webcrawling getting text tag inside link

I need to get the information within the <b> tags for each website.

response = requests.get(href)
soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
tempWeekend = []
print(soup.findAll('b'))

The soup.findAll('b') line prints every <b> tag on the page. How can I limit it to just the dates that I want?

The website is http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm, under the weekend tab.

Upvotes: 0

Views: 173

Answers (5)

One

Reputation: 38

I would try something like:

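dates = []
# soup is the parsed page; href=True skips <a> tags without an href (avoids a KeyError)
for a in soup.find_all('a', href=True):
    if '?yr=' in a['href']:  # adjust this substring to match the site's date links
        dates.append(a.get_text())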

Upvotes: 0

xrisk

Reputation: 3898

Why not search for all the b tags, and choose the ones which contain a month?

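import requests
from bs4 import BeautifulSoup

url = 'http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm'
s = requests.get(url).content

soup = BeautifulSoup(s, "lxml")  # or BeautifulSoup(s, "html5lib")
# A set, so only whole abbreviations match (a substring test would also accept "A" or "")
months = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}
dates = []
for b in soup.find_all('b'):
    words = b.text.split()
    # Skip empty <b> tags; keep those whose first word is a month abbreviation
    if words and words[0].upper() in months:
        dates.append(b.text)

print(dates)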

(Note: I did not check the exact abbreviations that the website uses. Please check them first and modify the code accordingly.)

Upvotes: 1

René Fleschenberg

Reputation: 2548

It is often easiest to search using CSS selectors, e.g.

soup.select('table.chart-wide > tr > td > nobr > font > a > b')

Upvotes: 2

dstudeba

Reputation: 9038

Looking at that page, it doesn't have any divs, classes, or ids, which makes it tough. The only pattern I could see was that the <b> tag directly before the dates was <b>Date:</b>. I would iterate over the <b> tags and start collecting them once I hit the one with Date in it.
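
A minimal sketch of that idea, run against a small made-up HTML snippet rather than the real page. The markup, the hrefs, and the date strings below are assumptions for illustration only; the real page may well contain further <b> tags after the dates, so you would probably need a stop condition too.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the real markup will differ.
html = """
<table>
<tr><td><b>Title</b></td></tr>
<tr><td><b>Date:</b></td></tr>
<tr><td><a href="/weekend/chart/?yr=2013&amp;wknd=47"><b>Nov 22-24</b></a></td></tr>
<tr><td><a href="/weekend/chart/?yr=2013&amp;wknd=48"><b>Nov 29-Dec 1</b></a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

dates = []
seen_marker = False  # flips to True once the <b>Date:</b> tag has been passed
for b in soup.find_all("b"):
    text = b.get_text(strip=True)
    if seen_marker:
        # Every <b> after the marker is collected
        dates.append(text)
    elif text.startswith("Date"):
        seen_marker = True

print(dates)  # ['Nov 22-24', 'Nov 29-Dec 1']
```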

Upvotes: 0

Lawrence Benson

Reputation: 1406

Sadly, if the tags are not further identified, there is no way to select specific ones. How should BeautifulSoup be able to distinguish between them? If you know roughly what to expect in the tags you need, you could iterate over all of them and check whether they match:

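def find_b(soup, whatever):
    # Return the first <b> tag whose text equals `whatever`, or None if none matches
    for b in soup.find_all('b'):
        if b.get_text(strip=True) == whatever:
            return b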

or something like that...

Or you could get the surrounding tags, i.e. 'a' in your example, check whether that matches, and then get the next occurrence of 'b'.
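
Roughly, that second approach could look like the sketch below. It runs on a made-up snippet, and the href pattern (a `yr=` query parameter) is only an assumption about what the date links look like; check the real hrefs and adjust the regular expression. It uses a.find('b'), which only looks inside the link itself.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the real markup and hrefs will differ.
html = """
<p><b>Weekend</b></p>
<a href="/weekend/chart/?yr=2013&amp;wknd=47"><b>Nov 22-24</b></a>
<a href="/about/"><b>About</b></a>
<a href="/weekend/chart/?yr=2013&amp;wknd=48"><b>Nov 29-Dec 1</b></a>
"""

soup = BeautifulSoup(html, "html.parser")

dates = []
# Keep only links whose href carries a yr= query parameter (assumed pattern)
for a in soup.find_all("a", href=re.compile(r"[?&]yr=")):
    b = a.find("b")  # first <b> nested inside this link, or None
    if b is not None:
        dates.append(b.get_text(strip=True))

print(dates)  # ['Nov 22-24', 'Nov 29-Dec 1']
```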

Upvotes: 1
