Reputation: 249

Python BeautifulSoup webcrawling getting text tag inside link

I need to get the information within the "<b>" tags for each website.

response = requests.get(href)
soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
tempWeekend = []
print(soup.findAll('b'))

The soup.findAll('b') line prints all the b tags on the page. How can I limit it to just the dates that I want?

The website is http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm, under the weekend tab.

Upvotes: 0

Answers (5)

One

Reputation: 38

I would try something like:

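dates = []
all_a = soup.find_all('a', href=True)  # skip anchors that have no href
for a in all_a:
    if '?yr=' in a['href']:  # assumes the date links carry a ?yr= query parameter
        dates.append(a.get_text())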

Upvotes: 0

xrisk

Reputation: 3898

Why not search for all the b tags, and choose the ones which contain a month?

import requests
from bs4 import BeautifulSoup

s = requests.get('http://www.boxofficemojo.com/movies/?page=weekend&id=catchingfire.htm').content

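soup = BeautifulSoup(s, "lxml")  # or BeautifulSoup(s, "html5lib")
months = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}
dates = []
for i in soup.find_all('b'):
    words = i.text.split()
    # Compare the first three letters of the first word with the month abbreviations
    if words and words[0][:3].upper() in months:
        dates.append(i.text)

print(dates)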

(Note: I did not check the exact abbreviations that the website uses. Please check these first and modify the code accordingly.)

Upvotes: 1

René Fleschenberg

Reputation: 2548

It is often easiest to search using CSS selectors, e.g.

soup.select('table.chart-wide > tr > td > nobr > font > a > b')
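A minimal sketch of how that selector could be used to pull out the text, run against a small made-up HTML fragment rather than the live page. The markup, the link URLs, and the sample dates are assumptions for illustration only. The `>` child combinators only match if the parser does not insert a `<tbody>` element; html5lib does insert one, so with that parser the `table > tr` step would need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the nesting the selector expects;
# the real page's markup may differ.
html = """
<table class="chart-wide">
  <tr><td><nobr><font><a href="/weekend/chart/?yr=2013&amp;wknd=47"><b>Nov. 22-24</b></a></font></nobr></td></tr>
  <tr><td><nobr><font><a href="/weekend/chart/?yr=2013&amp;wknd=48"><b>Nov. 29-Dec. 1</b></a></font></nobr></td></tr>
  <tr><td><b>Not a date</b></td></tr>
</table>
"""

# html.parser is used here because it ships with Python and adds no <tbody>.
soup = BeautifulSoup(html, "html.parser")

selector = "table.chart-wide > tr > td > nobr > font > a > b"
# select() returns the matching Tag objects; get_text() extracts their text.
dates = [b.get_text(strip=True) for b in soup.select(selector)]
print(dates)
```

The third row is skipped because its `<b>` is not nested inside the `nobr > font > a` chain.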

Upvotes: 2

dstudeba

Reputation: 9038

Looking at that page, it doesn't have any divs or class or id attributes, which makes it tough. The only pattern I could see was that the <b> tag directly before the dates was "Date:". I would iterate over the <b> tags and then collect the tags after I hit the one with "Date" in it.
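A rough sketch of that idea, assuming the tags in question are `<b>` tags and that a "Date:" label really does precede the dates. The HTML fragment and the helper name `b_texts_after_marker` are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a "Date:" label followed by the date cells.
html = """
<table>
  <tr><td><b>Catching Fire</b></td></tr>
  <tr><td><b>Date:</b></td><td><b>Nov. 22-24</b></td><td><b>Nov. 29-Dec. 1</b></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def b_texts_after_marker(soup, marker="Date"):
    """Collect the text of every <b> tag after the first one containing `marker`."""
    collected = []
    seen_marker = False
    for b in soup.find_all("b"):
        text = b.get_text(strip=True)
        if seen_marker:
            collected.append(text)
        elif marker in text:
            # The marker tag itself is not collected, only the tags after it.
            seen_marker = True
    return collected


dates_after_label = b_texts_after_marker(soup)
print(dates_after_label)
```

On a full page this would also pick up every later `<b>` tag, so some stop condition (for example, the next label) would probably be needed.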

Upvotes: 0

Lawrence Benson

Reputation: 1406

Sadly, if the tags are not further identified, there is no way to select specific ones. How should BeautifulSoup be able to distinguish between them? If you know roughly what to expect in the tags you need, you could iterate over all of them and check whether they match:

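def first_matching_b(soup, whatever):
    # Return the first <b> tag whose text equals `whatever`, or None if there is none
    for b in soup.find_all('b'):
        if b.get_text(strip=True) == whatever:
            return b
    return None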

or something like that...

Or you could get the surrounding tags, i.e. 'a' in your example, check whether that matches, and then get the next occurrence of 'b'.
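A small sketch of the surrounding-tag approach: filter the `a` tags by their `href`, then take the `b` inside each match. The sample HTML and the `/weekend/chart/` URL pattern are assumptions for illustration; the real links would need to be checked on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only one link points at a weekend chart.
html = """
<p><a href="/movies/?id=catchingfire.htm"><b>Summary</b></a></p>
<p><a href="/weekend/chart/?yr=2013&amp;wknd=47"><b>Nov. 22-24</b></a></p>
<p><a href="/about/">About</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

link_dates = []
# href=True restricts the search to anchors that actually have an href.
for a in soup.find_all("a", href=True):
    if "/weekend/chart/" in a["href"]:  # assumed URL pattern
        b = a.find("b")  # first <b> nested inside this link, or None
        if b is not None:
            link_dates.append(b.get_text(strip=True))

print(link_dates)
```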

Upvotes: 1
