user7019687

Getting value from tag with BeautifulSoup

I'm trying to scrape movie information from the info box on Wikipedia using BeautifulSoup. I'm having trouble scraping movie budgets, as below.

For example, I want to scrape the '$25 million' budget value from the info box. How can I get the budget value, given that neither the th nor the td tags are unique? (See the example HTML.)

Say I have tag = soup.find('th') with the value <th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th> - How can I get the value of '$25 million' from tag?

I thought I could do something like tag.td or tag.text, but neither of these works for me.

Do I have to loop over all tags and check if their text is equal to 'Budget', and if so get the following cell?

Example HTML Code:

<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td>
</tr>

Upvotes: 1

Views: 3632

Answers (4)

niraj

Reputation: 18208

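Another possible way, assuming the label and its value end up on separate lines of the page text:

split_text = soup.get_text().split('\n')
# The entry right after 'Budget' is the budget value (citation marker included)
split_text[split_text.index('Budget')+1]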

Upvotes: 0

akuiper

Reputation: 214927

You can first find the th node whose text is Budget, then find its next sibling td and get the text from that node:

soup.find("th", text="Budget").find_next_sibling("td").get_text()
# u'$25 million[2]'
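
To make this concrete, here is a minimal sketch run against the example HTML from the question. The helper name infobox_value and the surrounding table tag are my own additions for illustration, and string= is the newer spelling of the text= argument (available in Beautiful Soup 4.4.0 and later). It also drops the citation marker so you get '$25 million' rather than '$25 million[2]'.

```python
from bs4 import BeautifulSoup

# Example HTML from the question, wrapped in a <table> for illustration.
html = """
<table>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td>
</tr>
</table>
"""


def infobox_value(soup, label):
    """Hypothetical helper: return the text of the td that follows the th named `label`."""
    th = soup.find("th", string=label)
    if th is None:
        return None
    td = th.find_next_sibling("td")
    if td is None:
        return None
    # Remove citation markers such as [2]; note that this modifies the soup in place.
    for sup in td.find_all("sup", class_="reference"):
        sup.decompose()
    return td.get_text(strip=True)


soup = BeautifulSoup(html, "html.parser")
budget = infobox_value(soup, "Budget")
box_office = infobox_value(soup, "Box office")
print(budget)
print(box_office)
```

Returning None when the label is missing avoids an AttributeError on pages whose info box has no Budget row.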

Upvotes: 2

Wenlong Liu

Reputation: 444

What you need is the find_all() method in BeautifulSoup.

For example:

    tdTags = soup.find_all('td',{'class':'reference'})

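This finds all 'td' tags whose class is 'reference'. (In the example HTML above, the 'reference' class is on the sup tags rather than the td tags, so adjust the tag name and attribute to match your page.)

You can select whichever td tags you want, as long as they have an attribute that identifies them uniquely.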

Then you can do a for loop to find the content, as @Bijoy said.

Upvotes: 0

Bijoy

Reputation: 1131

To get every amount in the <td> tags, you should use

tags = soup.find_all('td')

and then

for tag in tags:
    print(tag.get_text())  # prints each cell's text, e.g. '$25 million[2]'
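
If you want to know which amount belongs to which label, one option is to loop over the rows instead of the cells and build a dictionary. This is only a sketch against the example HTML from the question: the table wrapper and the infobox variable name are assumptions of mine, and it takes only each cell's direct text so the nested citation marker is left out.

```python
from bs4 import BeautifulSoup

# Example HTML from the question, wrapped in a <table> for illustration.
html = """
<table>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

infobox = {}
for row in soup.find_all("tr"):
    th = row.find("th")
    td = row.find("td")
    # Skip rows that do not have both a label cell and a value cell.
    if th is None or td is None:
        continue
    label = th.get_text(strip=True)
    # Join only the td's direct text children, skipping text nested in <sup>.
    value = "".join(td.find_all(string=True, recursive=False)).strip()
    infobox[label] = value

print(infobox["Budget"])
print(infobox["Box office"])
```

On a real Wikipedia page some values contain links or line breaks inside the cell, so taking only the direct text may drop part of the value; in that case td.get_text() followed by stripping the bracketed markers may suit better.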

Upvotes: 0
