Reputation:
I'm trying to scrape movie information from the info box on Wikipedia using BeautifulSoup. I'm having trouble scraping movie budgets, as below.
For example, I want to scrape the '$25 million' budget value from the info box. How can I get the budget value, given that neither the th nor the td tags are unique? (See the example HTML.)
Say I have tag = soup.find('th') with the value
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
How can I get the value '$25 million' from tag?
I thought I could do something like tag.td or tag.text, but neither of these is working for me.
Do I have to loop over all tags and check if their text is equal to 'Budget', and if so get the following cell?
Example HTML Code:
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td>
</tr>
Upvotes: 1
Views: 3632
Reputation: 18208
Another possible way might be:
split_text = soup.get_text().split('\n')
# The next index from Budget is cost
split_text[split_text.index('Budget')+1]
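As a rough sketch of this approach on the example rows from the question (the html string and the wrapping <table> are assumptions added for illustration). Note that it relies on the th and td cells sitting on separate lines in the markup, and that the citation marker stays in the result:

```python
from bs4 import BeautifulSoup

# Example rows from the question, wrapped in a <table> for illustration.
html = """<table>
<tr>
<th scope="row">Budget</th>
<td>$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td>
</tr>
<tr>
<th scope="row">Box office</th>
<td>$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
split_text = soup.get_text().split("\n")

# Guard against pages where no line is exactly 'Budget'.
if "Budget" in split_text:
    budget = split_text[split_text.index("Budget") + 1]
else:
    budget = None

print(budget)  # prints: $25 million[2]
```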
Upvotes: 0
Reputation: 214927
You can first find the th node whose text is Budget, then find its next sibling td and get the text from that node:
soup.find("th", text="Budget").find_next_sibling("td").get_text()
# u'$25 million[2]'
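A self-contained sketch of the same lookup, run against the example rows from the question. The helper name infobox_value is hypothetical, and dropping the <sup class="reference"> footnote marker before reading the text is an assumption about what is wanted (it yields '$25 million' rather than '$25 million[2]'):

```python
from bs4 import BeautifulSoup

# Example rows from the question, wrapped in a <table> for illustration.
HTML = """<table>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td>
</tr>
</table>"""


def infobox_value(soup, label):
    """Return the text of the td following the th whose text equals label."""
    th = soup.find("th", string=label)
    if th is None:
        return None
    td = th.find_next_sibling("td")
    if td is None:
        return None
    # Remove footnote markers such as [2] (assumed to be unwanted).
    for sup in td.find_all("sup", class_="reference"):
        sup.decompose()
    return td.get_text(strip=True)


soup = BeautifulSoup(HTML, "html.parser")
print(infobox_value(soup, "Budget"))      # prints: $25 million
print(infobox_value(soup, "Box office"))  # prints: $65.7 million
```

Returning None for a missing label avoids the AttributeError that the one-liner raises when a page has no Budget row.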
Upvotes: 2
Reputation: 444
What you need is the find_all() method in BeautifulSoup.
For example:
tdTags = soup.find_all('td',{'class':'reference'})
This finds all 'td' tags whose class is 'reference'.
You can select whichever td tags you want, as long as the expected td tags have a unique attribute to filter on.
Then you can loop over the results to get the content, as @Bijoy said.
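A sketch of the loop idea on the example rows from the question. Since the td tags there carry no unique attribute, this variant loops over the tr rows instead and pairs each th label with its td; the dictionary name info is illustrative only:

```python
from bs4 import BeautifulSoup

# Example rows from the question, wrapped in a <table> for illustration.
HTML = """<table>
<tr>
<th scope="row">Budget</th>
<td>$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td>
</tr>
<tr>
<th scope="row">Box office</th>
<td>$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td>
</tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")

# Map each row label (th text) to its value (td text).
info = {}
for row in soup.find_all("tr"):
    th = row.find("th")
    td = row.find("td")
    if th is None or td is None:
        continue  # skip rows that are not label/value pairs
    info[th.get_text(strip=True)] = td.get_text(strip=True)

print(info["Budget"])  # prints: $25 million[2]
```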
Upvotes: 0
Reputation: 1131
To get every amount in the <td> tags, you should use
tags = soup.find_all('td')
and then
for tag in tags:
    print(tag.get_text())  # prints the cell text, e.g. '$25 million[2]'
Upvotes: 0