Reputation: 109
I would like to extract just the item weight and the product dimensions from "content" below. My script does not find the values I am looking for. What am I missing, and is there a simpler way to extract just these two fields? Thanks
import bs4 as bs
content = '''
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base prodDetAttrValue">
0.16 ounces
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base prodDetAttrValue">
4.8 x 3.4 x 0.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Required?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
'''
soup = bs.BeautifulSoup(content, features='lxml')
try:
    product = {
        'weight': soup.find(text='Item Weight').parent.find_next_siblings(),
        'dimension': soup.find(text='Product Dimensions').parent.find_next_siblings()
    }
except:
    product = {
        'weight': 'item unavailable',
        'dimension': 'item unavailable'
    }
print(product)
Output:
{'weight': 'item unavailable', 'dimension': 'item unavailable'}
Upvotes: 0
Views: 70
Reputation: 2692
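First of all, if you want the immediate next sibling, use .find_next_sibling() instead of .find_next_siblings(), which returns a list of all following siblings. The reason the lookup fails is how the text inside the tags is stored: each string includes the surrounding newlines, so text='Item Weight' matches nothing, soup.find() returns None, and the bare except hides the resulting AttributeError. If you do: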
print([each_th.text for each_th in soup.find_all('th')])
You will see that the result would look like this:
['\nItem Weight\n', '\nProduct Dimensions\n', '\nBatteries Included?\n', '\nBatteries Required?\n']
So, you need to change text='Item Weight' to text='\nItem Weight\n' and so on:
try:
    product = {
        'weight': soup.find(text='\nItem Weight\n').parent.find_next_sibling().text,
        'dimension': soup.find(text='\nProduct Dimensions\n').parent.find_next_sibling().text
    }
# soup.find() returns None when nothing matches, so .parent raises AttributeError
except AttributeError:
    product = {
        'weight': 'item unavailable',
        'dimension': 'item unavailable'
    }
This will give:
{'weight': '\n0.16 ounces\n', 'dimension': '\n4.8 x 3.4 x 0.5 inches\n'}
Now if you want to remove those newline characters, you can use either .replace('\n', '') or .strip() when grabbing the text.
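The two clean-up options are not quite equivalent, and a short sketch makes the difference visible. The sample string below is an assumption for illustration: it is the value from the question with some indentation added, as real pages often have.

```python
# Hypothetical cell text: the value from the question, with indentation added
raw = '\n    4.8 x 3.4 x 0.5 inches\n'

# .replace('\n', '') removes only newline characters; the indentation survives
no_newlines = raw.replace('\n', '')

# .strip() removes all leading and trailing whitespace, including spaces and tabs
stripped = raw.strip()

print(repr(no_newlines))
print(repr(stripped))
```

For scraped cell text, .strip() is usually the safer of the two, since it also handles spaces and tabs around the value.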
Upvotes: 1
Reputation:
You're using find_next_siblings() incorrectly. The td tag is a sibling of the th tag, not of the parent tr tag.
from bs4 import BeautifulSoup
import re
content = '''
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base prodDetAttrValue">
0.16 ounces
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base prodDetAttrValue">
4.8 x 3.4 x 0.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
'''
soup = BeautifulSoup(content, 'html.parser')
d = {
    # Raw strings keep \s as a regex token instead of an invalid string escape
    'weight': soup.find('th', text=re.compile(r'\s*Item Weight\s*')).find_next_sibling('td').text.strip(),
    'dimension': soup.find('th', text=re.compile(r'\s*Product Dimensions\s*')).find_next_sibling('td').text.strip()
}
print(d)
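On the "simpler way" part of the question, one possible approach is to read every row into a dictionary once and then pick out the fields you need. This is only a sketch under the assumption that each tr holds exactly one th label and one td value, as in the sample markup; the names html and details are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumed structure: one th/td pair per row)
html = '''
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base prodDetAttrValue">
0.16 ounces
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base prodDetAttrValue">
4.8 x 3.4 x 0.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
'''

soup = BeautifulSoup(html, 'html.parser')

# Map each stripped label to its stripped value, skipping rows that lack either cell
details = {}
for row in soup.find_all('tr'):
    th = row.find('th')
    td = row.find('td')
    if th is not None and td is not None:
        details[th.get_text(strip=True)] = td.get_text(strip=True)

# dict.get supplies the fallback, so no try/except is needed
weight = details.get('Item Weight', 'item unavailable')
dimension = details.get('Product Dimensions', 'item unavailable')
print({'weight': weight, 'dimension': dimension})
```

Because the labels are stripped before they become keys, the surrounding newlines no longer matter, and a missing field falls back to the default instead of raising.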
Upvotes: 1