Reputation: 253
I am trying to write a for loop to retrieve some data and I am currently stuck. For each table row, I need the text of the second <td> when the first <td> contains "Primary NAICS Code":
<td class="col_left"><strong>Primary NAICS Code</strong></td>
<td align="left">
311811 : Retail Bakeries
</td>
My for loop, which is obviously not working, looks like this:
for i, elem in enumerate(all_trs):
    inside_td = elem.find("td")
    if "NAICS" in inside_td.text:
        inside_td = elem.find("td")
        print(inside_td.text)
Really appreciate any help I could get. Thank you very much in advance.
Upvotes: 0
Views: 821
Reputation: 53623
Untested, but instead of:
for i, elem in enumerate(all_trs):
    inside_td = elem.find("td")
    if "NAICS" in inside_td.text:
        inside_td = elem.find("td")
        print(inside_td.text)
Try this:
for i, elem in enumerate(all_trs):
    td_elems = elem.findAll('td')
    if 'NAICS' in td_elems[0].text:
        print(td_elems[1].text)
The findAll method (spelled find_all in BeautifulSoup 4) returns a list of td elements, so just get a handle on this sequence and index into it :)
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
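The find method returns only the first td element, so it is basically equivalent to findAll('td')[0], except that find returns None when nothing matches instead of raising an IndexError. That is why calling elem.find("td") a second time just gives you the first cell again.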
find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
Return only the first child of this Tag matching the given criteria.
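Here is a self-contained sketch of the indexing approach above. It assumes BeautifulSoup 4 (the bs4 package) with the built-in html.parser; the sample markup and the naics_value helper are made up for illustration and are not from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the snippet in the question.
html = """
<table>
<tr>
<td class="col_left"><strong>Company Name</strong></td>
<td align="left">Example Bakery</td>
</tr>
<tr>
<td class="col_left"><strong>Primary NAICS Code</strong></td>
<td align="left">
311811 : Retail Bakeries
</td>
</tr>
</table>
"""


def naics_value(markup):
    """Return the text of the cell that follows the NAICS label cell, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")  # all cells in this row, in document order
        # Guard against rows with fewer than two cells before indexing.
        if len(tds) >= 2 and "NAICS" in tds[0].get_text():
            return tds[1].get_text(strip=True)
    return None


print(naics_value(html))
```

The len(tds) check is an addition for safety: rows with a single cell (or header rows with no td at all) would otherwise raise an IndexError on tds[1].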
Upvotes: 1
Reputation: 57033
It is the second next sibling of the <td> that contains the string of interest (the immediate next sibling is a line break):
import re
...
soup.body.findAll('td', text=re.compile('Primary NAICS Code'))[0]\
    .next_sibling.next_sibling
#<td align="left">
#
# 311811 : Retail Bakeries
# </td>
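As a variation on the sibling idea, the sketch below uses find_next_sibling("td") so the whitespace text node between the two cells does not have to be skipped by hand. It assumes BeautifulSoup 4.4 or later (for the string= argument) and the built-in html.parser; the sample markup is made up to mirror the question:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the snippet in the question.
html = """<table><tr>
<td class="col_left"><strong>Primary NAICS Code</strong></td>
<td align="left">
311811 : Retail Bakeries
</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label text itself, then climb to the <td> that encloses it.
label = soup.find(string=re.compile("Primary NAICS Code"))
label_td = label.find_parent("td")

# find_next_sibling("td") skips the newline text node between the cells.
value_td = label_td.find_next_sibling("td")
value = value_td.get_text(strip=True)
print(value)
```

If the label is missing, soup.find returns None and the next line raises an AttributeError, so real code would want to check for that first.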
Upvotes: 0