Reputation: 163
I have a long table in an HTML document where the rows are flat siblings, so the tags aren't nested within each other. It looks like this:
<tr>
<td>A</td>
</tr>
<tr>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
</tr>
<tr>
<td class ="y">...</td>
<td class ="y">...</td>
<td class ="y">...</td>
<td class ="y">...</td>
</tr>
<tr>
<td>B</td>
</tr>
<tr>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
</tr>
<tr>
<td class ="y">I want this</td>
<td class ="y">and this</td>
<td class ="y">and this</td>
<td class ="y">and this</td>
</tr>
So first I want to search the tree to find "B". Then I want to grab the text of every td tag with class y that comes after B but before the table starts over with the next heading row, "C".
I've tried this:
results = soup.find_all('td')
for result in results:
    if result.string == "B":
        print(result.string)
This gets me the string B that I want. But now I am trying to find all the matching cells after it, and I'm not getting what I want.
for results in soup.find_all('td'):
    if results.string == 'B':
        a = results.find_next('td', class_='y')
This gives me the next td with class y after the 'B', which is what I want, but I can only seem to get that first td tag. I want to grab all of the tags that have class y after 'B' but before 'C' (C isn't shown in the HTML, but follows the same pattern), and I want to put their text into a list.
My resulting list would be:
[['I want this'],['and this'],['and this'],['and this']]
Upvotes: 4
Views: 6920
Reputation: 473863
Basically, you need to locate the td element containing the text B and take its parent tr. This is your starting point.
Then, check every following tr sibling of this element using find_next_siblings():
start = soup.find("td", text="B").parent
for tr in start.find_next_siblings("tr"):
    # exit if reached C
    if tr.find("td", text="C"):
        break
    # get all tds with a desired class
    tds = tr.find_all("td", class_="y")
    for td in tds:
        print(td.get_text())
Tested on your example data, it prints:
I want this
and this
and this
and this
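Since the question asks for a list rather than printed output, here is a minimal sketch of the same sibling walk that collects the text instead. The helper name cells_between, the shortened sample HTML, and the choice of the "html.parser" backend are assumptions made for illustration. It also uses string= rather than text=, which should work on Beautiful Soup 4.4.0 or newer; on older versions, text= as above is the equivalent.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumed sample data).
html = """
<table>
<tr><td>A</td></tr>
<tr><td class="x">x1</td><td class="x">x2</td></tr>
<tr><td class="y">skip this</td><td class="y">and this</td></tr>
<tr><td>B</td></tr>
<tr><td class="x">x3</td><td class="x">x4</td></tr>
<tr><td class="y">I want this</td><td class="y">and this</td></tr>
<tr><td>C</td></tr>
<tr><td class="y">too late</td></tr>
</table>
"""


def cells_between(soup, start_label, stop_label, css_class):
    """Collect text of td.css_class cells in rows after start_label, up to stop_label.

    Hypothetical helper: walks the tr siblings that follow the row holding
    start_label and stops at the first row containing stop_label.
    """
    start_td = soup.find("td", string=start_label)
    if start_td is None:
        return []  # start label not present in the document

    collected = []
    for tr in start_td.parent.find_next_siblings("tr"):
        # Stop once the next heading row is reached.
        if tr.find("td", string=stop_label):
            break
        for td in tr.find_all("td", class_=css_class):
            collected.append(td.get_text(strip=True))
    return collected


soup = BeautifulSoup(html, "html.parser")
result = cells_between(soup, "B", "C", "y")
print(result)
```

If the nested shape from the question is really needed, wrapping each item afterwards with [[text] for text in result] would produce it.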
Upvotes: 3