Reputation: 25
How do I find all tags that contain a tag with a certain class? The data is:
<tr>
<td class="TDo1" width=17%>Tournament</td>
<td class="TDo2" width=8%>Date</td>
<td class="TDo2" width=6%>Pts.</td>
<td class="TDo2" width=34%>Pos. Player (team)</td>
<td class="TDo5" width=35%>Pos. Opponent (team)</td>
</tr>
<tr>
<td class=TDq1><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class=TDq2><a href="p.pl?t=410&r=4">17.02.02</a></td>
<td class=TDq3>34/75</td>
<td class=TDq5>39. John Deep</td>
<td class=TDq9>68. <a href="p.pl?ply=1229">Mark Deep</a></td>
</tr>
<tr>
<td class=TDp1><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class=TDp2><a href="p.pl?t=410&r=4">17.02.02</a></td>
<td class=TDp3>34/75</td>
<td class=TDp6>39. John Deep</td>
<td class=TDp8>7. <a href="p.pl?ply=10">Darius Star</a></td>
</tr>
I am trying
for mtable in bs.find_all('tr', text=re.compile(r'class=TD?3')):
    print(mtable)
but this returns zero results.
Upvotes: 2
Views: 58
Reputation: 4720
You could use CSS selectors to get the tags with class "TD...3" and then get their parent tags:
mtables = [s.parent for s in bs.select('*[class^="TD"][class$="3"]')]
# if you want tr only:
# mtables = [s.parent for s in bs.select('tr *[class^="TD"][class$="3"]')]
mtables = list(set(mtables)) # no duplicates
However, this will not work if a tag has multiple classes (unless the first one starts with "TD" and the last ends in "3"), because the selectors test the class attribute as a whole, and you can't limit the characters in between.
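To make the multiple-class caveat concrete, here is a small sketch. The HTML is made up for illustration (it is not the asker's page), and it assumes the selector is evaluated against the whole space-joined class attribute, as described above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only to show the multiple-class behaviour
html = """
<table>
<tr><td class="TDq3">single class</td></tr>
<tr><td class="TDq3 extra">TD class first, other class last</td></tr>
<tr><td class="extra TDq3">other class first</td></tr>
<tr><td class="TDq1 x3">first starts with TD, last ends in 3</td></tr>
</table>
"""
bs = BeautifulSoup(html, "html.parser")

# ^= and $= look at the start and end of the full attribute value
matched = [td.get_text() for td in bs.select('td[class^="TD"][class$="3"]')]
print(matched)
```

The second and third cells carry the wanted class but are skipped, while the last one is picked up even though none of its classes looks like TD?3.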
You could use find_all and find, each with a lambda, like this:
tagPat = '^TD.3$'
# tagPat = '^TD.*3$' # if there might be more than one character between TD and 3
mtables = bs.find_all(
    lambda p:
        p.name == 'tr' and # remove if you want all tags and not just tr
        p.find(
            lambda t: t.get('class') is not None and
                any(re.search(tagPat, c) for c in t.get('class')),
            recursive=False # direct children only; prevents also getting the outer tr in <tr><tr><td class="TDo3">someval</td></tr></tr>
        ))
If you don't want to use lambda, you can replace the select in the first method with a regex and find_all:
tagPat = '^TD.3$' # '^TD.*3$' #
mtables = [
    s.parent for s in bs.find_all(class_ = re.compile(tagPat))
    if s.parent.name == 'tr' # remove if you want all tags and not just tr
]
mtables = list(set(mtables)) # no duplicates
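One thing to keep in mind is that a set does not preserve document order. If the order of the rows matters, a sketch like the following deduplicates by object identity instead. The sample HTML and the names seen and ordered are only for illustration. Deduplicating by id() should also keep two separate rows whose markup happens to be identical, which a set may merge (that is my assumption about how bs4 compares tags, so check it against your version):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the first row has two matching cells
html = """
<table>
<tr><td class="TDq3">a</td><td class="TDx3">b</td></tr>
<tr><td class="TDp3">c</td></tr>
<tr><td class="TDp1">d</td></tr>
</table>
"""
bs = BeautifulSoup(html, "html.parser")

seen = set()   # ids of rows already collected
ordered = []   # rows in document order, each at most once
for cell in bs.find_all(class_=re.compile(r'^TD.3$')):
    row = cell.parent
    if row.name == 'tr' and id(row) not in seen:
        seen.add(id(row))
        ordered.append(row)

for row in ordered:
    print(row)
```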
For the HTML in your question, all 3 methods lead to the same rows (the set-based ones possibly in a different order). You can print them with
for mtable in mtables: print('---\n', mtable, '\n---')
and get the output
---
<tr>
<td class="TDq1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class="TDq2"><a href="p.pl?t=410&r=4">17.02.02</a></td>
<td class="TDq3">34/75</td>
<td class="TDq5">39. John Deep</td>
<td class="TDq9">68. <a href="p.pl?ply=1229">Mark Deep</a></td>
</tr>
---
---
<tr>
<td class="TDp1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class="TDp2"><a href="p.pl?t=410&r=4">17.02.02</a></td>
<td class="TDp3">34/75</td>
<td class="TDp6">39. John Deep</td>
<td class="TDp8">7. <a href="p.pl?ply=10">Darius Star</a></td>
</tr>
---
Upvotes: 0
Reputation: 13
This may help you:
from bs4 import BeautifulSoup
import re
t = 'your page source'
# note: this only finds unquoted attributes such as class=TDq3
pat = re.compile(r'class=TD.3')
classes = re.findall(pat, t)
classes = [j[6:] for j in classes] # drop the leading 'class='
soup = BeautifulSoup(t, 'html.parser')
result = list()
for i in classes:
    item = soup.find_all(attrs={"class": i})
    result.extend(item)
for i in result:
    print(i.parent)
Upvotes: 0
Reputation: 195573
I suppose you want to find all <tr> that contain any tag with class TD<any character>3:
import re
from bs4 import BeautifulSoup
# `html` contains your html from the question
soup = BeautifulSoup(html, "html.parser")
pat = re.compile(r"TD.3")
for tr in soup.find_all(
    lambda tag: tag.name == "tr"
    and tag.find(class_=lambda cl: cl and pat.match(cl))
):
    print(tr)
Prints:
<tr>
<td class="TDq1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class="TDq2"><a href="p.pl?t=410&r=4">17.02.02</a></td>
<td class="TDq3">34/75</td>
<td class="TDq5">39. John Deep</td>
<td class="TDq9">68. <a href="p.pl?ply=1229">Mark Deep</a></td>
</tr>
<tr>
<td class="TDp1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class="TDp2"><a href="p.pl?t=410&r=4">17.02.02</a></td>
<td class="TDp3">34/75</td>
<td class="TDp6">39. John Deep</td>
<td class="TDp8">7. <a href="p.pl?ply=10">Darius Star</a></td>
</tr>
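A small caveat: pat.match only anchors at the start of the class name, so a class like TDq35 would also count. If that matters, pat.fullmatch is one way to tighten it. The sketch below uses made-up HTML and a hypothetical helper rows() just to compare the two:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with one exact and one longer class name
html = """
<table>
<tr><td class="TDq3">exact</td></tr>
<tr><td class="TDq35">longer class name</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
pat = re.compile(r"TD.3")

def rows(class_test):
    # all <tr> that contain a tag whose class passes class_test
    return soup.find_all(
        lambda tag: tag.name == "tr" and tag.find(class_=class_test)
    )

loose = rows(lambda cl: cl and pat.match(cl))       # prefix match
strict = rows(lambda cl: cl and pat.fullmatch(cl))  # whole class name
print(len(loose), len(strict))
```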
Upvotes: 1
Reputation: 16081
You need to match on the td elements, not the tr. Like this,
In [1]: bs.find_all('td', {"class": re.compile(r'TD\w\d')})
Out[1]:
[<td class="TDo1" width="17%">Tournament</td>,
<td class="TDo2" width="8%">Date</td>,
<td class="TDo2" width="6%">Pts.</td>,
<td class="TDo2" width="34%">Pos. Player (team)</td>,
<td class="TDo5" width="35%">Pos. Opponent (team)</td>,
<td class="TDq1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>,
<td class="TDq2"><a href="p.pl?t=410&r=4">17.02.02</a></td>,
<td class="TDq3">34/75</td>,
<td class="TDq5">39. John Deep</td>,
<td class="TDq9">68. <a href="p.pl?ply=1229">Mark Deep</a></td>,
<td class="TDp1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>,
<td class="TDp2"><a href="p.pl?t=410&r=4">17.02.02</a></td>,
<td class="TDp3">34/75</td>,
<td class="TDp6">39. John Deep</td>,
<td class="TDp8">7. <a href="p.pl?ply=10">Darius Star</a></td>]
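The pattern above matches every cell, header row included. To get only the rows with a TD?3 cell, one option is to narrow the pattern and walk up with find_parent. This is a sketch on shortened, made-up HTML. It also shows why the attempt in the question finds nothing: a text/string filter is compared with a tag's text content, not with its attributes (string= is the newer name for text=):

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the table from the question
html = """
<table>
<tr><td class="TDo1">Tournament</td><td class="TDo2">Pts.</td></tr>
<tr><td class="TDq1">GpWl(op)</td><td class="TDq3">34/75</td></tr>
</table>
"""
bs = BeautifulSoup(html, "html.parser")

# A string filter never sees 'class=...', so nothing matches
attempt = bs.find_all("tr", string=re.compile(r"class=TD.3"))

# Match the cells by class, then take the enclosing row of each
cells = bs.find_all("td", {"class": re.compile(r"^TD\w3$")})
rows = [td.find_parent("tr") for td in cells]

print(attempt)
print(rows)
```

If a row can contain more than one matching cell, the same row will appear more than once in rows and needs deduplicating.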
Upvotes: 0