Tomasz Maciążek

Reputation: 25

BeautifulSoup to find a HTML tag that contains tags with specific class

How can I find all tags that contain a tag with a certain class? The data is:

<tr>
<td class="TDo1" width=17%>Tournament</td>
<td class="TDo2" width=8%>Date</td>
<td class="TDo2" width=6%>Pts.</td>
<td class="TDo2" width=34%>Pos. Player (team)</td>
<td class="TDo5" width=35%>Pos. Opponent (team)</td>
</tr>

<tr>
<td class=TDq1><a href="p.pl?t=410">GpWl(op)&nbsp;4.01/02</a></td>
<td class=TDq2><a href="p.pl?t=410&r=4">17.02.02</a></td>
<td class=TDq3>34/75</td>
<td class=TDq5>39. John Deep</td>
<td class=TDq9>68. <a href="p.pl?ply=1229">Mark Deep</a></td>
</tr>

<tr>
<td class=TDp1><a href="p.pl?t=410">GpWl(op)&nbsp;4.01/02</a></td>
<td class=TDp2><a href="p.pl?t=410&r=4">17.02.02</a></td>
<td class=TDp3>34/75</td>
<td class=TDp6>39. John Deep</td>
<td class=TDp8>7. <a href="p.pl?ply=10">Darius Star</a></td>
</tr>

I am trying

for mtable in bs.find_all('tr', text=re.compile(r'class=TD?3')):
    print(mtable)

but this returns zero results.

Upvotes: 2

Views: 58

Answers (4)

Driftr95

Reputation: 4720

You could use CSS selectors to get the tags whose class attribute starts with "TD" and ends with "3", and then take their parent tags:

mtables = [s.parent for s in bs.select('*[class^="TD"][class$="3"]')]

# if you want tr only:
# mtables = [s.parent for s in bs.select('tr *[class^="TD"][class$="3"]')]

mtables = list(set(mtables)) # no duplicates

However, this will not work if a tag has multiple classes (unless the first one starts with "TD" and the last one ends in "3"), and you can't restrict the characters in between.
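To make that limitation concrete, here is a minimal sketch. The three-row table is made up for illustration and is not from the question. The attribute selectors are applied to the whole class attribute string, so an extra class after "TDq3" hides a real match, while an unrelated pair of classes can produce a false one.

```python
from bs4 import BeautifulSoup

# Hypothetical rows, chosen only to show the multi-class behaviour.
html = """
<table>
<tr><td class="TDq3">single class</td></tr>
<tr><td class="TDq3 wide">missed</td></tr>
<tr><td class="TDa x3">false positive</td></tr>
</table>
"""
bs = BeautifulSoup(html, "html.parser")

# ^= and $= test the start and end of the full attribute value,
# not of each individual class name.
selector = '[class^="TD"][class$="3"]'
matched = [td.get_text() for td in bs.select(selector)]
print(matched)
```

The second cell does carry a class of the form TD?3 but is skipped because the attribute value ends in "wide"; the third cell has no such class but is returned anyway.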


You could also nest two lambdas, one in find_all and one in find, like this:

import re

tagPat = '^TD.3$'
# tagPat = '^TD.*3$' # if there might be more than one character between TD and 3
mtables = bs.find_all(
  lambda p:
  p.name == 'tr' and # remove if you want all tags and not just tr
  p.find(
    lambda t: t.get('class') is not None and
    any(re.search(tagPat, c) for c in t.get('class'))  # test each class name separately
    , recursive=False # direct children only; prevents getting the outer tr of <tr><tr><td class="TDo3">someval</td></tr></tr>
  ))

If you don't want to use lambda, you can replace the select in the first method with find_all and a compiled regex:

tagPat = '^TD.3$' # or '^TD.*3$' if there might be more than one character between TD and 3
mtables = [
    s.parent for s in bs.find_all(class_ = re.compile(tagPat))
    if s.parent.name == 'tr' # remove if you want all tags and not just tr
]
mtables = list(set(mtables)) # no duplicates


For the HTML in your question, all three methods return the same rows. You can print them with

for mtable in mtables: print('---\n', mtable, '\n---')

and get the output

---
 <tr>
<td class="TDq1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class="TDq2"><a href="p.pl?t=410&amp;r=4">17.02.02</a></td>
<td class="TDq3">34/75</td>
<td class="TDq5">39. John Deep</td>
<td class="TDq9">68. <a href="p.pl?ply=1229">Mark Deep</a></td>
</tr> 
---
---
 <tr>
<td class="TDp1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class="TDp2"><a href="p.pl?t=410&amp;r=4">17.02.02</a></td>
<td class="TDp3">34/75</td>
<td class="TDp6">39. John Deep</td>
<td class="TDp8">7. <a href="p.pl?ply=10">Darius Star</a></td>
</tr> 
---
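One caveat about list(set(mtables)): a set gives no ordering guarantee, so the rows may come back in a different order than they appear in the page. The sketch below is one possible alternative, not part of the methods above. It keeps document order and removes duplicates by object identity. The helper name rows_with_class and the sample table are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical table: the second row has two matching cells,
# so its <tr> would be collected twice without deduplication.
html = """<table>
<tr><td class="TDo1">Tournament</td><td class="TDo2">Date</td></tr>
<tr><td class="TDq3">34/75</td><td class="TDr3">1/2</td></tr>
<tr><td class="TDp3">12/40</td><td class="TDp6">39. John Deep</td></tr>
</table>"""
bs = BeautifulSoup(html, "html.parser")


def rows_with_class(soup, pattern):
    """Return each <tr> that encloses a tag whose class matches `pattern`,
    once per row and in document order."""
    seen = set()   # ids of rows already collected
    rows = []
    for cell in soup.find_all(class_=pattern):
        row = cell.find_parent("tr")
        if row is not None and id(row) not in seen:
            seen.add(id(row))
            rows.append(row)
    return rows


mtables = rows_with_class(bs, re.compile(r"^TD.3$"))
for mtable in mtables:
    print(mtable)
```

Using find_parent("tr") instead of .parent also covers the case where the matching tag is nested deeper inside the cell.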

Upvotes: 0

amirmohamad

Reputation: 13

This may help you:

from bs4 import BeautifulSoup
import re

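t = 'your page source'
# Scan the raw source for unquoted attributes such as class=TDq3
pat = re.compile(r'class=TD.3')
classes = re.findall(pat, t)
classes = [j[6:] for j in classes]  # drop the leading "class="
soup = BeautifulSoup(t, 'html.parser')  # name the parser explicitly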
result = list()
for i in classes:
    item = soup.find_all(attrs={"class": i})
    result.extend(item)
for i in result:
    print(i.parent)
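Note that the pattern above runs on the raw page source, so it only sees attributes written without quotes, as in the question. A small sketch of the difference follows. The two sample strings and the tolerant pattern are assumptions added for illustration, not part of the answer.

```python
import re

raw_pat = re.compile(r"class=TD.3")

unquoted = "<td class=TDq3>34/75</td>"
quoted = '<td class="TDq3">34/75</td>'

hits_unquoted = raw_pat.findall(unquoted)  # the unquoted form is found
hits_quoted = raw_pat.findall(quoted)      # the quote after '=' blocks the match

# Hypothetical variant: allow an optional quote and capture only the class name.
tolerant = re.compile(r"""class=["']?(TD.3)(?=["'\s>])""")
names = tolerant.findall(unquoted) + tolerant.findall(quoted)
print(hits_unquoted, hits_quoted, names)
```

Matching on the parsed tree instead of the raw source avoids the quoting question entirely.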

Upvotes: 0

Andrej Kesely

Reputation: 195573

I suppose you want to find all <tr> tags that contain any tag with class TD<any character>3:

import re

from bs4 import BeautifulSoup

# `html` contains your html from the question
soup = BeautifulSoup(html, "html.parser")
pat = re.compile(r"TD.3")

for tr in soup.find_all(
    lambda tag: tag.name == "tr"
    and tag.find(class_=lambda cl: cl and pat.match(cl))
):
    print(tr)

Prints:

<tr>
<td class="TDq1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class="TDq2"><a href="p.pl?t=410&amp;r=4">17.02.02</a></td>
<td class="TDq3">34/75</td>
<td class="TDq5">39. John Deep</td>
<td class="TDq9">68. <a href="p.pl?ply=1229">Mark Deep</a></td>
</tr>
<tr>
<td class="TDp1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>
<td class="TDp2"><a href="p.pl?t=410&amp;r=4">17.02.02</a></td>
<td class="TDp3">34/75</td>
<td class="TDp6">39. John Deep</td>
<td class="TDp8">7. <a href="p.pl?ply=10">Darius Star</a></td>
</tr>
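One detail worth knowing: pat.match only anchors the pattern at the start of the class name, so a longer class such as "TDq30" would also pass. If that matters for your data, pat.fullmatch is a possible replacement. The sketch below compares the two. The helper rows_matching and the sample rows are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical rows: an exact class, a longer class, and a multi-class cell.
html = """<table>
<tr><td class="TDq3">exact</td></tr>
<tr><td class="TDq30">longer class</td></tr>
<tr><td class="TDq5 TDx3">second class matches</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
pat = re.compile(r"TD.3")


def rows_matching(soup, test):
    """Return the text of each <tr> that has a descendant whose class passes `test`."""
    def class_ok(cl):
        # cl is None for tags without a class attribute
        return cl is not None and test(cl) is not None

    def is_wanted_row(tag):
        return tag.name == "tr" and tag.find(class_=class_ok) is not None

    return [tr.get_text(strip=True) for tr in soup.find_all(is_wanted_row)]


prefix_rows = rows_matching(soup, pat.match)      # anchored at the start only
exact_rows = rows_matching(soup, pat.fullmatch)   # whole class name must match
print(prefix_rows)
print(exact_rows)
```

Because the class_ function is tried against each class name, the multi-class cell is found in both cases; only the "TDq30" row differs.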

Upvotes: 1

Rahul K P

Reputation: 16081

You need to match on the td tags rather than on tr. Like this:

In [1]: bs.find_all('td', {"class": re.compile(r'TD\w\d')})
Out[1]: 
[<td class="TDo1" width="17%">Tournament</td>,
 <td class="TDo2" width="8%">Date</td>,
 <td class="TDo2" width="6%">Pts.</td>,
 <td class="TDo2" width="34%">Pos. Player (team)</td>,
 <td class="TDo5" width="35%">Pos. Opponent (team)</td>,
 <td class="TDq1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>,
 <td class="TDq2"><a href="p.pl?t=410&amp;r=4">17.02.02</a></td>,
 <td class="TDq3">34/75</td>,
 <td class="TDq5">39. John Deep</td>,
 <td class="TDq9">68. <a href="p.pl?ply=1229">Mark Deep</a></td>,
 <td class="TDp1"><a href="p.pl?t=410">GpWl(op) 4.01/02</a></td>,
 <td class="TDp2"><a href="p.pl?t=410&amp;r=4">17.02.02</a></td>,
 <td class="TDp3">34/75</td>,
 <td class="TDp6">39. John Deep</td>,
 <td class="TDp8">7. <a href="p.pl?ply=10">Darius Star</a></td>]
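As the output shows, TD\w\d matches every cell because \d accepts any digit. If only the TD?3 cells and their rows are wanted, a possible follow-up is to fix the digit and step up to the parent. This is a sketch on a shortened, made-up version of the table, and the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the question's table (unquoted classes kept as in the source).
html = """
<table>
<tr><td class="TDo1">Tournament</td><td class="TDo2">Pts.</td></tr>
<tr><td class=TDq1>GpWl(op)</td><td class=TDq3>34/75</td></tr>
<tr><td class=TDp1>GpWl(op)</td><td class=TDp3>34/75</td></tr>
</table>
"""
bs = BeautifulSoup(html, "html.parser")

# Any digit after "TD" plus one word character: every cell matches.
all_cells = bs.find_all("td", {"class": re.compile(r"TD\w\d")})

# Digit fixed to 3 and the pattern anchored at both ends.
third_cells = bs.find_all("td", {"class": re.compile(r"^TD\w3$")})

# Each matching cell sits directly inside its row here, so .parent is the <tr>.
rows = [td.parent for td in third_cells]
print(len(all_cells), len(third_cells), [row.name for row in rows])
```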

Upvotes: 0
