Reputation: 3
I am trying to scrape the info below from a web page. This is the relevant HTML:
<tr class="owner">
<td id="P184" class="ownerP" colspan="4">
<ul>
<li><span class="detailType">name:</span><span class="detail">merry</span></li>
<li><a title="sendmessage" class="sendMessageLink" onclick="return openSendMessage('/sendMessage.php',20205" href="" tabindex="0"><span></span>sendmessage</a> <span class="remark_soft">(by pm system)</span></li>
<li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li>
<li><span class="detailType"></span></li>
</ul>
</td>
</tr>
I only want to get this element (the phone number):
<a class="detail" href="tel:0387362531">0387362531</a>
Here is my code, but it doesn't work:
for details in soup.find_all(attrs={"class": "detail"}):
    re_res = re.search(r"tel:\('.*?',(\d+)\)", details['href'])
    print(re_res)
Upvotes: 0
Views: 58
Reputation: 5274
You are pretty close, here you go:
import re
from bs4 import BeautifulSoup
html_doc = """
<tr class="owner"><td id="P184" class="ownerP" colspan="4"><ul>
<li><span class="detailType">name:</span><span class="detail">merry</span></li>
<li><a title="sendmessage" class="sendMessageLink" onclick="return openSendMessage('/sendMessage.php',20205" href="" tabindex="0"><span></span>sendmessage</a> <span class="remark_soft">(by pm system)</span></li><li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li><li><span class="detailType"></span></li>
</ul></td></tr>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for details in soup.find_all(attrs={"class": "detail"}):
    if "href" in details.attrs and re.search("^tel:", details.attrs["href"]):
        print(details.text)
Output:
0387362531
I'm simply looping through the details list you've made, and if I find an element that has an href and that href starts with tel:, I print its text.
Upvotes: 1
Reputation: 22440
You can get the same result without using a regex. Try one of the approaches below:
from bs4 import BeautifulSoup
html_doc = """
<tr class="owner"><td id="P184" class="ownerP" colspan="4"><ul>
<li><span class="detailType">name:</span><span class="detail">merry</span></li>
<li><a title="sendmessage" class="sendMessageLink" onclick="return openSendMessage('/sendMessage.php',20205" href="" tabindex="0"><span></span>sendmessage</a> <span class="remark_soft">(by pm system)</span></li><li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li><li><span class="detailType"></span></li>
</ul></td></tr>
"""
Using .select():
soup = BeautifulSoup(html_doc, 'html.parser')
for telephone in soup.select("a[href^='tel:']"):
    if "detail" in telephone['class']:
        print(telephone.text)
Or with .find_all():
soup = BeautifulSoup(html_doc, 'html.parser')
for telephone in soup.find_all("a", class_="detail"):
    if telephone['href'].startswith('tel:'):
        print(telephone.text)
They both produce the same output:
0387362531
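The class check in the .select() variant above can also be folded into the selector itself. This is only a sketch: it assumes a reasonably recent Beautiful Soup whose select() accepts a class selector combined with an attribute-prefix selector, and the shortened markup and the name numbers are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumed for illustration).
html_doc = """
<ul>
<li><span class="detail">merry</span></li>
<li><a class="sendMessageLink" href="">sendmessage</a></li>
<li><a class="detail" href="tel:0387362531">0387362531</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Match <a> tags that have class "detail" AND an href starting with "tel:".
numbers = [a.text for a in soup.select('a.detail[href^="tel:"]')]
print(numbers)
```

The span with class detail and the link with an empty href are both skipped, because neither satisfies the whole selector.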
Upvotes: 0
Reputation: 163577
You have to pass the element type a to find_all; otherwise the span with class detail is matched too, and it has no href. Your regex tel:\('.*?',(\d+)\) also tries to match an opening and a closing parenthesis, \( and \), which are not in the href.
You could update your regex to tel:(\d+) to match tel: followed by one or more digits in a capturing group (group 1), which you can retrieve with re_res.group(1).
For example:
for details in soup.find_all('a', attrs={"class": "detail"}):
    re_res = re.search(r"tel:(\d+)", details['href'])
    print(re_res.group(1))  # 0387362531
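One caveat with the loop above: re.search returns None when an href does not contain tel: followed by digits, so calling .group(1) directly would raise an AttributeError on such a link. A minimal sketch of a guarded version, using only the standard re module; the helper name extract_phone and the sample href values are hypothetical, not part of the original answer.

```python
import re

# Same pattern as in the answer: "tel:" followed by one or more digits.
TEL_PATTERN = re.compile(r"tel:(\d+)")

def extract_phone(href):
    """Return the digits after 'tel:' in href, or None if there is no match."""
    match = TEL_PATTERN.search(href)
    return match.group(1) if match else None

# Sample href values (assumed for illustration): a phone link and an empty link.
hrefs = ["tel:0387362531", ""]
phones = [extract_phone(h) for h in hrefs]
print(phones)
```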
Upvotes: 0
Reputation: 69
You should replace soup.find_all(attrs={"class": "detail"}) with soup.find_all('a', attrs={"class": "detail"})[0], assigned directly to details instead of being looped over, so that the span does not end up in details as well.
Moreover, your regex does not work; this one should: tel:(\d+). But rather than using a regex, why not just get the a tag's text with details.text?
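The idea of reading the tag text directly can be sketched as follows. Note that this swaps in find(), which returns the first match or None, in place of find_all(...)[0], which would raise an IndexError when nothing matches. The shortened markup and the variable name phone are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumed for illustration).
html_doc = """
<ul>
<li><span class="detailType">name:</span><span class="detail">merry</span></li>
<li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Restricting the search to <a> tags skips the <span class="detail"> element.
details = soup.find("a", attrs={"class": "detail"})

# Guard against a page that has no such link at all.
phone = details.text if details is not None else None
print(phone)
```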
Upvotes: 0