Yuliia

Reputation: 39

How do I parse a link by its text?

I have the following HTML content:

<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
</td>
</tr>
<tr>
<td width="80%">   </td>
<td align="right" bgcolor="#FFFF80" style="font-size: 9pt;">
<p><a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
</td>
</tr>
<tr>
<td colspan="2" style="font-size: 9pt;color: red;">
<p>This Recommendation includes an electronic attachment containing the ASN.1 definitions for the IN CS-3 SCF-SRF interface</p>
</td>
</tr>

I want to extract:

<a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a>

My code:

import requests
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"

q = requests.get(url)
result = q.content

soup = BeautifulSoup(result, 'html.parser')

Upvotes: 0

Views: 37

Answers (2)

Md. Fazlul Hoque

Reputation: 16187

To get the URL associated with the text Summary, select the <a> tag by its text and read its href:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"

q = requests.get(url)
result = q.content

soup = BeautifulSoup(result, 'html.parser')

# :-soup-contains matches an <a> whose text (including nested tags) contains "Summary"
link = soup.select_one('a:-soup-contains("Summary")').get('href')

# Resolve the relative href against the page URL instead of concatenating strings
print(urljoin(url, link))

Output:

https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt
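The question asks for the whole <a> element, not only its href. Here is a minimal offline sketch of that, assuming the two links from the question's snippet as sample input (trimmed to the <p> lines) and a hypothetical "Appendix" label to show what happens when nothing matches:

```python
from bs4 import BeautifulSoup

# Sample input: the two links from the question, trimmed to the <p> lines.
# A raw string keeps the backslashes in the href from being read as escapes.
html = r'''
<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
<p><a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
'''

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first matching Tag, or None when nothing matches.
tag = soup.select_one('a:-soup-contains("Summary")')
if tag is not None:
    print(tag)           # the complete <a ...>...</a> element
    print(tag["href"])   # only the link target

# "Appendix" is a made-up label, used only to show the no-match case.
missing = soup.select_one('a:-soup-contains("Appendix")')
print(missing)
```

Checking for None before calling .get('href') avoids an AttributeError if the page layout changes and the link text is no longer present.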

Upvotes: 1

laurentivs

Reputation: 27

If you want the contents and href of every <a> tag, you can loop over the result of find_all as follows:

# soup is the BeautifulSoup object built in the question
for a in soup.find_all('a', href=True):
    # print the tag's children and its href
    print(a.contents, a['href'])
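To narrow that loop down to the one link the question is after, the visible text can be compared inside the loop. This is only a sketch: links_with_text is a hypothetical helper name, and the sample HTML and its /toc.txt and /sum.txt paths are made up for illustration.

```python
from bs4 import BeautifulSoup

def links_with_text(html, wanted):
    """Return (text, href) pairs for <a> tags whose visible text equals `wanted`."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)    # skips anchors without href
        if a.get_text(strip=True) == wanted       # exact match after trimming
    ]

# Made-up sample: two real links plus one anchor that has no href.
sample = (
    '<p><a href="/toc.txt"><strong>Table of Contents </strong></a></p>'
    '<p><a href="/sum.txt"><strong>Summary </strong></a></p>'
    '<a name="x">Summary</a>'
)

pairs = links_with_text(sample, "Summary")
print(pairs)
```

Unlike :-soup-contains, which does a substring match, this compares the trimmed text exactly, so a link labelled "Summary of changes" would not be returned.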

Upvotes: 1
