Reputation: 2870
I'm trying to extract all the values (which are links) of the attribute data-src-mp3 in content1, which is generated from the URL.
The link is contained in <a class="hwd_sound sound audio_play_button icon-volume-up ptr" title="Pronunciation for " data-src-mp3="https://www.collinsdictionary.com/sounds/hwd_sounds/EN-GB-W0037420.mp3" data-lang="en_GB"></a>.
One method is to use the regex 'data-src-mp3="(.*?)"':
import re

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
url = 'https://www.collinsdictionary.com/dictionary/english-french/graduate'
r = session.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
# Children of the dictionary entry container
content1 = soup.select_one('.cB.cB-def.dictionary.biling').contents
# Pull every data-src-mp3 value out of the stringified markup
output = re.findall(r'data-src-mp3="(.*?)"', str(content1))
print(output)
The result is:
['https://www.collinsdictionary.com/sounds/hwd_sounds/EN-GB-W0037420.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/FR-W0037420.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/FR-W0071410.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/fr_bachelier.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/63854.mp3']
I would like to ask how to use BeautifulSoup and the structure <a class="hwd_sound sound audio_play_button icon-volume-up ptr" title="Pronunciation for " data-src-mp3="https://www.collinsdictionary.com/sounds/hwd_sounds/EN-GB-W0037420.mp3" data-lang="en_GB"></a> to obtain the same result without a loop.
Thank you so much!
Upvotes: 0
Views: 68
Reputation: 81614
You can combine selectors when using .select:
mp3s = [tag.attrs['data-src-mp3'] for tag in soup.select('.cB.cB-def.dictionary.biling [data-src-mp3]')]
or
mp3s = list(map(lambda tag: tag.attrs['data-src-mp3'],
                soup.select('.cB.cB-def.dictionary.biling [data-src-mp3]')))
[data-src-mp3] selects only elements that have the data-src-mp3 attribute (with any value).
With a small change to have 'data-src-mp3' in a single place:
mp3_tag = 'data-src-mp3'
mp3s = list(map(lambda tag: tag.attrs[mp3_tag],
                soup.select('.cB.cB-def.dictionary.biling [{}]'.format(mp3_tag))))
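This solution might look more intimidating at first, but it is much better than relying on the wrong tool (such as regex when parsing HTML): a regex like the one in the question stops matching as soon as the markup uses single quotes or adds whitespace around the equals sign, while the selector keeps working.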
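To try the combined selector without hitting the site, here is a minimal offline sketch. The HTML string and the example.com URLs are made up for illustration and are much simpler than the real Collins page; only the container classes and the data-src-mp3 attribute name come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup standing in for the real page.
html = """
<div class="cB cB-def dictionary biling">
  <a class="hwd_sound" data-src-mp3="https://example.com/a.mp3" data-lang="en_GB"></a>
  <span><a class="hwd_sound" data-src-mp3="https://example.com/b.mp3"></a></span>
  <a class="hwd_sound" href="#"></a>
</div>
<a class="hwd_sound" data-src-mp3="https://example.com/outside.mp3"></a>
"""

soup = BeautifulSoup(html, "html.parser")

# Container selector + space (descendant combinator) + attribute selector:
# any element with a data-src-mp3 attribute, at any depth inside the container.
selector = ".cB.cB-def.dictionary.biling [data-src-mp3]"
mp3s = [tag["data-src-mp3"] for tag in soup.select(selector)]

print(mp3s)
# ['https://example.com/a.mp3', 'https://example.com/b.mp3']
```

The link without the attribute and the link outside the container are both skipped, which is the filtering that the regex version got only by first cutting the document down to content1. Here tag["data-src-mp3"] is used as shorthand for tag.attrs["data-src-mp3"].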
Upvotes: 1