How to use BeautifulSoup to get the same result obtained by regex?

Question

I'm trying to extract all the values (which are links) of attribute data-src-mp3 in the content1 generated from the url.

The link is contained in .

One method is to use the regex 'data-src-mp3="(.*?)"':

import re

import requests
from bs4 import BeautifulSoup

session = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
url = 'https://www.collinsdictionary.com/dictionary/english-french/graduate'
r = session.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

content1 = soup.select_one('.cB.cB-def.dictionary.biling').contents
output = re.findall('data-src-mp3="(.*?)"', str(content1))

print(output)

The result is:

['https://www.collinsdictionary.com/sounds/hwd_sounds/EN-GB-W0037420.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/FR-W0037420.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/FR-W0071410.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/fr_bachelier.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/63854.mp3']

I would like to ask how to use BeautifulSoup and the HTML structure to obtain the same result without a loop.

Thank you so much!

DeepSpace · Accepted Answer

You can combine selectors when using .select:

mp3s = [tag.attrs['data-src-mp3'] for tag in soup.select('.cB.cB-def.dictionary.biling [data-src-mp3]')]

or

mp3s = list(map(lambda tag: tag.attrs['data-src-mp3'],
                soup.select('.cB.cB-def.dictionary.biling [data-src-mp3]')))

[data-src-mp3] selects only elements that have the data-src-mp3 attribute (with any value).
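To see how the combined selector behaves without hitting the network, here is a minimal sketch on a made-up snippet. The markup and the example.com URLs are assumptions for illustration, not the real Collins page. The space in the selector is the descendant combinator, so only tags inside the .cB.cB-def.dictionary.biling container are matched.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (assumption, not the actual HTML)
html = """
<div class="cB cB-def dictionary biling">
  <a class="sound" data-src-mp3="https://example.com/a.mp3"></a>
  <span>no audio here</span>
  <a class="sound" data-src-mp3="https://example.com/b.mp3"></a>
</div>
<div class="other">
  <a data-src-mp3="https://example.com/outside.mp3"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "<container> [data-src-mp3]": any descendant of the container that has the attribute
tags = soup.select(".cB.cB-def.dictionary.biling [data-src-mp3]")
mp3s = [tag["data-src-mp3"] for tag in tags]

print(mp3s)
# ['https://example.com/a.mp3', 'https://example.com/b.mp3']
```

The span without the attribute and the link outside the container are both skipped, which is the filtering the regex was doing implicitly on str(content1).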

With a small change to have 'data-src-mp3' in a single place:

mp3_tag = 'data-src-mp3'
mp3s = list(map(lambda tag: tag.attrs[mp3_tag],
                soup.select('.cB.cB-def.dictionary.biling [{}]'.format(mp3_tag))))

This solution might look more intimidating at first, but it is much better than relying on the wrong tool (such as regex for parsing HTML).
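One way to illustrate why regex is fragile here: the pattern from the question only matches double-quoted attribute values, while HTML also allows single quotes. The snippet below is a sketch on invented markup (the example.com URLs are placeholders) and applies the regex to the raw HTML string.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the first attribute uses single quotes, the second double quotes
html = (
    '<div class="cB cB-def dictionary biling">'
    "<a data-src-mp3='https://example.com/a.mp3'></a>"
    '<a data-src-mp3="https://example.com/b.mp3"></a>'
    "</div>"
)

# The regex from the question, applied to the raw HTML
regex_hits = re.findall('data-src-mp3="(.*?)"', html)

# The selector-based approach from the answer
soup = BeautifulSoup(html, "html.parser")
soup_hits = [tag["data-src-mp3"]
             for tag in soup.select(".cB.cB-def.dictionary.biling [data-src-mp3]")]

print(regex_hits)  # ['https://example.com/b.mp3']
print(soup_hits)   # ['https://example.com/a.mp3', 'https://example.com/b.mp3']
```

In the question the regex runs on str(content1), which BeautifulSoup has already re-serialized with double quotes, so it happens to work there. The parser is still doing the real work in that case.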
