Reputation: 2870
I'm running the code below:
import re
import requests
from bs4 import BeautifulSoup
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
url = 'https://www.collinsdictionary.com/dictionary/english-french/graduate'
r = session.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
content1 = soup.select_one('.cB.cB-def.dictionary.biling').contents
temp = re.findall('data-src-mp3="(.*?)"', content1)
and it raises this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-feb1029c98d3> in <module>
10
11 content1 = soup.select_one('.cB.cB-def.dictionary.biling').contents
---> 12 temp = re.findall('data-src-mp3="(.*?)"', content1)
C:\Anaconda3\lib\re.py in findall(pattern, string, flags)
239
240 Empty matches are included in the result."""
--> 241 return _compile(pattern, flags).findall(string)
242
243 def finditer(pattern, string, flags=0):
TypeError: expected string or bytes-like object
I think this is because content1 is a list, not the string that re.findall expects. It seems odd to me that soup.select_one would return a list; in the example below, it does not return a list.
from bs4 import BeautifulSoup
abc = """abcdd<div class="sense">xyz</div>"""
soup = BeautifulSoup(abc, 'html.parser')
content1 = soup.select_one('.sense')
print(content1)
Could you please elaborate on this issue?
Upvotes: 0
Views: 273
Reputation: 81654
.select_one does not return a list; it returns a single tag (as its name promises).
content1 = soup.select_one('.cB.cB-def.dictionary.biling')
print(type(content1))
# <class 'bs4.element.Tag'>
It is the content1.contents attribute that is a list:
print(type(content1.contents))
# <class 'list'>
This list holds the direct children of the content1 tag, that is, the nested tags and text strings it contains.
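To see the difference without hitting the network, here is a minimal sketch on a made-up HTML snippet. The markup and the variable names are illustrative assumptions, not the real Collins page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = '<div class="sense">xyz<a data-src-mp3="a.mp3">play</a></div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.select_one('.sense')   # a single Tag (None if nothing matches)
children = tag.contents           # a list of the tag's direct children

print(type(tag).__name__)         # Tag
print(type(children).__name__)    # list
print(len(children))              # 2: the text 'xyz' and the <a> tag
```

Passing children to re.findall fails for the same reason as in the question: it is a list, not a string.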
If you want the HTML as a string, you can use str(content1):
print(re.findall('data-src-mp3="(.*?)"', str(content1)))
outputs
['https://www.collinsdictionary.com/sounds/hwd_sounds/EN-GB-W0037420.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/FR-W0037420.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/FR-W0071410.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/fr_bachelier.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/63854.mp3']
However, I'm a bit confused by your choice to use regex. You are already using a proper HTML parser. Generally speaking, using regex to parse HTML should be avoided as HTML is not a regular language, so using a regular expression to parse it might not always work as expected.
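For what it's worth, the same URLs can probably be pulled out with the parser alone. This is a minimal sketch, assuming the audio links live in a data-src-mp3 attribute, as the regex above implies. The HTML snippet and the example.com URLs are made up so that the example runs offline:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure of the real page.
html = """
<div class="cB cB-def dictionary biling">
  <a class="sound" data-src-mp3="https://example.com/sounds/one.mp3"></a>
  <span>graduate</span>
  <a class="sound" data-src-mp3="https://example.com/sounds/two.mp3"></a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
content1 = soup.select_one('.cB.cB-def.dictionary.biling')

# The CSS attribute selector [data-src-mp3] matches every descendant
# that carries the attribute; tag['data-src-mp3'] reads its value.
mp3_urls = [tag['data-src-mp3'] for tag in content1.select('[data-src-mp3]')]
print(mp3_urls)
# ['https://example.com/sounds/one.mp3', 'https://example.com/sounds/two.mp3']
```

This avoids converting the tag back to a string and does not depend on how the attribute happens to be quoted in the raw HTML.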
Upvotes: 1