Akira

Reputation: 2870

Why does `soup.select_one` return a list?

I'm running the code below:

import re
import requests
from bs4 import BeautifulSoup
session = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
url = 'https://www.collinsdictionary.com/dictionary/english-french/graduate'
r = session.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

content1 = soup.select_one('.cB.cB-def.dictionary.biling').contents
temp = re.findall('data-src-mp3="(.*?)"', content1)

It then raises this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-feb1029c98d3> in <module>
     10 
     11 content1 = soup.select_one('.cB.cB-def.dictionary.biling').contents
---> 12 temp = re.findall('data-src-mp3="(.*?)"', content1)

C:\Anaconda3\lib\re.py in findall(pattern, string, flags)
    239 
    240     Empty matches are included in the result."""
--> 241     return _compile(pattern, flags).findall(string)
    242 
    243 def finditer(pattern, string, flags=0):

TypeError: expected string or bytes-like object

I think this is because content1 is a list, not a string as re.findall expects. It seems odd to me that soup.select_one returns a list. In the example below, it does not return a list:

from bs4 import BeautifulSoup

abc = """abcdd<div class="sense">xyz</div>"""
soup = BeautifulSoup(abc, 'html.parser')
content1 = soup.select_one('.sense')

print(content1)

Could you please elaborate on this issue?

Upvotes: 0

Views: 273

Answers (1)

DeepSpace

Reputation: 81654

.select_one does not return a list; it returns a single tag (as its name promises).

content1 = soup.select_one('.cB.cB-def.dictionary.biling')
print(type(content1))
# <class 'bs4.element.Tag'>

It is the content1.contents attribute that is a list:

print(type(content1.contents))
# <class 'list'>

This list holds the direct children of the content1 tag: its child tags and its text nodes (NavigableString objects).
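To see the difference without hitting the network, here is a minimal sketch on a made-up snippet (the HTML and the `.no-such-class` selector are only illustrative, not taken from the Collins page). It also shows that select_one gives None when nothing matches, which is worth checking before touching .contents.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the real page
html = '<div class="sense">abc<b>xyz</b>def</div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.select_one('.sense')
print(type(tag).__name__)
# Tag

# .contents is a plain list of the tag's direct children
children = tag.contents
child_types = [type(child).__name__ for child in children]
print(child_types)
# ['NavigableString', 'Tag', 'NavigableString']

# No match: select_one returns None rather than an empty list
missing = soup.select_one('.no-such-class')
print(missing)
# None
```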

If you want the HTML as a string you can use str(content1):

print(re.findall('data-src-mp3="(.*?)"', str(content1)))

outputs

['https://www.collinsdictionary.com/sounds/hwd_sounds/EN-GB-W0037420.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/FR-W0037420.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/FR-W0071410.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/fr_bachelier.mp3', 'https://www.collinsdictionary.com/sounds/hwd_sounds/63854.mp3']

However, I'm a bit confused by your choice of regex here, since you are already using a proper HTML parser. Generally speaking, parsing HTML with regex should be avoided: HTML is not a regular language, so a regular expression will not always work as expected.
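For instance, the same attribute values can probably be pulled out with the parser alone. This is only a sketch: the HTML and the example.com URLs below are a made-up stand-in for the real page, and I am assuming the data-src-mp3 attributes sit on tags inside the selected container.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
html = '''
<div class="cB cB-def dictionary biling">
  <a class="sound" data-src-mp3="https://example.com/sounds/one.mp3"></a>
  <span>graduate</span>
  <a class="sound" data-src-mp3="https://example.com/sounds/two.mp3"></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

container = soup.select_one('.cB.cB-def.dictionary.biling')

# The CSS attribute selector matches every descendant carrying the attribute;
# tag['data-src-mp3'] then reads its value directly, no regex involved
mp3_urls = [tag['data-src-mp3'] for tag in container.select('[data-src-mp3]')]
print(mp3_urls)
# ['https://example.com/sounds/one.mp3', 'https://example.com/sounds/two.mp3']
```

With the real page you would build soup from r.content as in the question and keep the rest unchanged.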

Upvotes: 1
