SciGuyMcQ

Reputation: 1043

BeautifulSoup find_all() Doesn't Find All Requested Elements

I am seeing some strange behavior with BeautifulSoup as demonstrated in the example below.

import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
paras = soup.find_all('p', string=pattern)
print(len(paras))  # expected to find 3 paragraphs containing the word "color"
  2
print(paras[0].prettify())
  <p class="blue">
    This paragraph has a color of blue.
  </p>

print(paras[1].prettify())
  <p>
    This paragraph does not have a color.
  </p>

As you can see, for some reason the first paragraph, <p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>, is not being picked up by find_all(...), and I cannot figure out why.

Upvotes: 1

Views: 1223

Answers (3)

Keyur Potdar

Reputation: 7238

The string argument relies on the tag's .string property, which expects the tag to contain only text and no nested tags. If you print .string for the first p tag, it returns None, since it has a <b> tag in it.

Or, to explain it better, the documentation says:

If a tag has only one child, and that child is a NavigableString, the child is made available as .string

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None

One way to work around this is to pass a function (here, a lambda) to find_all() and test the tag's full text instead.

from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')
print(first_p)
# <p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>
print(first_p.string)
# None
print(first_p.text)
# This has a color of red. Because it likes the color red

paras = soup.find_all(lambda tag: tag.name == 'p' and 'color' in tag.text.lower())
print(paras)
# [<p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>, <p class="blue">This paragraph has a color of blue.</p>, <p>This paragraph does not have a color.</p>]
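
To see why the call in the question matched exactly two paragraphs, it may help to look at .string for each <p> directly. This is only a minimal sketch reusing the question's HTML; the names strings and matched are illustrative and not part of the original code.

```python
import re
from bs4 import BeautifulSoup

html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.IGNORECASE)

# .string is None for a tag with more than one child (text plus a <b> tag here),
# and is the text itself for a tag whose only child is a string.
strings = [p.string for p in soup.find_all('p')]
for s in strings:
    print(s)

# Same call as in the question: the first paragraph is skipped
# because its .string is None.
matched = soup.find_all('p', string=pattern)
print(len(matched))
```

The first line printed is None, which lines up with the first paragraph being the one that goes missing.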

Upvotes: 2

SciGuyMcQ

Reputation: 1043

I haven't actually figured out why specifying the string parameter of find_all(...) (or text, for older versions of BeautifulSoup) doesn't give me what I want, but the following does give me a generalized solution.

pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
desired_tags = [tag for tag in soup.find_all('p') if pattern.search(tag.text) is not None]
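
For completeness, here is a self-contained sketch of the same idea wrapped in a small helper. The function name find_tags_by_text is hypothetical (it is not part of BeautifulSoup), and the HTML is the sample from the question.

```python
import re
from bs4 import BeautifulSoup


def find_tags_by_text(soup, name, pattern):
    """Return tags called `name` whose full text (nested tags included) matches `pattern`."""
    # get_text() joins the text of all descendants, so nested tags such as <b>
    # do not hide the paragraph the way .string does.
    return [tag for tag in soup.find_all(name) if pattern.search(tag.get_text())]


html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.IGNORECASE)

desired_tags = find_tags_by_text(soup, 'p', pattern)
print(len(desired_tags))  # 3
```

Filtering in Python after find_all() keeps the regex handling in one place, at the cost of visiting every candidate tag.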

Upvotes: 0

Kin2Park

Reputation: 23

If you want to grab the 'p' tags, you can just do:

import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

paras = soup.find_all('p')
for p in paras:
  print(p.get_text())

Upvotes: 0
