Sourabh Kaushal

Reputation: 15

Beautiful soup returns nothing

This is the HTML code:

<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>

I am trying to print "42263 - Unencrypted Telnet Server" using Beautiful Soup, but the output is an empty list, i.e. [].

This is my Python code:

from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2

with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

divs = soup.find_all('div', attrs={'background':'#fdc431'})

print(divs)

Upvotes: 1

Views: 586

Answers (2)

radzak

Reputation: 3118

Solution with regexes:

from bs4 import BeautifulSoup
import re

with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

Let's find the div whose style attribute matches the following regular expression: background:\s*#fdc431;. Here \s matches a single Unicode whitespace character. Since there may be zero or more whitespace characters after the colon, I added the * quantifier, which matches zero or more repetitions of the preceding expression. Regexes often come in handy, so it is worth reading more about them and trying patterns out in an online regex tester.

div = soup.find('div', attrs={'style': re.compile(r'background:\s*#fdc431;')})

This, however, is equivalent to:

div = soup.find('div', style=re.compile(r'background:\s*#fdc431;'))

You can read about that in the official Beautiful Soup documentation.

The sections about the kinds of filters you can pass to find and other similar methods are also worth reading.

You can supply a string, a regular expression, a list, True, or a function, as shown by Keyur Potdar in his answer.
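As a rough illustration of those filter kinds, here is a minimal sketch that applies each one to the style attribute. The three-div sample document and the variable names are made up for this example; only the first div resembles the one in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question's div.
html = (
    '<div style="background: #fdc431; color: #fff;">42263 - Unencrypted Telnet Server</div>'
    '<div style="color: #000;">other</div>'
    '<div>no style</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# String: the whole attribute value must be equal to it.
by_string = soup.find_all('div', style='color: #000;')

# Regular expression: searched for anywhere inside the attribute value.
by_regex = soup.find_all('div', style=re.compile(r'background:\s*#fdc431'))

# List: matches if the value equals any item in the list.
by_list = soup.find_all('div', style=['color: #000;', 'color: #111;'])

# True: the attribute only has to be present, whatever its value.
by_true = soup.find_all('div', style=True)

# Function: receives the attribute value; the None check covers tags without the attribute.
by_function = soup.find_all('div', style=lambda v: v is not None and '#fdc431' in v)

print([d.get_text() for d in by_regex])
```

Note that the string and list forms need the complete style value, which is why the regex or function forms are usually more practical for long inline styles.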

Assuming the div exists, we can get its text with:

>>> div.text
'42263 - Unencrypted Telnet Server'
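If the div might not exist, find returns None and div.text raises an AttributeError. A minimal sketch of a guarded lookup follows; the helper name text_for_background and the inline sample HTML are assumptions for illustration, not part of the original code.

```python
import re
from bs4 import BeautifulSoup

def text_for_background(html, colour):
    """Return the text of the first div whose inline style sets the given
    background colour, or None if no such div exists (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # re.escape keeps characters such as '#' literal inside the pattern.
    pattern = re.compile(r'background:\s*' + re.escape(colour))
    div = soup.find('div', style=pattern)
    return div.get_text(strip=True) if div is not None else None

# Sample markup shortened from the question's div.
sample = '<div style="padding: 5px 10px; background: #fdc431;">42263 - Unencrypted Telnet Server</div>'
print(text_for_background(sample, '#fdc431'))
print(text_for_background('<div>no style here</div>', '#fdc431'))
```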

Upvotes: 2

Keyur Potdar

Reputation: 7238

background is not an attribute of the div tag. The attributes of the div tag are:

{'xmlns': '', 'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'}

So, either you'll have to use

soup.find_all('div', attrs={'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'})

or you can use a lambda function to check whether background: #fdc431 is in the style attribute value, like this:

soup = BeautifulSoup('<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>', 'html.parser')
# t.get('style', '') avoids a KeyError on div tags that have no style attribute
print(soup.find(lambda t: t.name == 'div' and 'background: #fdc431' in t.get('style', '')).text)
# 42263 - Unencrypted Telnet Server

or, you can use RegEx, as shown by Jatimir in his answer.
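The point about attributes can be checked directly by inspecting the tag. This is a small sketch using a shortened version of the question's div (the shortened style value and the variable names are assumptions for illustration): it lists the attribute names, shows that filtering on a background attribute finds nothing, and shows that a substring check on style does.

```python
from bs4 import BeautifulSoup

# Shortened version of the div from the question.
html = ('<div xmlns="" style="padding: 5px 10px; background: #fdc431; color: #fff;">'
        '42263 - Unencrypted Telnet Server</div>')
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# The tag only has 'xmlns' and 'style'; 'background' lives inside the style value.
attribute_names = sorted(div.attrs)
print(attribute_names)

# Filtering on a non-existent 'background' attribute therefore returns an empty list.
no_match = soup.find_all('div', attrs={'background': '#fdc431'})
print(no_match)

# Checking for the substring inside the style value does find the div.
matches = soup.find_all(
    lambda t: t.name == 'div' and 'background: #fdc431' in t.get('style', '')
)
print([m.get_text() for m in matches])
```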

Upvotes: 2
