Reputation: 87
I am trying to scrape some words from an arbitrary website, but the following program shows no errors and produces no output when I try to print the results.
I have checked the code twice and even added an if statement to see whether the program is getting any words at all.
import requests
import operator
from bs4 import BeautifulSoup

def word_count(url):
    wordlist = []
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('a', {'class':'txt'}):
        word_string = post_text.string
        if word_string is not None:
            word = word_string.lower().split()
            for each_word in word:
                print(each_word)
                wordlist.append(each_word)
        else:
            print("None")

word_count('https://mumbai.craigslist.org/')
I am expecting all the words under class="txt" to be displayed in the output.
Upvotes: 3
Views: 101
Reputation: 3744
I visited https://mumbai.craigslist.org/ and found that there is no <a class="txt">, only <span class="txt">, so I think you can try this:
def word_count(url):
    wordlist = []
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('span', {'class':'txt'}):
        word_string = post_text.text
        if word_string is not None:
            word = word_string.lower().split()
            for each_word in word:
                print(each_word)
                wordlist.append(each_word)
        else:
            print("None")
It then outputs the words as expected:
community
activities
artists
childcare
classes
events
general
...
Hope that helps you, and comment if you have further questions. : )
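To reproduce the difference without hitting the live site, here is a minimal sketch on a hand-written HTML fragment. The fragment is an assumption modelled on the structure described above (a span with class txt nested inside an anchor); it is not a copy of the actual Craigslist markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class sits on the <span>, not on the <a>.
SAMPLE_HTML = """
<ul>
  <li><a href="/c/activities"><span class="txt">activities</span></a></li>
  <li><a href="/c/artists"><span class="txt">artists</span></a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")

# No <a> carries class "txt", so nothing matches and a loop over this never runs.
anchors = soup.find_all("a", {"class": "txt"})

# The <span> elements do carry the class, so these match.
spans = soup.find_all("span", {"class": "txt"})
words = [span.get_text().lower() for span in spans]

print(len(anchors))
print(words)
```

Because the first query matches nothing, neither the if branch nor the else branch in the original loop is ever reached, which would explain the silent run.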
Upvotes: 2
Reputation: 16782
OP: I am expecting all the words of the class text to be displayed in the output
The culprit:
for post_text in source.findAll('a', {'class':'txt'}):
The reason:
The anchor tag has no class txt,
but the span tag inside it does.
Hence:
import requests
from bs4 import BeautifulSoup

def word_count(url):
    source_code = requests.get(url)
    source = BeautifulSoup(source_code.text, features="html.parser")
    for post_text in source.findAll('a'):
        s_text = post_text.find('span', class_="txt")
        if s_text is not None:
            print(s_text.text)

word_count('https://mumbai.craigslist.org/')
OUTPUT:
community
activities
artists
childcare
classes
events
general
groups
local news
lost+found
missed connections
musicians
pets
.
.
.
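As a possible alternative to looping over every anchor and calling find on each, the same "span with class txt inside an anchor" relationship can be written as a CSS selector with select. This is only a sketch: the HTML below is made up for illustration, and unlike find, the selector returns every matching span inside an anchor rather than just the first one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the last link has no <span class="txt">.
SAMPLE_HTML = """
<a href="/c/events"><span class="txt">events</span></a>
<a href="/c/news"><span class="txt">local news</span></a>
<a href="/about">about</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")

# "a span.txt" matches every <span class="txt"> that is a descendant of an <a>.
labels = [span.get_text() for span in soup.select("a span.txt")]
print(labels)
```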
Upvotes: 3
Reputation: 1373
You are targeting the wrong elements.
If you use
print(source)
everything works fine, but as soon as you target elements with findAll, the selector matches nothing and you get back an empty list, so the loop body never runs.
If you replace
for post_text in source.findAll('a', {'class':'txt'}):
with
for post_text in source.find_all('a'):
everything seems to work fine.
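Since an empty result raises no error, it can help to check it explicitly before looping. The sketch below uses a hypothetical helper, texts_for, and a made-up HTML fragment; neither comes from the original code.

```python
from bs4 import BeautifulSoup

def texts_for(soup, name, css_class):
    """Return the text of every <name class=css_class> element; warn if none match."""
    matches = soup.find_all(name, class_=css_class)
    if not matches:
        # An empty result is silent by default, so report it explicitly.
        print(f"no <{name}> elements with class {css_class!r} found")
    return [m.get_text(strip=True) for m in matches]

# Hypothetical markup for illustration only.
SAMPLE_HTML = '<a href="/c/community"><span class="txt">community</span></a>'
soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")

empty = texts_for(soup, "a", "txt")     # warns: the <a> has no class "txt"
found = texts_for(soup, "span", "txt")  # matches the nested <span>
print(empty, found)
```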
Upvotes: 2