vantabeam

Reputation: 55

Parsing HTML using BeautifulSoup, select()

I'm trying to get the contents of the latest post using BeautifulSoup.
The latest post sometimes has a tag and sometimes does not.
I'd like to get the tag if it's there and, if it's not, just get the rest of the text.
My code is below.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = "https:// " 
req = requests.get(url, headers=headers)
html = req.text       
soup = BeautifulSoup(html, 'html.parser')                
link = soup.select('#flagList > div.clear.ab-webzine > div > a')       
title = soup.select('#flagList > div.clear.ab-webzine > div > div.wz-item-header > a > span')         
latest_link = link[0]['href'] # link of latest post (href assumed to be an absolute URL)
latest_title = title[0].text # title of latest post

# to get the text of latest post
t_url = latest_link
t_req = requests.get(t_url, headers=headers)
t_html = t_req.text
t_soup = BeautifulSoup(t_html, 'html.parser')  
maintext = t_soup.select('#flagArticle > div.rhymix_content.xe_content')
tag = t_soup.select_one('div.rd.clear > div.rd_body.clear > ul > li > a').get_text()

print(maintext)
print(tag)

The problem is that if the latest post has no tag, the code raises the following error:
AttributeError: 'NoneType' object has no attribute 'get_text'

If I remove .get_text() from that line and the latest post has no tag, it returns None.
If the tag exists, it returns <a href="/posts?search_target=tag&amp;search_keyword=ABC">ABC</a>,
but I want to get just ABC.

How can I fix this problem?

Upvotes: 1

Views: 112

Answers (1)

Harshit Jindal

Reputation: 621

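select_one() returns None when nothing matches the selector, so calling .text on the result raises the AttributeError. Try wrapping the lookup in try/except: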

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = "https:// " 
req = requests.get(url, headers=headers)
html = req.text       
soup = BeautifulSoup(html, 'html.parser')                
link = soup.select('#flagList > div.clear.ab-webzine > div > a')       
title = soup.select('#flagList > div.clear.ab-webzine > div > div.wz-item-header > a > span')         
latest_link = link[0]['href'] # link of latest post (href assumed to be an absolute URL)
latest_title = title[0].text # title of latest post

# to get the text of latest post
t_url = latest_link
t_req = requests.get(t_url, headers=headers)
t_html = t_req.text
t_soup = BeautifulSoup(t_html, 'html.parser')  
maintext = t_soup.select('#flagArticle > div.rhymix_content.xe_content')
try:
    tag = t_soup.select_one('div.rd.clear > div.rd_body.clear > ul > li > a').text
    print(tag)
except AttributeError:
    # select_one() returned None, so this post has no tag
    print("Sure the tag exists on this page??")

print(maintext)
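
An alternative to try/except, offered only as a rough sketch: check the result of select_one() for None before reading its text. The sample HTML strings and the extract_tag helper below are hypothetical stand-ins, not part of the original code; only the selector is taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample pages standing in for the real post HTML.
WITH_TAG = """
<div class="rd clear"><div class="rd_body clear">
  <ul><li><a href="/posts?search_target=tag&amp;search_keyword=ABC">ABC</a></li></ul>
</div></div>
"""
WITHOUT_TAG = '<div class="rd clear"><div class="rd_body clear"><p>No tags here</p></div></div>'

# Selector copied from the question.
TAG_SELECTOR = 'div.rd.clear > div.rd_body.clear > ul > li > a'


def extract_tag(html, default=None):
    """Return the tag text, or `default` when the post has no tag (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one(TAG_SELECTOR)  # None when nothing matches
    if node is None:
        return default
    return node.get_text(strip=True)


print(extract_tag(WITH_TAG))     # ABC
print(extract_tag(WITHOUT_TAG))  # None
```

With this approach the "no tag" case is an ordinary return value, so the rest of the script (for example print(maintext)) can run without special error handling.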

Upvotes: 1
