Reputation: 27
I have only been learning Python for two weeks.
I'm scraping an XML file, and one of the elements in the loop (item -> description) has HTML inside it. How can I get the text inside the p tags?
import requests
from bs4 import BeautifulSoup

url = "https://www.milenio.com/rss"
source = requests.get(url)
soup = BeautifulSoup(source.content, features="xml")
items = soup.findAll('item')
for item in items:
    html_text = item.description
    # This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>
The next line works, but it also returns the text of some internal links, external links, and images, which I don't need.
desc=item.description.get_text()
So I tried looping over all the p tags, but that doesn't work:
for p in html_text.find_all('p'):
    print(p)
AttributeError: 'NoneType' object has no attribute 'find_all'
Thank you so much!
Upvotes: 0
Views: 444
Reputation: 8346
The issue is how bs4 handles CDATA sections (the behavior is well documented, but there is no clean built-in fix for it).
You'll need to import CData from bs4, which lets you extract the CDATA content as a string, and parse the feed with html.parser. From there, create a new BeautifulSoup object from that string so you can call findAll on it and iterate over its contents.
from bs4 import BeautifulSoup, CData
import requests

url = "https://www.milenio.com/rss"
source = requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')
items = soup.findAll('item')
for item in items:
    html_text = item.description
    # Find the CDATA node inside <description>
    findCdata = html_text.find(text=lambda tag: isinstance(tag, CData))
    if findCdata is None:
        continue  # no CDATA in this description, nothing to parse
    # Parse the CDATA string as HTML so its <p> tags can be searched
    newSoup = BeautifulSoup(findCdata, 'html.parser')
    paragraphs = newSoup.findAll('p')
    for p in paragraphs:
        print(p.get_text())
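The same idea can be tried offline, without hitting the live feed. The sketch below is only an illustration: SAMPLE_RSS and extract_paragraphs are made-up names, and the sample feed is a hypothetical stand-in for the real RSS response. It uses string=, the newer name for the text= argument.

```python
from bs4 import BeautifulSoup, CData

# Hypothetical sample feed, standing in for the real RSS response.
SAMPLE_RSS = (
    "<rss><channel><item>"
    "<title>Example</title>"
    "<description><![CDATA["
    '<p>Paragraph 1</p><a href="https://example.com">Related link</a>'
    '<img src="x.png"/><p>Paragraph 2</p>'
    "]]></description>"
    "</item></channel></rss>"
)


def extract_paragraphs(rss_text):
    """Return the text of every <p> inside each item's CDATA description."""
    soup = BeautifulSoup(rss_text, "html.parser")
    paragraphs = []
    for item in soup.find_all("item"):
        description = item.find("description")
        if description is None:
            continue  # item without a description
        # Pick out the CDATA node among the description's strings
        cdata = description.find(string=lambda node: isinstance(node, CData))
        if cdata is None:
            continue  # description without CDATA
        # Re-parse the CDATA string as HTML and keep only the <p> text
        inner = BeautifulSoup(str(cdata), "html.parser")
        paragraphs.extend(p.get_text() for p in inner.find_all("p"))
    return paragraphs


print(extract_paragraphs(SAMPLE_RSS))
```

Only the paragraph text is collected here; the link and image that sit outside the p tags are skipped.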
Edit:
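OP needed to extract the link text and found that this was only possible inside the item loop using link = item.link.nextSibling, because the link content was jumping outside of its tag like so: </link>http://www...
In the XML tree view, this particular XML doc showed a drop-down for the link element, which is likely the cause. Another probable factor is that html.parser treats <link> as a void HTML element, so the tag is closed immediately and the URL becomes its next sibling.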
To get content from other tags in the document that don't show a drop-down in the XML tree view and don't contain nested CDATA, write the tag name in lowercase and read the text as usual:
item.pubdate.get_text() # Gets the contents of the tag <pubDate>
item.author.get_text() # Gets the contents of the tag <author>
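Both points from the edit can be checked on a small inline sample. This is a rough sketch, assuming the feed is parsed with html.parser as above; SAMPLE_ITEM and its values are hypothetical and not taken from the real feed.

```python
from bs4 import BeautifulSoup

# Hypothetical <item>, standing in for one entry of the real feed.
SAMPLE_ITEM = (
    "<item>"
    "<link>https://example.com/story</link>"
    "<pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>"
    "<author>Jane Doe</author>"
    "</item>"
)

soup = BeautifulSoup(SAMPLE_ITEM, "html.parser")
item = soup.find("item")

# With html.parser the <link> tag itself ends up empty, and the URL
# is the text node that immediately follows it.
link_text = str(item.link.next_sibling).strip()

# html.parser lowercases tag names, so <pubDate> is reached as pubdate.
pub_date = item.pubdate.get_text()
author = item.author.get_text()

print(link_text, pub_date, author, sep="\n")
```

next_sibling is the current spelling of nextSibling; both refer to the same node.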
Upvotes: 1
Reputation: 1886
It should look like this:
for item in items:
    html_text = item.description  # ??
    # !! don't use html_text.find_all !!
    for p in item.find_all('p'):
        print(p)
Upvotes: 0