Alex Güemez

Reputation: 27

Parsing the html of the child element [BeautifulSoup]

I have only been learning Python for two weeks.

I'm scraping an XML file, and one of the elements in the loop (item -> description) contains HTML. How can I get the text inside the p tags?

import requests
from bs4 import BeautifulSoup

url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")

items=soup.findAll('item')

for item in items:
  html_text=item.description
  # This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>

This next line could work, but the description also contains internal links, external links, and images, which I don't need.

desc=item.description.get_text()

So I tried looping over all the p tags, but it doesn't work.

for p in html_text.find_all('p'):
  print(p)

AttributeError: 'NoneType' object has no attribute 'find_all'

Thank you so much!

Upvotes: 0

Views: 444

Answers (2)

JHeth

Reputation: 8346

The issue is how bs4 processes CDATA sections (the behavior is fairly well documented, but there isn't a clean fix for it).

You'll need to import CData from bs4, which lets you extract the CDATA section as a string, and parse the feed with html.parser. From there, create a new bs4 object from that string so you can call findAll on it and iterate over its contents.

from bs4 import BeautifulSoup, CData
import requests

url="https://www.milenio.com/rss"
source=requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')

items=soup.findAll('item')

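for item in items:
  html_text = item.description
  # Find the CDATA section inside <description>
  findCdata = html_text.find(string=lambda s: isinstance(s, CData))
  # Parse the HTML held in the CDATA string as its own document
  newSoup = BeautifulSoup(findCdata, 'html.parser')
  paragraphs = newSoup.findAll('p')
  for p in paragraphs:
    # Prints only the paragraph text, skipping links and images outside the p tags
    print(p.get_text())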
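
The same approach can be tried offline. This is only a minimal sketch, assuming the feed wraps the description HTML in a CDATA section; SAMPLE_RSS and paragraphs_from_items are made-up names for illustration and are not part of the real feed or of bs4:

```python
from bs4 import BeautifulSoup, CData

# Hypothetical stand-in for the real feed, so no network access is needed.
SAMPLE_RSS = """<rss><channel>
<item>
<title>Example</title>
<description><![CDATA[<p>Paragraph 1</p> <a href="https://example.com">link</a> <p>Paragraph 2</p>]]></description>
</item>
</channel></rss>"""


def paragraphs_from_items(xml_text):
    """Return one list of paragraph texts per <item> in the feed."""
    soup = BeautifulSoup(xml_text, "html.parser")
    result = []
    for item in soup.find_all("item"):
        # Look for the CDATA node inside <description>.
        cdata = item.description.find(string=lambda s: isinstance(s, CData))
        if cdata is None:
            # Skip items whose description has no CDATA section.
            continue
        # Re-parse the CDATA string as HTML and keep only the <p> text.
        inner = BeautifulSoup(cdata, "html.parser")
        result.append([p.get_text() for p in inner.find_all("p")])
    return result


paragraphs = paragraphs_from_items(SAMPLE_RSS)
print(paragraphs)
```

The None check is there because find returns None when a description has no CDATA section, and passing None to BeautifulSoup would fail.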

Edit: OP also needed to extract the link text and found this was only possible inside the item loop using link = item.link.nextSibling, because the link content was ending up outside of its tag, like so: </link>http://www.... In XML tree view, this particular XML document showed a dropdown for the link element, which is likely the cause.

To get content from other tags in the document that don't show a dropdown in XML tree view and don't contain nested CDATA, write the tag name in lowercase and get the text as usual:

item.pubdate.get_text() # Gets the contents of the tag <pubDate>
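
One possible explanation for the nextSibling behavior, offered here as an assumption rather than something confirmed above: the html.parser builder in bs4 treats link as an HTML void element, so the tag is closed immediately and the URL ends up as a sibling text node. A small sketch with a made-up item and placeholder URL:

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed; the URL is a placeholder, not taken from the real feed.
SAMPLE_ITEM = "<item><title>Example</title><link>https://example.com/story</link></item>"

soup = BeautifulSoup(SAMPLE_ITEM, "html.parser")
item = soup.find("item")

# The <link> tag itself is parsed as empty, so it holds no text...
link_text_inside = item.link.get_text()
# ...and the URL is the text node that comes right after it.
link_url = str(item.link.next_sibling).strip()

print(repr(link_text_inside), link_url)
```

next_sibling is the snake_case spelling of nextSibling; both refer to the same attribute.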
item.author.get_text() # Gets contents of the tag <author>
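
The lowercase requirement can be checked on a small sample. This sketch assumes the document is parsed with html.parser, as in the code above; the item contents are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical item with a mixed-case tag name, as RSS feeds commonly use.
SAMPLE_ITEM = (
    "<item><pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>"
    "<author>Jane Doe</author></item>"
)

soup = BeautifulSoup(SAMPLE_ITEM, "html.parser")
item = soup.find("item")

# html.parser lowercases tag names, so the lowercase attribute finds the tag.
pub_date = item.pubdate.get_text()
author = item.author.get_text()

# The original mixed-case name matches nothing and gives None.
camel_case_lookup = item.pubDate

print(pub_date, author, camel_case_lookup)
```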

Upvotes: 1

Wonka

Reputation: 1886

It should look like this:

for item in items:
    html_text=item.description #??

    #!! don't use html_text.find_all !!
    for p in item.find_all('p'):
        print(p)

Upvotes: 0
