How to parse data using BeautifulSoup4?

Question

The following is a sample from the .xml file:

    Kaufsignal für Marriott International
    https://insideparadeplatz.ch/2015/03/06/kaufsignal-fuer-marriott-international/
    Fri, 06 Mar 2015 
    
        
                Mit Marken wie Bulgari, Ritz-Carlton, Marriott und weiteren ist Marriott International nach sämtlichen Kriterien, die vom 
                Obermatt-System bewertet werden, ein interessantes Investment. Der Titel ist relativ gesehen günstig, das Unternehmen sollte weiter überproportional wachsen, und es ist solide finanziert, mit einem guten Verhältnis von Eigenkapital und Schulden. Über alle Kategorien gesehen landet die 
                Marriott-Aktie, die derzeit an der Technologiebörse Nasdaq bei rund 84 Dollar gehandelt wird, in der Wochenauswertung im Total-Ranking auf dem ersten Platz.

                ]]>

Using beautifulsoup4, I'm able to extract 'title', 'link' and 'pubDate'. The problem is 'content:encoded': I want to extract the 'img' elements from 'content:encoded' for my 'img_list'. I've tried many solutions, but all I get is None.

# soup is assumed to be the BeautifulSoup object built from the .xml file
title = []
link = []
date = []
img_list = []
for item in soup.find_all('item'):
    for t in item.find_all('title'):
        title.append(t.text)
for item in soup.find_all('item'):
    for l in item.find_all('link'):
        link.append(l.text)
for item in soup.find_all('item'):
    for d in item.find_all('pubDate'):
        date.append(d.text)
for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        data.text

I tried:

for item in soup.find_all('item'):
    for data in item.find_all('content:encoded'):
        for img in data.find_all('img'):
            img_list.append(img.text)

but got nothing. What am I missing here?

JeffCharter · Accepted Answer

I think you're going to have trouble getting that img data out directly.

for item in soup.find("content:encoded"):
    print(item)
    print(type(item))

Then see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring

So bs4 treats the content as a string (a NavigableString) rather than as nested tags, and you will need to parse it manually or feed that string into a new bs4 object.
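
The second option could look roughly like the sketch below. It assumes you already have the text of 'content:encoded' as a plain Python string (for example from data.text in the question's loop); the helper name extract_img_srcs and the sample markup are made up for illustration and are not taken from the feed. Note that an img tag has no text of its own, so the sketch reads the src attribute instead of img.text.

```python
from bs4 import BeautifulSoup


def extract_img_srcs(encoded_html):
    """Parse an HTML fragment held in a string and return its image URLs.

    encoded_html is assumed to be the string content of one
    'content:encoded' element (hypothetical helper, not part of bs4).
    """
    # Feed the string into a fresh BeautifulSoup object so that the
    # markup inside it becomes real tags instead of one long string.
    inner = BeautifulSoup(encoded_html, "html.parser")
    # src=True skips any <img> that has no src attribute.
    return [img["src"] for img in inner.find_all("img", src=True)]


# Made-up fragment standing in for the content of 'content:encoded'.
sample = (
    '<p><img src="https://example.com/a.png" alt="chart"> Some text</p>'
    '<p><img src="https://example.com/b.jpg"></p>'
)

img_list = extract_img_srcs(sample)
print(img_list)
```

In the question's loop, this would mean calling img_list.extend(extract_img_srcs(data.text)) for each 'content:encoded' element, assuming data.text returns the embedded markup as a string.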
