Python BeautifulSoup extraction

Question

I used the following code to access the description shown below.

Here is the code:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.meteoclimatic.net/feed/rss/ESCYL2400000024153A')
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('item')
print(items[0].description)

I have obtained the following XML sample:



     <ul>
<li><img src="http://meteoclimatic.net/img/sem_tpv.png" style="width: 12px; height: 12px; border: 0px;" alt="***" /> <a href="http://www.meteoclimatic.net/perfil/ESCYL2400000024153A">Sta Mar&#237;a del Condado</a></li>
<ul>
<li> Actualizado: 24-07-2018 08:20 UTC</li>
<li>Temperatura: <b>23,6</b> &#186;C (
M&#225;x.: <b style="color: red">23,6</b> /
M&#237;n.: <b style="color: blue">12,1</b> )</li>
<li>Humedad: <b>54,0</b> % (
M&#225;x.: <b style="color: red">91,0</b> /
M&#237;n.: <b style="color: blue">54,0</b> )</li>
<li>Bar&#243;metro: <b>1021,0</b> hPa (
M&#225;x.: <b style="color: red">1021,2</b> /
M&#237;n.: <b style="color: blue">1019,9</b> )</li>
<li>Viento: <b>1,0</b> km/h (
M&#225;x.: <b style="color: red">9,0</b> )</li>
<li>Direcci&#243;n del viento: <b>170</b> - S</li>
<li>Precip.: <b>0,0</b> mm</li>
</ul>
     </ul>

I want to extract the items contained between the labels [[]] and [[]]. How could I do that in a "pythonic" way without having to manually parse every item as a string?

Edit:

The data I want to extract may also be found in this part of the soup:

<ul>
<li><img src="http://meteoclimatic.net/img/sem_tpv.png" style="width: 12px; height: 12px; border: 0px;" alt="***" /> <a href="http://www.meteoclimatic.net/perfil/ESCYL2400000024153A">Sta Mar&#237;a del Condado</a></li>
<ul>
<li> Actualizado: 24-07-2018 08:50 UTC</li>
<li>Temperatura: <b>24,4</b> &#186;C (
M&#225;x.: <b style="color: red">24,5</b> /
M&#237;n.: <b style="color: blue">12,1</b> )</li>
<li>Humedad: <b>49,0</b> % (
M&#225;x.: <b style="color: red">91,0</b> /
M&#237;n.: <b style="color: blue">49,0</b> )</li>
<li>Bar&#243;metro: <b>1021,0</b> hPa (
M&#225;x.: <b style="color: red">1021,2</b> /
M&#237;n.: <b style="color: blue">1019,9</b> )</li>
<li>Viento: <b>5,0</b> km/h (
M&#225;x.: <b style="color: red">10,0</b> )</li>
<li>Direcci&#243;n del viento: <b>219</b> - SW</li>
<li>Precip.: <b>0,0</b> mm</li>
</ul>
     </ul>

Andrej Kesely · Accepted Answer

You can do it with BeautifulSoup, using the Comment object:

import requests
from bs4 import BeautifulSoup, Comment

resp = requests.get('https://www.meteoclimatic.net/feed/rss/ESCYL2400000024153A')
soup = BeautifulSoup(resp.content, 'xml')
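for item in soup.select('item'):
    # Collect every text node in the description that is a Comment
    comments = item.description.find_all(text=lambda text: isinstance(text, Comment))
    # Split the first comment on newlines, drop empty lines, then drop the first and last lines
    print([c for c in comments[0].split('\n') if c][1:-1])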

Prints:

['[[]]']

Edit:

This code iterates over all item tags. In each item's description it finds every text node that is an instance of the Comment object (in other words, anything between <!-- and -->). It then splits the first comment on newlines and prints all lines except the first and last.
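The same steps can be tried offline on a small inline string. This is only a sketch: the SAMPLE markup, its [[BEGIN]]/[[END]] marker lines, and the helper name comment_lines are made up for illustration and are not the real feed contents. It uses the built-in html.parser backend so that lxml is not required, and string=, which is the newer name for the text= argument in Beautiful Soup 4.4 and later.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for one <description>; the real feed content differs.
SAMPLE = """<description>
<ul><li>Temperatura: <b>23,6</b></li></ul>
<!--
[[BEGIN]]
line one
line two
[[END]]
-->
</description>"""


def comment_lines(markup):
    """Return the lines of the first comment, minus its first and last line."""
    soup = BeautifulSoup(markup, "html.parser")
    # Keep only the text nodes that are Comment instances
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    if not comments:
        return []
    # Split on newlines and discard blank lines
    lines = [ln.strip() for ln in comments[0].split("\n") if ln.strip()]
    # Drop the first and last line, as the answer's code does
    return lines[1:-1]


if __name__ == "__main__":
    print(comment_lines(SAMPLE))
```

Returning an empty list when no comment is found avoids the IndexError that comments[0] would raise on an item without a comment.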
