mant0

Reputation: 33

Web scraping a specific row from a table

I am struggling to scrape a specific row from this website.

First of all, the table element has no class, but I think I have a workaround for that.

My problem is that I want to print (or store in a variable, or otherwise access) the data of a specific row, say the row whose first value is "Bollate".

So I coded:

import requests
import bs4

URL = "http://www.centrometeolombardo.com/content.asp?CatId=332&ContentType=Dati"

response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, "lxml")

table = soup.find(text="Bollate").find_parent("table")

for a in table:
    if a.text == "Bollate":
        for val in a.parent.find_next_siblings():
            print(val.text)

But I keep getting:

Traceback (most recent call last):
  File "/home/pi/Documents/Python/ngu.py", line 12, in <module>
    if a.text == "Bollate":
  File "/usr/lib/python3/dist-packages/bs4/element.py", line 370, in __getattr__
    self.__class__.__name__, attr))
AttributeError: 'NavigableString' object has no attribute 'text'

This tells me I am doing something wrong, since I am getting an object that has no text attribute, but I do not know how to overcome the problem.

Thanks all

Upvotes: 3

Views: 1489

Answers (2)

QHarr

Reputation: 84465

You can isolate the row with :contains and :has, which together ensure that the tr contains a b tag with that text. You also need to target the right nested table, e.g. with :nth-child.

import requests
from bs4 import BeautifulSoup

url = 'http://www.centrometeolombardo.com/content.asp?CatId=332&ContentType=Dati'
page_source = requests.get(url).text
soup = BeautifulSoup(page_source, 'lxml')

# Pick the nested table by position, then keep only the row whose b tag contains "Bollate"
cells = soup.select('div:nth-child(5) table:nth-child(3) tr:has(b:contains("Bollate")) td')
print([td.get_text(strip=True) for td in cells])

Thanks to @SIM for pointing out that one can avoid hardcoding an index by using the following pattern instead:

soup.select("table > tr:has(> td > a:contains('Bollate')) td")
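To see how the :has pattern behaves without hitting the live site, here is a minimal sketch on an inline snippet. The HTML, including the "Arese" row and its values, is a made-up stand-in and not the real page markup. It also assumes a reasonably recent soupsieve, where :-soup-contains is the non-deprecated spelling of :contains, and it uses the built-in html.parser so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one row per station.
HTML = """
<table>
  <tr><td><a href="#">Arese</a></td><td>0.1</td><td>11.8</td></tr>
  <tr><td><a href="#">Bollate</a></td><td>-0.3</td><td>12.3</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# :has(> td > a:-soup-contains(...)) keeps only the rows that have a cell
# whose link text contains "Bollate"; the trailing "td" returns that row's cells.
cells = soup.select('table > tr:has(> td > a:-soup-contains("Bollate")) td')
row = [td.get_text(strip=True) for td in cells]
print(row)
```

Because select only ever returns Tag objects, this approach also sidesteps the NavigableString error from the question, which comes from iterating over the table's children directly.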

Upvotes: 3

baduker

Reputation: 20042

You can use pandas to grab the HTML and parse the table. Then just select the value you need.

Here's how:

import pandas as pd

url = "http://www.centrometeolombardo.com/content.asp?CatId=332&ContentType=Dati"
df = pd.read_html(url, flavor="bs4")[19]
print(df.loc[df[0] == "Bollate"])

Output:

         0     1     2      3  4  5
2  Bollate  -0.3  12.3  Brina  -  -
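The index 19 is specific to the page as it was when this was written, so it may shift if the layout changes. As a rough sketch of the same selection step on self-contained input, the snippet below parses a small inline table. The HTML and the "Arese" row are made up for illustration, and it assumes a parser backend for read_html (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for one of the page's tables (no header row).
HTML = """<table>
  <tr><td>Arese</td><td>0.1</td><td>11.8</td><td>Brina</td></tr>
  <tr><td>Bollate</td><td>-0.3</td><td>12.3</td><td>Brina</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(HTML))
df = tables[0]

# Without a header row the columns are numbered 0, 1, 2, ...,
# so column 0 holds the station name.
row = df.loc[df[0] == "Bollate"]
print(row)
```

If the position of the table is unknown, one option is to loop over the list returned by read_html and keep the first DataFrame whose column 0 contains the name you are looking for.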

Upvotes: 3
