Oscar

Reputation: 13

Issue with web scraping using Python and bs4

I am trying to scrape weather data from the web to learn the basics of scraping, and I have run into a problem with the HTML structure of the website.

I have debugged the nested structure of the HTML page, and I can get the first value by printing d["precip"]. However, I don't know why the lookup fails in the later iterations, even though print(i) shows that the loop itself still runs normally.

Result of the first iteration:

{'date': '19:30', 'hourly-date': 'Thu', 
'hidden-cell-sm description': 'Mostly Cloudy', 
'temp': '26°', 'feels': '30°', 'precip': '15%', 
'humidity': '84%', 'wind': 'SSE 12 km/h '}

Result of every iteration after the first:

{'date': 'None', 'hourly-date': 'None', 
'hidden-cell-sm description': 'None', 
'temp': 'None', 'feels': 'None', 'precip': 'None', 
'humidity': 'None', 'wind': 'None'}

HTML side: the value "10" and the "%" are what I want to scrape. I got them in the first iteration, but I don't know why the result turns to None from the second iteration onward.

<td class="precip" headers="precip" data-track-string="ls_hourly_ls_hourly_toggle" classname="precip">
   <div><span class="icon icon-font iconset-weather-data icon-drop-1" classname="icon icon-font iconset-weather-data icon-drop-1"></span>
      <span class="">
        <span>
          10
          <span class="Percentage__percentSymbol__2Q_AR">
            %
          </span>
        </span> 
      </span>
   </div>
</td>

Python code:

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
container = []
#all = soup.find("div", {"class": "locations-title hourly-page-title"}).find("h1").text
table = soup.find_all("table", {"class": "twc-table"})
for items in table:
    for i in range(len(items.find_all("tr")) - 1):
        d = {}
        try:
            d["date"] = items.find_all("span", {"class": "dsx-date"})[i].text
            d["hourly-date"] = items.find_all("div", {"class": "hourly-date"})[i].text
            d["hidden-cell-sm description"] = items.find_all("td", {"class": "hidden-cell-sm description"})[i].text
            d["temp"] = items.find_all("td", {"class": "temp"})[i].text
            d["feels"] = items.find_all("td", {"class": "feels"})[i].text

            #issue starts from here
            inclass = items.find_all("td", {"class": "precip"})[i]
            realtext = inclass.find_all("div", "")[i]
            d["precip"] = realtext.find_all("span", {"class": ""})[i].text
            #issue end

            d["humidity"] = items.find_all("td", {"class": "humidity"})[i].text
            d["wind"] = items.find_all("td", {"class": "wind"})[i].text
            
        except:
            d["date"] = "None"
            d["hourly-date"] = "None"
            d["hidden-cell-sm description"] = "None"
            d["temp"] = "None"
            d["precip"] = "None"
            d["feels"] = "None"
            d["precip"] = "None"
            d["humidity"] = "None"
            d["wind"] = "None"
            
        total.append(d)
        
df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind'])

I expected to scrape all the data, but the "precip" values are missing while the others are still there. For more information, here are the results:

     Date weekdays    Description temp feels  percip humidity          wind
0   19:30      Thu  Mostly Cloudy  26°   30°     NaN      84%  SSE 12 km/h 
1   20:00      Thu  Mostly Cloudy  26°   30°     NaN      86%  SSE 11 km/h 
2   21:00      Thu  Mostly Cloudy  26°   30°     NaN      86%  SSE 12 km/h 
3   22:00      Thu  Mostly Cloudy  26°   29°     NaN      86%  SSE 12 km/h 
4   23:00      Thu         Cloudy  26°   29°     NaN      87%  SSE 12 km/h 
5   00:00      Fri         Cloudy  26°   29°     NaN      87%    S 12 km/h 
6   01:00      Fri         Cloudy  26°   26°     NaN      88%    S 12 km/h 
7   02:00      Fri         Cloudy  26°   26°     NaN      87%    S 12 km/h 
8   03:00      Fri         Cloudy  29°   35°     NaN      87%    S 12 km/h 
9   04:00      Fri  Mostly Cloudy  29°   35°     NaN      87%    S 12 km/h 
10  05:00      Fri  Mostly Cloudy  28°   35°     NaN      87%  SSW 11 km/h 
11  06:00      Fri  Mostly Cloudy  28°   34°     NaN      88%  SSW 11 km/h 
12  07:00      Fri  Mostly Cloudy  29°   35°     NaN      87%  SSW 10 km/h 
13  08:00      Fri  Mostly Cloudy  29°   36°     NaN      84%  SSW 12 km/h 
14  09:00      Fri  Mostly Cloudy  29°   37°     NaN      82%  SSW 13 km/h 
15  10:00      Fri  Partly Cloudy  30°   37°     NaN      81%  SSW 14 km/h 

I'm a newbie and willing to learn, so please tell me how my code structure can be improved. Thanks a lot.

Upvotes: 1

Views: 100

Answers (2)

bharatk

Reputation: 4315

find_all() always returns a list, and strip() removes whitespace at the beginning and end of a string. Also, the label percip in df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind']) is wrong, because the key you define in the dictionary is d["precip"].
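
To illustrate the first point: because find_all() returns a list, reusing the row counter as an index inside a single cell runs out of range on the second row, and a bare except then hides the IndexError. The sketch below is only an illustration; the two-row table is a simplified stand-in for the real page, not its actual markup.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real table (assumption: one precip cell per row).
html = """
<table class="twc-table"><tbody>
<tr><td class="precip"><div><span class="">15%</span></div></td></tr>
<tr><td class="precip"><div><span class="">10%</span></div></td></tr>
</tbody></table>
"""
table = BeautifulSoup(html, "html.parser").find("table", class_="twc-table")

cells = table.find_all("td", class_="precip")  # list with one entry per row
row = 1                                        # second row
divs_in_cell = cells[row].find_all("div")      # list with a single div

try:
    divs_in_cell[row]      # row counter reused as index: out of range
    error = None
except IndexError as exc:
    error = exc

# Inside one cell, take the first match instead of indexing by row.
value = cells[row].find("div").get_text(strip=True)

print(type(error).__name__, value)
```

Catching only the exceptions you expect (for example IndexError or AttributeError) instead of using a bare except makes this kind of failure visible much sooner.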

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
container = []
tables = soup.find_all("table", {"class": "twc-table"})
for table in tables:
    for tr in table.find("tbody").find_all("tr"):
        # Default every field to "None"; each key is listed exactly once.
        d = {"date":"None","hourly-date":"None","hidden-cell-sm description":"None","temp":"None",
             "feels":"None","precip":"None","humidity":"None","wind":"None"}

        for td in tr.find_all("td"):
            try:
                _class = td.get("class")
                if len(_class) > 1:
                    temp = 0
                    for cc in _class:
                        if "cell-hide" in cc:
                            temp+=1
                            break
                    if temp > 0:
                        continue

                if len(_class)>1 and  "description" in _class[1]:
                    d["hidden-cell-sm description"] = td.text.strip()

                elif "temp" in _class[0]:
                    d["temp"] = td.text.strip()

                elif "feels" in _class[0]:
                    d["feels"] = td.text.strip()

                elif "precip" in _class[0]:
                    d["precip"] = td.text.strip()

                elif "humidity" in _class[0]:
                    d["humidity"] = td.text.strip()

                elif "wind" in _class[0]:
                    d["wind"] = td.text.strip()

                else:
                    d["date"] = td.find("span", {"class": "dsx-date"}).text.strip()
                    d["hourly-date"] = td.find("div", {"class": "hourly-date"}).text.strip()
            except:
                pass

        total.append(d)

df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'precip', 'humidity', 'wind'])
print(df)

Output:

     Date weekdays    Description temp feels precip humidity         wind
0   20:30      Thu  Mostly Cloudy  26°   30°    10%      85%  SSE 12 km/h
1   21:00      Thu  Mostly Cloudy  26°   30°     5%      85%  SSE 12 km/h
2   22:00      Thu  Mostly Cloudy  26°   30°     0%      85%  SSE 12 km/h
3   23:00      Thu  Mostly Cloudy  26°   29°     0%      87%  SSE 12 km/h
4   00:00      Fri         Cloudy  26°   29°     0%      87%    S 12 km/h
5   01:00      Fri         Cloudy  26°   26°     5%      88%    S 12 km/h
6   02:00      Fri         Cloudy  26°   26°    15%      88%    S 12 km/h
7   03:00      Fri  Mostly Cloudy  25°   25°    20%      88%    S 10 km/h
8   04:00      Fri  Mostly Cloudy  25°   29°    25%      88%    S 10 km/h
9   05:00      Fri  Mostly Cloudy  25°   28°    25%      88%  SSW 10 km/h
10  06:00      Fri  Mostly Cloudy  25°   28°    25%      89%  SSW 10 km/h
11  07:00      Fri  Mostly Cloudy  26°   29°    25%      88%  SSW 10 km/h
12  08:00      Fri  Mostly Cloudy  26°   29°    25%      84%  SSW 11 km/h
13  09:00      Fri  Partly Cloudy  27°   30°    25%      82%  SSW 12 km/h
14  10:00      Fri  Partly Cloudy  27°   30°    25%      81%  SSW 14 km/h
15  11:00      Fri  Partly Cloudy  27°   31°    15%      78%  SSW 15 km/h
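
The percip/precip mismatch can be reproduced in isolation. This is a minimal sketch, with two made-up rows standing in for the scraped data: reindex() does not raise on an unknown column label, it just creates a new column filled with NaN.

```python
import pandas as pd

# Illustrative rows only (assumption), not real scraped values.
rows = [
    {"date": "19:30", "precip": "15%"},
    {"date": "20:00", "precip": "10%"},
]
df = pd.DataFrame(rows)

# Misspelled label: "percip" does not exist, so the column is all NaN.
wrong = df.reindex(columns=["date", "percip"])

# Correct label: the original values are kept.
right = df.reindex(columns=["date", "precip"])

print(wrong)
print(right)
```

If the goal is only to reorder existing columns, selecting them with df[["date", "precip"]] would raise a KeyError on a misspelled name, which makes such typos easier to catch.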

Upvotes: 0

SIM

Reputation: 22440

Your precip lookup finds nothing, and that is what your result shows. To get around this issue, you can select the span with the class Percentage__percentSymbol__2Q_AR and then take its previous_sibling to extract the required content. The code below shows only the portion you were having trouble with.

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
soup = BeautifulSoup(page.text, "html.parser")
total = []
for tr in soup.find("table",class_="twc-table").tbody.find_all("tr"):
    d = {}
    d["date"] = tr.find("span", class_="dsx-date").text
    d["precip"] = tr.find("span", class_="Percentage__percentSymbol__2Q_AR").previous_sibling
    total.append(d)

df = pandas.DataFrame(total,columns=['date','precip'])
print(df)
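
For reference, here is a self-contained sketch of the previous_sibling approach, run against a trimmed copy of the HTML from the question rather than the live page (the live markup and the generated class name may have changed since). The get_text(strip=True) line is an added alternative that does not depend on that class name; it is not part of the code above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <td> shown in the question.
html = """
<td class="precip" headers="precip">
   <div><span class="icon icon-drop-1"></span>
      <span class="">
        <span>
          10
          <span class="Percentage__percentSymbol__2Q_AR">
            %
          </span>
        </span>
      </span>
   </div>
</td>
"""
cell = BeautifulSoup(html, "html.parser").find("td", class_="precip")

# The number is the text node right before the percent-symbol span.
symbol = cell.find("span", class_="Percentage__percentSymbol__2Q_AR")
precip_number = symbol.previous_sibling.strip()

# Alternative: collect all text in the cell, stripping each piece.
precip_text = cell.get_text(strip=True)

print(precip_number, precip_text)
```

Note that previous_sibling returns the raw text node including its surrounding whitespace, which is why strip() is applied here.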

Upvotes: 1
