Reputation: 13
I am trying to scrape weather data from the web to learn the basics of scraping, and I have run into a problem with the structure of the site's HTML.
While debugging the nested structure of the page, I was able to get the first value by printing d["precip"], but I don't know why the lookup fails on the following iterations. The loop itself keeps running: print(i) shows the index advancing normally.
Result of first loop:
{'date': '19:30', 'hourly-date': 'Thu',
'hidden-cell-sm description': 'Mostly Cloudy',
'temp': '26°', 'feels': '30°', 'precip': '15%',
'humidity': '84%', 'wind': 'SSE 12 km/h '}
After the first loop:
{'date': 'None', 'hourly-date': 'None',
'hidden-cell-sm description': 'None',
'temp': 'None', 'feels': 'None', 'precip': 'None',
'humidity': 'None', 'wind': 'None'}
HTML side: the value "10" and the "%" are what I want to scrape. I got them in the first iteration, but I don't know why the result turns to None from the second iteration on.
<td class="precip" headers="precip" data-track-string="ls_hourly_ls_hourly_toggle" classname="precip">
  <div><span class="icon icon-font iconset-weather-data icon-drop-1" classname="icon icon-font iconset-weather-data icon-drop-1"></span>
    <span class="">
      <span>
        10
        <span class="Percentage__percentSymbol__2Q_AR">
          %
        </span>
      </span>
    </span>
  </div>
</td>
Python code:
import requests
import pandas
from bs4 import BeautifulSoup
page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
container = []
#all = soup.find("div", {"class": "locations-title hourly-page-title"}).find("h1").text
table = soup.find_all("table", {"class": "twc-table"})
for items in table:
    for i in range(len(items.find_all("tr")) - 1):
        d = {}
        try:
            d["date"] = items.find_all("span", {"class": "dsx-date"})[i].text
            d["hourly-date"] = items.find_all("div", {"class": "hourly-date"})[i].text
            d["hidden-cell-sm description"] = items.find_all("td", {"class": "hidden-cell-sm description"})[i].text
            d["temp"] = items.find_all("td", {"class": "temp"})[i].text
            d["feels"] = items.find_all("td", {"class": "feels"})[i].text
            #issue starts from here
            inclass = items.find_all("td", {"class": "precip"})[i]
            realtext = inclass.find_all("div", "")[i]
            d["precip"] = realtext.find_all("span", {"class": ""})[i].text
            #issue end
            d["humidity"] = items.find_all("td", {"class": "humidity"})[i].text
            d["wind"] = items.find_all("td", {"class": "wind"})[i].text
        except:
            d["date"] = "None"
            d["hourly-date"] = "None"
            d["hidden-cell-sm description"] = "None"
            d["temp"] = "None"
            d["precip"] = "None"
            d["feels"] = "None"
            d["precip"] = "None"
            d["humidity"] = "None"
            d["wind"] = "None"
        total.append(d)
df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind'])
I expected to scrape all the data, but as shown above, "precip" is missing while the other columns are still there. For more information, here are the results:
    Date   weekdays  Description    temp  feels  percip  humidity  wind
0   19:30  Thu       Mostly Cloudy  26°   30°    NaN     84%       SSE 12 km/h
1   20:00  Thu       Mostly Cloudy  26°   30°    NaN     86%       SSE 11 km/h
2   21:00  Thu       Mostly Cloudy  26°   30°    NaN     86%       SSE 12 km/h
3   22:00  Thu       Mostly Cloudy  26°   29°    NaN     86%       SSE 12 km/h
4   23:00  Thu       Cloudy         26°   29°    NaN     87%       SSE 12 km/h
5   00:00  Fri       Cloudy         26°   29°    NaN     87%       S 12 km/h
6   01:00  Fri       Cloudy         26°   26°    NaN     88%       S 12 km/h
7   02:00  Fri       Cloudy         26°   26°    NaN     87%       S 12 km/h
8   03:00  Fri       Cloudy         29°   35°    NaN     87%       S 12 km/h
9   04:00  Fri       Mostly Cloudy  29°   35°    NaN     87%       S 12 km/h
10  05:00  Fri       Mostly Cloudy  28°   35°    NaN     87%       SSW 11 km/h
11  06:00  Fri       Mostly Cloudy  28°   34°    NaN     88%       SSW 11 km/h
12  07:00  Fri       Mostly Cloudy  29°   35°    NaN     87%       SSW 10 km/h
13  08:00  Fri       Mostly Cloudy  29°   36°    NaN     84%       SSW 12 km/h
14  09:00  Fri       Mostly Cloudy  29°   37°    NaN     82%       SSW 13 km/h
15  10:00  Fri       Partly Cloudy  30°   37°    NaN     81%       SSW 14 km/h
I am a newbie and willing to learn, so please tell me how my code structure can be improved. Thanks a lot.
Upvotes: 1
Views: 100
Reputation: 4315
find_all always returns a list, and strip() removes whitespace from the beginning and end of a string. Also, the label percip is misspelled in
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'percip', 'humidity', 'wind'])
The key you define in the dictionary is d["precip"], so reindex finds no column with that name and fills it with NaN.
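As a small illustration of the first point, here is a sketch on a made-up, simplified table (the markup and the "pct" class are assumptions, not the real page). Because find_all returns a list scoped to the element it is called on, reusing the row index inside a single cell is one likely reason the lookup raises IndexError after the first row, which a bare except then hides.

```python
from bs4 import BeautifulSoup

# Two rows, one "precip" cell each; simplified stand-in for the real markup.
html = """
<table>
  <tr><td class="precip"> <div><span class=""><span>10<span class="pct">%</span></span></span></div> </td></tr>
  <tr><td class="precip"> <div><span class=""><span>5<span class="pct">%</span></span></span></div> </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

cells = soup.find_all("td", {"class": "precip"})  # list: one entry per row
second_cell = cells[1]
divs = second_cell.find_all("div")                # list scoped to this one cell
print(len(cells), len(divs))

try:
    divs[1]  # reusing the row index inside the cell
except IndexError:
    print("IndexError: this cell contains only one <div>")

# Reading the cell's own text and trimming the surrounding whitespace
precip = second_cell.text.strip()
print(precip)
```

The script that follows avoids the problem by iterating over each row's own cells instead of indexing by row number.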
import requests
import pandas
from bs4 import BeautifulSoup
page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
content = page.content
soup = BeautifulSoup(content, "html.parser")
total = []
container = []
tables = soup.find_all("table", {"class": "twc-table"})
for table in tables:
    for tr in table.find("tbody").find_all("tr"):
        d = {"date":"None","hourly-date":"None","hidden-cell-sm description":"None","temp":"None",
             "feels":"None","precip":"None","humidity":"None","wind":"None"}
        for td in tr.find_all("td"):
            try:
                _class = td.get("class")
                if len(_class) > 1:
                    temp = 0
                    for cc in _class:
                        if "cell-hide" in cc:
                            temp+=1
                            break
                    if temp > 0:
                        continue
                if len(_class)>1 and "description" in _class[1]:
                    d["hidden-cell-sm description"] = td.text.strip()
                elif _class[0] in "temp":
                    d["temp"] = td.text.strip()
                elif "feels" in _class[0]:
                    d["feels"] = td.text.strip()
                elif "precip" in _class[0]:
                    d["precip"] = td.text.strip()
                elif "humidity" in _class[0]:
                    d["humidity"] = td.text.strip()
                elif "wind" in _class[0]:
                    d["wind"] = td.text.strip()
                else:
                    d["date"] = td.find("span", {"class": "dsx-date"}).text.strip()
                    d["hourly-date"] = td.find("div", {"class": "hourly-date"}).text.strip()
            except:
                pass
        total.append(d)
df = pandas.DataFrame(total)
df = df.rename(index=str, columns={"date": "Date", "hourly-date": "weekdays", "hidden-cell-sm description": "Description"})
df = df.reindex(columns=['Date', 'weekdays', 'Description', 'temp', 'feels', 'precip', 'humidity', 'wind'])
print(df)
Output:
    Date   weekdays  Description    temp  feels  precip  humidity  wind
0   20:30  Thu       Mostly Cloudy  26°   30°    10%     85%       SSE 12 km/h
1   21:00  Thu       Mostly Cloudy  26°   30°    5%      85%       SSE 12 km/h
2   22:00  Thu       Mostly Cloudy  26°   30°    0%      85%       SSE 12 km/h
3   23:00  Thu       Mostly Cloudy  26°   29°    0%      87%       SSE 12 km/h
4   00:00  Fri       Cloudy         26°   29°    0%      87%       S 12 km/h
5   01:00  Fri       Cloudy         26°   26°    5%      88%       S 12 km/h
6   02:00  Fri       Cloudy         26°   26°    15%     88%       S 12 km/h
7   03:00  Fri       Mostly Cloudy  25°   25°    20%     88%       S 10 km/h
8   04:00  Fri       Mostly Cloudy  25°   29°    25%     88%       S 10 km/h
9   05:00  Fri       Mostly Cloudy  25°   28°    25%     88%       SSW 10 km/h
10  06:00  Fri       Mostly Cloudy  25°   28°    25%     89%       SSW 10 km/h
11  07:00  Fri       Mostly Cloudy  26°   29°    25%     88%       SSW 10 km/h
12  08:00  Fri       Mostly Cloudy  26°   29°    25%     84%       SSW 11 km/h
13  09:00  Fri       Partly Cloudy  27°   30°    25%     82%       SSW 12 km/h
14  10:00  Fri       Partly Cloudy  27°   30°    25%     81%       SSW 14 km/h
15  11:00  Fri       Partly Cloudy  27°   31°    15%     78%       SSW 15 km/h
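To show the effect of the misspelled label in isolation, here is a minimal sketch with made-up rows (the values are placeholders, not scraped data). Asking reindex for a column that does not exist does not raise an error; pandas simply creates that column and fills it with NaN.

```python
import pandas

# Placeholder rows using the same key the scraper stores: "precip"
rows = [{"temp": "26°", "precip": "10%"}, {"temp": "26°", "precip": "5%"}]
df = pandas.DataFrame(rows)

# "percip" matches no existing column, so the new column is all NaN
misspelled = df.reindex(columns=["temp", "percip"])
# "precip" matches the dictionary key, so the values are kept
correct = df.reindex(columns=["temp", "precip"])

print(misspelled)
print(correct)
```

Printing df.columns before the reindex call is a quick way to catch this kind of mismatch.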
Upvotes: 0
Reputation: 22440
Your precip lookup finds nothing, and that is what your result shows. To get around this, you can select the span with the class Percentage__percentSymbol__2Q_AR and then take its previous_sibling to extract the required content. Below I show only the portion you were having trouble with.
import requests
import pandas
from bs4 import BeautifulSoup
page = requests.get("https://weather.com/en-IN/weather/hourbyhour/l/0fcc6b573ec19677819071ea104e175b6dfc8f942f59554bc99d10c5cd0dbfe8")
soup = BeautifulSoup(page.text, "html.parser")
total = []
for tr in soup.find("table",class_="twc-table").tbody.find_all("tr"):
    d = {}
    d["date"] = tr.find("span", class_="dsx-date").text
    d["precip"] = tr.find("span", class_="Percentage__percentSymbol__2Q_AR").previous_sibling
    total.append(d)
df = pandas.DataFrame(total,columns=['date','precip'])
print(df)
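For anyone who wants to try the previous_sibling idea without hitting the live site, here is a rough offline sketch run against a trimmed copy of the cell markup from the question. The helper name precip_value is my own, and the None check and strip() are additions on my part: they assume some rows might lack the symbol span and that the text node carries surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <td> from the question (attributes shortened)
html = """
<td class="precip">
<div><span class="icon"></span>
<span class="">
<span>
10
<span class="Percentage__percentSymbol__2Q_AR">
%
</span>
</span>
</span>
</div>
</td>
"""
soup = BeautifulSoup(html, "html.parser")


def precip_value(symbol_tag):
    """Return the text just before the percent symbol, or None if it is missing."""
    if symbol_tag is None or symbol_tag.previous_sibling is None:
        return None
    # previous_sibling is a text node here; str() + strip() gives a plain string
    return str(symbol_tag.previous_sibling).strip()


symbol = soup.find("span", class_="Percentage__percentSymbol__2Q_AR")
value = precip_value(symbol)
print(value)
```

Class names such as Percentage__percentSymbol__2Q_AR look auto-generated, so they may change when the site is rebuilt; treat the selector as fragile.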
Upvotes: 1