Reputation: 29
I am trying to scrape data from a website:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.mohfw.gov.in/"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'html.parser')
#print(soup)
cases_div = soup.find('div', id='cases')  # renamed from "id" to avoid shadowing the builtin
table_body = cases_div.find('tbody')
table_rows = table_body.find_all('tr')
sl_no = []
States = []
Cases = []
Recovered = []
Deaths = []
I am trying to loop over the table rows and append each cell to the empty lists above, but I am getting an error:
for tr in table_rows:
    td = tr.find_all('td')
    sl_no.append(td[0].text)
    States.append(td[1].text)
    Cases.append(td[2].text)
    Recovered.append(td[3].text)
    Deaths.append(td[-1].text)
headers = ['sl_no', 'States', 'Cases', 'Recovered', 'Deaths']
df = pd.DataFrame(list(zip(sl_no, States, Cases, Recovered, Deaths)), columns=headers)
df1 = df.drop(index=27)
This is the error I get:
States.append(td[1].text)
IndexError: list index out of range
Upvotes: 0
Views: 58
Reputation: 138
It seems that one of the <tr> elements does not contain all the <td>s you expected. From a quick look at the data itself, the last <tr> of that table contains some kind of summary for all the states. In that case you should probably cut the last <tr> off your for loop:

for tr in table_rows[:-1]:

Or wrap the loop body in a try/except:
for tr in table_rows:
    try:
        td = tr.find_all('td')
        sl_no.append(td[0].text)
        States.append(td[1].text)
        Cases.append(td[2].text)
        Recovered.append(td[3].text)
        Deaths.append(td[-1].text)
    except Exception as e:
        # Pass or handle the exception as you wish.
        pass
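Note that a bare except Exception will also hide unrelated bugs (a typo, an attribute error mid-parse), so if you go this route it is safer to catch only IndexError. A minimal sketch, using made-up rows standing in for the per-row td lists:

```python
# Hypothetical rows standing in for the per-<tr> td lists: full rows
# have five cells, the trailing summary row has only one.
rows = [
    ["1", "Kerala", "202", "19", "1"],
    ["2", "Delhi", "87", "6", "2"],
    ["Total: 289"],  # summary row that triggers the IndexError
]

states = []
for td in rows:
    try:
        states.append(td[1])
    except IndexError:
        # Only the missing-cell case is skipped; other errors still surface.
        pass

print(states)  # ['Kerala', 'Delhi']
```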
Upvotes: 0
Reputation: 862651
You can test the lengths of the td lists; the problem is that the last one has length 1, so an error is raised when you select the second value of the list with td[1]:
for tr in table_rows:
    td = tr.find_all('td')
    print(len(td))
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
4
1
So your solution should be changed to filter the rows, keeping only those with five td values:
for tr in table_rows:
    td = tr.find_all('td')
    if len(td) == 5:
        sl_no.append(td[0].text)
        States.append(td[1].text)
        Cases.append(td[2].text)
        Recovered.append(td[3].text)
        Deaths.append(td[-1].text)
headers = ['sl_no', 'States', 'Cases', 'Recovered', 'Deaths']
df = pd.DataFrame(list(zip(sl_no, States, Cases, Recovered, Deaths)), columns=headers)
print(df)
sl_no States Cases Recovered Deaths
0 1 Andhra Pradesh 23 1 0
1 2 Andaman and Nicobar Islands 9 0 0
2 3 Bihar 15 0 1
3 4 Chandigarh 8 0 0
4 5 Chhattisgarh 7 0 0
5 6 Delhi 87 6 2
6 7 Goa 5 0 0
7 8 Gujarat 69 1 6
8 9 Haryana 36 18 0
9 10 Himachal Pradesh 3 0 1
10 11 Jammu and Kashmir 48 2 2
11 12 Karnataka 83 5 3
12 13 Kerala 202 19 1
13 14 Ladakh 13 3 0
14 15 Madhya Pradesh 47 0 3
15 16 Maharashtra 198 25 8
16 17 Manipur 1 0 0
17 18 Mizoram 1 0 0
18 19 Odisha 3 0 0
19 20 Puducherry 1 0 0
20 21 Punjab 38 1 1
21 22 Rajasthan 59 3 0
22 23 Tamil Nadu 67 4 1
23 24 Telengana 71 1 1
24 25 Uttarakhand 7 2 0
25 26 Uttar Pradesh 82 11 0
26 27 West Bengal 22 0 2
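As a side note, the same length check also lets you drop the five parallel lists: collect each complete row as one list and pass them straight to the DataFrame constructor. A sketch with made-up cell values standing in for the scraped td texts:

```python
import pandas as pd

# Hypothetical cell values standing in for the scraped <td> texts.
rows = [
    ["1", "Kerala", "202", "19", "1"],
    ["2", "Delhi", "87", "6", "2"],
    ["Total"],  # summary row, removed by the length check
]

headers = ['sl_no', 'States', 'Cases', 'Recovered', 'Deaths']
df = pd.DataFrame([r for r in rows if len(r) == 5], columns=headers)
print(df)
```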
I think you can simplify your code with read_html:
url = "https://www.mohfw.gov.in/"
df = pd.read_html(url)[-1]
And then remove last 2 rows:
df = df.iloc[:-2]
print(df)
S. No. Name of State / UT Total Confirmed cases * \
0 1 Andhra Pradesh 23
1 2 Andaman and Nicobar Islands 9
2 3 Bihar 15
3 4 Chandigarh 8
4 5 Chhattisgarh 7
5 6 Delhi 87
6 7 Goa 5
7 8 Gujarat 69
8 9 Haryana 36
9 10 Himachal Pradesh 3
10 11 Jammu and Kashmir 48
11 12 Karnataka 83
12 13 Kerala 202
13 14 Ladakh 13
14 15 Madhya Pradesh 47
15 16 Maharashtra 198
16 17 Manipur 1
17 18 Mizoram 1
18 19 Odisha 3
19 20 Puducherry 1
20 21 Punjab 38
21 22 Rajasthan 59
22 23 Tamil Nadu 67
23 24 Telengana 71
24 25 Uttarakhand 7
25 26 Uttar Pradesh 82
26 27 West Bengal 22
Cured/Discharged/Migrated Death
0 1 0
1 0 0
2 0 1
3 0 0
4 0 0
5 6 2
6 0 0
7 1 6
8 18 0
9 0 1
10 2 2
11 5 3
12 19 1
13 3 0
14 0 3
15 25 8
16 0 0
17 0 0
18 0 0
19 0 0
20 1 1
21 3 0
22 4 1
23 1 1
24 2 0
25 11 0
26 0 2
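One caveat (my assumption, not something the answer above states): because the page mixes summary text into the table, read_html may leave the count columns as strings, and pd.to_numeric restores numeric dtypes after the trailing rows are dropped. A sketch with a made-up frame shaped like the read_html result:

```python
import pandas as pd

# Hypothetical frame shaped like the read_html output after df.iloc[:-2]:
# the counts arrive as strings when the column also held summary text.
df = pd.DataFrame({
    'Name of State / UT': ['Kerala', 'Delhi'],
    'Total Confirmed cases *': ['202', '87'],
})

df['Total Confirmed cases *'] = pd.to_numeric(df['Total Confirmed cases *'])
print(df['Total Confirmed cases *'].sum())  # 289
```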
Upvotes: 1