Reputation: 43
since a lot of time I´m trying to find a solution. I wand to get a table from a website and put this table into a pandas DataFrame. I already got the header (thead) of the table. But I´m lost with the tbody respective the tr and td tags.
I think that for the dataframe I first need a dictionary of lists: dict{'state1':['value1',value2'...], 'state2':['value1','value2'..], ... }
and combine this later with the headers into a DataFrame.
Now to the import: First I import the packages:
from bs4 import BeautifulSoup
import requests
import pandas as pd
after that I make parse the url and define the body which I need:
url='https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html'
html=requests.get(url).text
soup=BeautifulSoup(html,'html.parser')
content=soup.find(attrs={'id':'wrapperDivisions-2'})
table=content.table
body=table.tbody
(I think I could also write body=content.table.body)
After that I tried two different findings.
finding no 1:
key=[]
values=[]
parent=body.find_all(name='tr')
child=body.find_all(name='tr')[0].find_all(name='td')
for i in range(1,len(parent)):
for j in range(1,len(child)+1):
if j==1:
key.append(body.select_one('#main > div.text > table > tbody > tr:nth-child({}) > td:nth-child({})'
.format(i,j)).text)
else:
values.append(body.select_one('#main > div.text > table > tbody > tr:nth-child({}) > td:nth-child({})'
.format(i,j)).text)
Output:
--key--
['Baden-Württem\xadberg', 'Bayern', 'Berlin', 'Branden\xadburg', 'Bremen', 'Hamburg', 'Hessen', 'Meck\xadlenburg-\nVor\xadpommern', 'Nieder\xadsachsen', 'Nord\xadrhein-West\xadfalen', 'Rhein\xadland-Pfalz', 'Saarland', 'Sachsen', 'Sachsen-Anhalt', 'Schles\xadwig-Holstein', 'Thüringen']
--values--
['312.399', '874', '5.054', '46', '7.985', '430.386', '950', '7.243', '55', '12.233', '127.483', '296', '1.973', '54', '2.766', '75.337', '292', '1.592', '63', '2.948', '17.584', '66', '490', '72', '329', '50.673', '162', '1.176', '64', '1.235', '185.434', '681', '3.789', '60', '5.751', '23.712', '205', '1.038', '65', '714', '160.402', '749', '4.817', '60', '4.204', '524.022', '1.838', '10.926', '61', '12.800', '100.633', '279', '2.124', '52', '3.031', '28.369', '134', '606', '61', '852', '191.563', '514', '2.696', '66', '7.529', '59.049', '232', '1.787', '81', '2.339', '41.468', '258', '1.454', '50', '1.245', '74.304', '477', '2.530', '119', '2.779']
The solution looks good, but I only get two lists. One list with the state and one list with the values. Here I have my problems to get this into the desired format as a dictionary.
finding no. 2:
key=[]
values=[]
for i in range(1,len(parent)):
x=body.select('#main > div.text > table > tbody > tr:nth-child({})'.format(i))
print('---')
for j in x[0].find_all(name='td'):
key.append()
print(j.text)
print('+++')
The output looks like this:
---
Baden-Württemberg
+++
312.399
+++
874
+++
5.054
+++
46
+++
7.985
+++
---
Bayern
+++
430.386
+++
950
+++
7.243
+++
55
+++
12.233
+++
---
Here it seems to be exactly what I want. But my problem is, that I can not assign it to the key/values. So I can not get it into the format for build a dictionary and read it into a dataframe. I hope that it is no problem that the page is written in german.
I will be happy about every hint for a clean solution. Many thanks in advance.
Upvotes: 1
Views: 42
Reputation: 45372
pandas has a read_html
method for extracting data from a table :
import pandas as pd
import requests
r = requests.get("https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html")
data = pd.read_html(r.text)
print(data)
Upvotes: 2