Sebastian.W89
Sebastian.W89

Reputation: 43

Python scraping from table

since a lot of time I´m trying to find a solution. I wand to get a table from a website and put this table into a pandas DataFrame. I already got the header (thead) of the table. But I´m lost with the tbody respective the tr and td tags.

I think that for the dataframe I first need a dictionary of lists: dict{'state1':['value1',value2'...], 'state2':['value1','value2'..], ... }

and combine this later with the headers into a DataFrame.

Now to the import: First I import the packages:

from bs4 import BeautifulSoup
import requests
import pandas as pd

after that I make parse the url and define the body which I need:

url='https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html'
html=requests.get(url).text
soup=BeautifulSoup(html,'html.parser')

content=soup.find(attrs={'id':'wrapperDivisions-2'})
table=content.table
body=table.tbody

(I think I could also write body=content.table.body)

After that I tried two different findings.

finding no 1:

key=[]
values=[]
parent=body.find_all(name='tr')
child=body.find_all(name='tr')[0].find_all(name='td')



for i in range(1,len(parent)):
   
    for j in range(1,len(child)+1):
        if j==1:
            key.append(body.select_one('#main > div.text > table > tbody > tr:nth-child({}) > td:nth-child({})'
                                       .format(i,j)).text)
        else:
            values.append(body.select_one('#main > div.text > table > tbody > tr:nth-child({}) > td:nth-child({})'
                                          .format(i,j)).text)
  

Output:

--key--
['Baden-Württem\xadberg', 'Bayern', 'Berlin', 'Branden\xadburg', 'Bremen', 'Hamburg', 'Hessen', 'Meck\xadlenburg-\nVor\xadpommern', 'Nieder\xadsachsen', 'Nord\xadrhein-West\xadfalen', 'Rhein\xadland-Pfalz', 'Saarland', 'Sachsen', 'Sachsen-Anhalt', 'Schles\xadwig-Holstein', 'Thüringen']
--values--
['312.399', '874', '5.054', '46', '7.985', '430.386', '950', '7.243', '55', '12.233', '127.483', '296', '1.973', '54', '2.766', '75.337', '292', '1.592', '63', '2.948', '17.584', '66', '490', '72', '329', '50.673', '162', '1.176', '64', '1.235', '185.434', '681', '3.789', '60', '5.751', '23.712', '205', '1.038', '65', '714', '160.402', '749', '4.817', '60', '4.204', '524.022', '1.838', '10.926', '61', '12.800', '100.633', '279', '2.124', '52', '3.031', '28.369', '134', '606', '61', '852', '191.563', '514', '2.696', '66', '7.529', '59.049', '232', '1.787', '81', '2.339', '41.468', '258', '1.454', '50', '1.245', '74.304', '477', '2.530', '119', '2.779']

The solution looks good, but I only get two lists. One list with the state and one list with the values. Here I have my problems to get this into the desired format as a dictionary.

finding no. 2:

   key=[]
    values=[]
    for i in range(1,len(parent)):
    x=body.select('#main > div.text > table > tbody > tr:nth-child({})'.format(i))
    print('---')
    for j in x[0].find_all(name='td'):
    key.append()
    print(j.text)
    print('+++')      

The output looks like this:

---
Baden-Württem­berg
+++
312.399
+++
874
+++
5.054
+++
46
+++
7.985
+++
---
Bayern
+++
430.386
+++
950
+++
7.243
+++
55
+++
12.233
+++
---

Here it seems to be exactly what I want. But my problem is, that I can not assign it to the key/values. So I can not get it into the format for build a dictionary and read it into a dataframe. I hope that it is no problem that the page is written in german.

I will be happy about every hint for a clean solution. Many thanks in advance.

Upvotes: 1

Views: 42

Answers (1)

Bertrand Martel
Bertrand Martel

Reputation: 45372

pandas has a read_html method for extracting data from a table :

import pandas as pd
import requests

r = requests.get("https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html")
data = pd.read_html(r.text)
print(data)

Upvotes: 2

Related Questions