Legan

Reputation: 39

Python iteration over table with beautifulsoup gives only first column

Thanks to many posts on Stack Overflow I've found several ways to get closer to a solution, but I always run into the same problem: I only get the first column of a table.

Goal: This URL here has only one table, which I'd like to scrape

Here is my code:

import bs4 as bs
import requests

# 1. get the html doc
source = requests.get("https://www.placeholder.com").text

# 2. get the BeautifulSoup object
soup = bs.BeautifulSoup(source, 'lxml')

# 3. find the table
find_class = soup.table
tbody_1 = find_class.tbody


val_dict = {}
n = 1
m = 1
for row in tbody_1.find_all('tr'):
    for col in row.find_all('td'):
        if col == "Tag":
            print(col)
            print(n)
            print(m)
            print("Tags will be passed")
            pass
        else:
            if n < 13:
                value_list = []
                value_list.append(col)
                print(col)
                print(n)
                print(m)
                val_dict[m] = value_list
                n = n+1
                # m = m
            else:
                value_list = []                
                value_list.append(col)
                print(col)
                print(n)
                print(m)
                val_dict[m+1] = value_list
                n = 1
                m = m+1

This gave me the problem of tags: the values come out as whole tag elements rather than their text.

Using:

value_list.append(col.select('span')[0].get_text())

led to the problem of the first item: only the first item of each row was used.

Inspired by, among others, the answers in this link:

for row in table.find_all('tr'):
    for col in row.find_all('td'):

I'll edit the post if anything is needed in addition to what I've provided.

Upvotes: 2

Views: 197

Answers (1)

baduker

Reputation: 20042

If all you want is the table, then I'd recommend exploring pandas and making your (scraping) life easier.

Here's how:

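from io import StringIO

import pandas as pd
import requests

source_url = "https://www.placeholder.com"
page = requests.get(source_url).text
# read_html returns a list of DataFrames, one per <table> found on the page;
# wrapping the text in StringIO avoids the literal-HTML deprecation warning in newer pandas
tables = pd.read_html(StringIO(page), flavor="bs4")
pd.concat(tables).to_csv("demographischer_statistik.csv", index=False)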

This writes the table to a .csv file.

And if you're into so-called one-liners the above code can effectively be reduced to:

import pandas as pd
import requests

pd.concat(pd.read_html(requests.get("https://placeholder.com").text, flavor="bs4")).to_csv("demographischer_statistik.csv", index=False)

But that's not too readable, if you ask me. ;)
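
If you'd rather stay with BeautifulSoup, here's a minimal sketch of collecting every cell of every row. The sample HTML and the `table_rows` helper are made up for illustration, since the real page's markup isn't shown here; the idea is just to build one list per row instead of starting a fresh list for each cell.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td><span>2019</span></td><td><span>10</span></td><td><span>20</span></td></tr>
    <tr><td><span>2020</span></td><td><span>11</span></td><td><span>21</span></td></tr>
  </tbody>
</table>
"""


def table_rows(html):
    """Return one list of cell texts per table row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # get_text() on the whole cell picks up every span inside it,
        # not just the first one.
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
print(rows)
```

Each inner list is one row, so writing the result out with the `csv` module or handing it to `pd.DataFrame(rows)` should be straightforward from there.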

Upvotes: 3
