Reputation: 39
Thanks to many posts on Stack Overflow, I found several ways to get closer to a solution, but I always run into the same problem: I only get the first column of a table.
Goal: This URL here has only one table, which I'd like to scrape.
Here is my code:
import bs4 as bs
import requests

# 1. get the html doc (requests needs the URL scheme)
source = requests.get("https://www.placeholder.com").text
# 2. get the BeautifulSoup object
soup = bs.BeautifulSoup(source, 'lxml')
# 3. find the table
find_class = soup.table
tbody_1 = find_class.tbody
val_dict = {}
n = 1
m = 1
for row in tbody_1.find_all('tr'):
    for col in row.find_all('td'):
        if col == "Tag":
            print(col)
            print(n)
            print(m)
            print("Tags will be passed")
            pass
        else:
            if n < 13:
                value_list = []
                value_list.append(col)
                print(col)
                print(n)
                print(m)
                val_dict[m] = value_list
                n = n+1
                # m = m
            else:
                value_list = []
                value_list.append(col)
                print(col)
                print(n)
                print(m)
                val_dict[m+1] = value_list
                n = 1
                m = m+1
This gave me the following problem: the stored values are whole tags (with their HTML markup) instead of the cell text.
Using:
value_list.append(col.select('span')[0].get_text())
led to a different problem: only the first item of each row was used.
Inspired by, among others, the answers at this link:
for row in table.find_all('tr'):
    for col in row.find_all('td'):
I'll edit the post if anything else is needed in addition to what I've provided.
Upvotes: 2
Views: 197
Reputation: 20042
If all you want is the table, then I'd recommend exploring pandas and making your (scraping) life easier.
Here's how:
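from io import StringIO
import pandas as pd
import requests
source_url = "https://www.placeholder.com"
page = requests.get(source_url).text
# read_html returns a list of DataFrames, one per table found on the page;
# wrapping the HTML in StringIO avoids the deprecation warning in recent pandas
tables = pd.read_html(StringIO(page), flavor="bs4")
pd.concat(tables).to_csv("demographischer_statistik.csv", index=False)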
This writes the table to a .csv file.
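One detail worth knowing: pd.read_html returns a list of DataFrames, one per table it finds, which is why the snippet above wraps the result in pd.concat. Here is a minimal sketch on an inline, made-up table (the column names and values are placeholders, not the real page), assuming pandas and at least one HTML parser backend (lxml, or bs4 with html5lib) are installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page; the real table will differ.
html = """
<table>
  <thead><tr><th>Tag</th><th>A</th><th>B</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>10</td><td>20</td></tr>
    <tr><td>2</td><td>30</td><td>40</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> element.
tables = pd.read_html(StringIO(html))
table = tables[0]

print(len(tables))          # number of tables found
print(table.shape)          # (rows, columns) of the first table
print(list(table.columns))  # header cells become column names
```

If the page really has only one table, tables[0] would do the same job as pd.concat here.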
And if you're into so-called one-liners, the above code can effectively be reduced to:
import pandas as pd
import requests
pd.concat(pd.read_html(requests.get("https://placeholder.com").text, flavor="bs4")).to_csv("demographischer_statistik.csv", index=False)
But that's not too readable, if you ask me. ;)
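If you'd rather stay with BeautifulSoup, the key is probably to build one list per row instead of creating a fresh list for every cell. Here is a minimal sketch on an inline, made-up table (the HTML, including the span wrappers, is an assumption about the real page's markup), using the standard-library html.parser backend:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; values are placeholders.
html = """
<table>
  <tbody>
    <tr><td><span>Tag</span></td><td><span>A</span></td><td><span>B</span></td></tr>
    <tr><td><span>1</span></td><td><span>10</span></td><td><span>20</span></td></tr>
    <tr><td><span>2</span></td><td><span>30</span></td><td><span>40</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.table.find_all("tr"):
    # Collect the text of every cell in this row, not just the first one.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip rows without <td> cells
        rows.append(cells)

# Treat the first row as the header and the rest as data.
header, *data = rows
print(header)
print(data)
```

Calling get_text() on each td, rather than on select('span')[0] once per row, is what keeps every column instead of only the first.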
Upvotes: 3