Reputation: 147
I am trying to scrape specific lines from a page (URL in the code below) and build a table from the collected data, but I cannot get anything more granular than the entire body text, so I am stuck.
For example:
I would like to arrive at the table below, built from details in the body content. All the details are there; any help on how to retrieve them in that form would be much appreciated.
My code is:
import requests
from bs4 import BeautifulSoup

# providing url
url = 'https://www.polskawliczbach.pl/wies_Baniocha'
# sending the request and keeping the response
req = requests.get(url)
# creating soup object (explicit built-in parser)
data = BeautifulSoup(req.text, 'html.parser')
# finding all li tags in the body and printing the text within them
data1 = data.find('body')
for li in data1.find_all("li"):
    print(li.text, end=" ")
Upvotes: 1
Views: 83
Reputation: 3987
First find the ul, then find the li elements inside it. Scrape the data you need, store it in a variable, and build the table with pandas. After that, save the table to a csv file if you want to keep it; otherwise just print it.
Here's the code implementing all of the above:
from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get('https://www.polskawliczbach.pl/wies_Baniocha')
soup = BeautifulSoup(page.content, 'lxml')
# Second "list-group row" list, without its first and last items
lis = soup.find_all("ul", class_="list-group row")[1].find_all("li")[1:-1]
dic = {"name": [], "value": []}
for li in lis:
    try:
        # Label: the text node directly inside <li>; value: the <span> text
        name = li.find(string=True, recursive=False).strip()
        value = li.find("span").text.replace(" ", "")
    except AttributeError:
        # Item has no direct text or no <span>; skip it
        continue
    # Append both together so the two columns stay the same length
    dic["name"].append(name)
    dic["value"].append(value)
    print(name, value)
df = pd.DataFrame(dic)
print(df)
# If you want to save this as a file, uncomment the following line:
# df.to_csv("<FILENAME>.csv")
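To try the label/value split without hitting the site, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the page's markup (the real structure may differ), and `extract_pairs` is a hypothetical helper name. It uses the built-in `html.parser`, so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumed stand-in for one list on the page,
# with a header item first and a footer item last.
SAMPLE_HTML = """
<ul class="list-group row">
  <li class="list-group-item">Header</li>
  <li class="list-group-item">Kod pocztowy <span>05-532</span></li>
  <li class="list-group-item">Numer kierunkowy <span>(+48) 22</span></li>
  <li class="list-group-item">Footer</li>
</ul>
"""


def extract_pairs(html):
    """Return (label, value) tuples for list items that contain a <span>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Drop the first and last <li>, as in the answer's code
    for li in soup.select("ul.list-group li")[1:-1]:
        span = li.find("span")
        # First text node that is a direct child of <li>, i.e. the label
        label = li.find(string=True, recursive=False)
        if span is None or label is None:
            continue  # skip items that do not follow the label/span pattern
        pairs.append((label.strip(), span.get_text(strip=True)))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
print(pairs)
```

Checking for `None` explicitly plays the same role as the try/except above, but makes it clearer which items get skipped and why.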
Additionally, if you want to scrape all of the "categories": I don't understand the language, so I don't know which ones are useful, but here's the code anyway. Just replace this part of the code above:
soup = BeautifulSoup(page.content, 'lxml')
dic = {"name": [], "value": []}
uls = soup.find_all("ul", class_="list-group row")
for ul in uls:
    # Skip the first and last item of every list
    for li in ul.find_all("li")[1:-1]:
        try:
            name = li.find(string=True, recursive=False).strip()
            value = li.find("span").text.replace(" ", "").replace(",", "")
        except AttributeError:
            # Skip items without direct text or without a <span>
            continue
        print(name, "\t", value)
        dic["name"].append(name)
        dic["value"].append(value)
df = pd.DataFrame(dic)
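Since this version merges every list into one table, it may help to record which list each row came from. The following is only a sketch under assumptions: the HTML is invented for illustration, and `scrape_with_group` and the `group` key are hypothetical names, not anything the site or the code above defines.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two lists; the real page may be structured differently.
SAMPLE_HTML = """
<ul class="list-group row">
  <li>Header A</li>
  <li>Kod pocztowy <span>05-532</span></li>
  <li>Footer A</li>
</ul>
<ul class="list-group row">
  <li>Header B</li>
  <li>Numer kierunkowy <span>(+48) 22</span></li>
  <li>No span here</li>
  <li>Footer B</li>
</ul>
"""


def scrape_with_group(html):
    """Return one dict per item, tagged with the index of its <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for group, ul in enumerate(soup.select("ul.list-group.row")):
        # Skip the first and last item of every list
        for li in ul.find_all("li")[1:-1]:
            label = li.find(string=True, recursive=False)
            span = li.find("span")
            if label is None or span is None:
                continue  # item does not follow the label/span pattern
            rows.append({
                "group": group,
                "name": label.strip(),
                "value": span.get_text(strip=True),
            })
    return rows


rows = scrape_with_group(SAMPLE_HTML)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` gives one row per item with a `group` column that can be used for filtering.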
Upvotes: 1
Reputation: 3400
Find the main ul tag by its class, then find all the li tags inside it:
# `data` is the BeautifulSoup object from the question
main_data = data.find("ul", class_="list-group").find_all("li")[1:-1]
names = []
values = []
main_values = []
for i in main_data:
    values.append(i.find("span").get_text())
    names.append(i.find(string=True, recursive=False))
# One row holding all the values
main_values.append(values)
For the table representation, use the pandas module:
import pandas as pd
df=pd.DataFrame(columns=names,data=main_values)
df
Output:
   Liczba mieszkańców (2011)  Kod pocztowy  Numer kierunkowy
0                      1 935        05-532          (+48) 22
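The population value comes out as the string "1 935", with a space as thousands separator. If a numeric column is wanted, a small helper along these lines could normalize it. This is a sketch, not part of the answer's code: `to_int_if_numeric` is a hypothetical name, and it assumes the separator is some kind of whitespace (regular or non-breaking).

```python
def to_int_if_numeric(text):
    """Return an int when the text is digits separated only by whitespace.

    Anything else (postal codes, dialing prefixes) is returned as a
    stripped string, unchanged otherwise.
    """
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the non-breaking space often used as a thousands separator.
    compact = "".join(text.split())
    return int(compact) if compact.isdecimal() else text.strip()


print(to_int_if_numeric("1 935"))
print(to_int_if_numeric("05-532"))
```

It can be applied while collecting values, for example `values.append(to_int_if_numeric(i.find("span").get_text()))`.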
Upvotes: 1