Jaroslaw

Reputation: 147

Issue with scraping in Python

I am trying to scrape specific lines from a page (URL in the code below) and build a table from the collected data, but I cannot get anything more specific than the entire body text, so I am stuck.

To give an example:

I would like to arrive at the table below by scraping the details from the body content. All the details are there; any help on how to retrieve them in the form shown below would be much appreciated.

[Image: the desired table]

My code is:

import requests
from bs4 import BeautifulSoup
# providing url
url = 'https://www.polskawliczbach.pl/wies_Baniocha'

# creating request object
req = requests.get(url)

# creating soup object
data = BeautifulSoup(req.text, 'html')

# finding all li tags in ul and printing the text within it
data1 = data.find('body')
for li in data1.find_all("li"):
   print(li.text, end=" ")


Upvotes: 1

Views: 83

Answers (2)

imxitiz

Reputation: 3987

First find the target ul, then find the li elements inside it. Extract the needed data, store it in a variable, and build the table with pandas. After that, either save the table to a CSV file or just print it.

Here is the code implementing the steps above:

from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get('https://www.polskawliczbach.pl/wies_Baniocha')
soup = BeautifulSoup(page.content, 'lxml')

# Second "list-group row" list on the page; skip its first and last items.
lis = soup.find_all("ul", class_="list-group row")[1].find_all("li")[1:-1]
dic = {"name": [], "value": []}
for li in lis:
    try:
        # The label is the li's own direct text; the value sits in its <span>.
        name = li.find(string=True, recursive=False).strip()
        value = li.find("span").text.replace(" ", "")
    except AttributeError:
        # No direct text or no <span>: skip, so both lists keep the same length.
        continue
    dic["name"].append(name)
    dic["value"].append(value)
    print(name, value)

df = pd.DataFrame(dic)

print(df)
# To save the table to a file, uncomment the following line:
# df.to_csv("<FILENAME>.csv")
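
To try the extraction step without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML markup and the extract_pairs helper are made up for illustration; the markup only imitates the structure the code above relies on (a label as the direct text of each li and the value in a span), so the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT copied from the real page: each <li> holds
# a label as direct text and the value inside a <span>.
SAMPLE_HTML = """
<ul class="list-group row">
  <li>Header item</li>
  <li>Kod pocztowy <span>05-532</span></li>
  <li>Numer kierunkowy <span>(+48) 22</span></li>
  <li>Footer item</li>
</ul>
"""


def extract_pairs(html):
    """Return (label, value) tuples for every <li> that has both parts."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for ul in soup.find_all("ul", class_="list-group"):
        for li in ul.find_all("li"):
            span = li.find("span")
            # string=True with recursive=False picks the li's own text node.
            label = li.find(string=True, recursive=False)
            if span is None or label is None:
                continue  # not a label/value item, skip it
            pairs.append((label.strip(), span.get_text(strip=True)))
    return pairs


print(extract_pairs(SAMPLE_HTML))
```

Checking for None explicitly instead of slicing with [1:-1] is just one possible design choice; it skips items without a span no matter where they appear in the list.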

Additionally, if you want to scrape all of the "categories": I don't understand the language, so I can't tell which ones are useful and which are not, but here is the code anyway. Just replace the corresponding part of the code above with:

soup = BeautifulSoup(page.content, 'lxml')

dic = {"name": [], "value": []}
uls = soup.find_all("ul", class_="list-group row")
for ul in uls:
    # Skip the first and last item of every list.
    for li in ul.find_all("li")[1:-1]:
        try:
            name = li.find(string=True, recursive=False).strip()
            value = li.find("span").text.replace(" ", "").replace(",", "")
        except AttributeError:
            # Item without direct text or a <span>: skip it.
            continue
        print(name, "\t", value)
        dic["name"].append(name)
        dic["value"].append(value)

df = pd.DataFrame(dic)
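
If the cleaned strings should end up as real numbers, a small helper along these lines might work. This is only a sketch: to_number is a hypothetical name, and it assumes spaces (regular or non-breaking) are thousands separators and a comma is a decimal separator, which I have not verified against the page. Anything that does not parse, such as a postal code, is returned as text.

```python
def to_number(text):
    """Convert a scraped string to int or float; return stripped text otherwise."""
    # str.split() with no arguments also splits on non-breaking spaces.
    compact = "".join(text.split())
    # Assumption: a comma is a decimal separator, not a thousands separator.
    candidate = compact.replace(",", ".")
    try:
        return int(candidate)
    except ValueError:
        pass
    try:
        return float(candidate)
    except ValueError:
        # Not numeric (e.g. a postal code): keep the original text.
        return text.strip()


for raw in ["1 935", "12,5", "05-532", "(+48) 22"]:
    print(repr(raw), "->", repr(to_number(raw)))
```

Note that this differs from the code above, which removes commas entirely; which of the two is right depends on how the page formats its numbers.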

Upvotes: 1

Bhavya Parikh

Reputation: 3400

Find the main ul tag by its class, then find all li tags inside it:

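# `data` is the BeautifulSoup object created in the question's code
main_data = data.find("ul", class_="list-group").find_all("li")[1:-1]
names = []
values = []
main_values = []
for i in main_data:
    values.append(i.find("span").get_text())
    names.append(i.find(string=True, recursive=False))
# Wrap the values in a list so they form a single table row
main_values.append(values)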

To display the result as a table, use the pandas module:

import pandas as pd
df=pd.DataFrame(columns=names,data=main_values)
df

Output:

Liczba mieszkańców (2011)   Kod pocztowy    Numer kierunkowy
 0  1 935                  05-532           (+48) 22
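
For comparison, here is a small sketch of the two table layouts you can build from the same scraped lists. The names and values are copied from the output above rather than scraped, and the variable names wide and long are just illustrative.

```python
import pandas as pd

# Labels and values taken from the output shown above (not scraped here).
names = ["Liczba mieszkańców (2011)", "Kod pocztowy", "Numer kierunkowy"]
values = ["1 935", "05-532", "(+48) 22"]

# Wide layout: a single row with one column per label.
wide = pd.DataFrame(data=[values], columns=names)

# Long layout: one row per label/value pair.
long = pd.DataFrame({"name": names, "value": values})

print(wide)
print(long)
```

The wide layout is handy when every scraped page becomes one row of a bigger table; the long layout is usually easier when the set of labels differs from page to page.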

Upvotes: 1
