python_coder_

Reputation: 103

Creating a Data Frame from the URL of a website using python

I have been working on web scraping with Python, and I want to create a DataFrame from a URL on a website. The file is in .ods format. I have tried downloading the .ods file to the computer using BeautifulSoup and then reading it to create a DataFrame. The file itself contains a header block that has to be removed. This method worked for me, and my code is attached below.

from pandas_ods_reader import read_ods
import bs4            
import requests
import pandas as pd

url = "https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic"
html = requests.get(url)
soup = bs4.BeautifulSoup(html.text, "html.parser")
for link in soup.find_all('a', href=True):
    href = link['href']

    if href.endswith('.ods'):
        #print(href)
        file_data = requests.get(href).content
        with open('data.ods', "wb") as file:
            file.write(file_data)

df = read_ods('data.ods', 1, headers=False)[6:-44]
df.index = range(0, 346)
df.columns = df.iloc[0]
df = df.drop(0)  # drop() returns a copy; reassign or it is a no-op
df

Now I want to figure out whether this can be achieved directly, without downloading the .ods file first. If there is a way to create a DataFrame straight from the .ods file linked on the webpage, that would serve my purpose. Please suggest suitable code if this is achievable.

Upvotes: 1

Views: 669

Answers (2)

I'm not sure why you pick up all anchor tags. I'll assume you want to apply the same logic over multiple pages, where there can be one or more files.

So the right approach is to append the hrefs to a set, where there can't be any duplicate URLs.

Then you can load each URL directly with pandas' read_excel.

I haven't parsed the table by skipping a fixed number of lines from the start and end; that's left for you to write as a function, for example keeping a row only if its first column contains a date.

import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    links = set(x['href'] for x in soup.select('a[href$=".ods"]'))
    for link in links:
        df = pd.read_excel(link)  # reading .ods requires odfpy to be installed
        print(df)
        #df.to_csv('data.csv',index= False)


if __name__ == "__main__":
    main("https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic")
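The row filter hinted at above could look like the following; a minimal sketch (the function name and the sample data are illustrative, not from the real file), keeping only rows whose first column parses as a date instead of hard-coding slice offsets:

```python
import pandas as pd

def keep_date_rows(df):
    # Keep rows whose first column parses as a date; header notes and
    # footer text become NaT under errors="coerce" and are dropped.
    mask = pd.to_datetime(df.iloc[:, 0], errors="coerce").notna()
    return df[mask].reset_index(drop=True)

# Illustrative data mimicking the .ods layout: notes, data rows, a footer.
sample = pd.DataFrame([["Notes", ""],
                       ["2020-03-01", "100%"],
                       ["2020-03-02", "95%"],
                       ["Source: DfT", ""]])
print(keep_date_rows(sample))  # only the two date rows remain
```

This avoids having to recount the header and footer lines whenever the published file changes shape.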

Upvotes: 1

RJ Adriaansen

Reputation: 9619

pandas' read_excel allows you to load DataFrames directly from URLs. You can also use it to load .ods files after installing odfpy (pip install odfpy):

import bs4            
import requests
import pandas as pd

url = "https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic"
html = requests.get(url)
soup = bs4.BeautifulSoup(html.text, "html.parser")
for link in soup.find_all('a', href=True):
    href = link['href']
    #print(href)
    if href.endswith('.ods'):
        print(href)
        df = pd.read_excel(href, header=None)[6:-44]
        df.index = range(0, 346)
        df.columns = df.iloc[0]
        df = df.drop(0)  # drop() returns a copy; reassign or it is a no-op
df
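As an alternative to slicing with [6:-44], read_excel (like read_csv) accepts skiprows and skipfooter, so the trimming can happen at load time. A sketch of the semantics using an in-memory CSV (the line counts here are illustrative, not the real file's):

```python
import io
import pandas as pd

# skiprows/skipfooter are shared by read_csv and read_excel, so the same
# call shape applies to the .ods URL, e.g.
# pd.read_excel(href, skiprows=6, skipfooter=44).
raw = ("note line 1\n"
       "note line 2\n"
       "date,value\n"
       "2020-03-01,100\n"
       "2020-03-02,95\n"
       "footer note\n")
# skipfooter requires the python engine in read_csv.
df = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=1, engine="python")
print(df)  # header taken from the "date,value" line; footer dropped
```

This keeps the parsing logic in one read_excel call instead of a slice plus manual reindexing afterwards.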

Upvotes: 1
