jack ryan

Reputation: 59

Create a Dataframe from HTML

I am trying to read a table from a web page. My company has strict authentication policies that restrict how we can scrape data, so this is the code I am using to fetch the page:

from urllib.request import urlopen
from requests_kerberos import HTTPKerberosAuth, OPTIONAL
import os
import lxml.html as LH
import requests
import pandas as pd

# Point requests at the CA bundle (raw string, so single backslashes are enough)
cert = r"C:\Users\name\Desktop\cacert.pem"
os.environ["REQUESTS_CA_BUNDLE"] = cert
kerberos = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
session = requests.Session()

link = 'weblink'
# Note: verify=False turns off certificate verification for this request
data = session.get(link, auth=kerberos, verify=False).content.decode("latin-1")

That leaves me with the entire HTML of the web page in "data". How do I convert it into a DataFrame?

Note: I can't provide the actual link due to privacy concerns. I was just wondering whether there is a general way to tackle this situation.

Upvotes: 1

Views: 704

Answers (1)

caxcaxcoatl

Reputation: 8970

It looks like you're looking for something like this, using Beautiful Soup?

From there, you'll still have to create the DataFrame yourself, but you will have completed the step of converting the HTML into a data structure (that is, reading the HTML table into a list or dictionary, which you can then transform into a DataFrame).
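
As a rough sketch of that two-step approach: the snippet below parses a table with Beautiful Soup into a list of rows and then builds the DataFrame from it. The HTML string is a made-up stand-in for your "data" variable, and it assumes the page has a simple table whose first row holds the column names; a real page may need a more specific selector.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML string held in `data`
data = """
<html><body>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>alice</td><td>10</td></tr>
  <tr><td>bob</td><td>7</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(data, "html.parser")
table = soup.find("table")  # first <table> on the page

# Step 1: read every row into a list of cell texts
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# Step 2: treat the first row as the header, the rest as data
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

Every value comes out as a string this way, so numeric columns would still need converting (for example with pd.to_numeric) if you want to do arithmetic on them.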

Edit 1

Actually, you can use pandas' read_html. You might still need Beautiful Soup to get exactly what you want, but depending on what the source HTML looks like, read_html alone might be enough.
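
A minimal sketch of the read_html route, again using a made-up HTML string in place of your "data" variable. It assumes a parser backend such as lxml is installed (your code already imports it). read_html returns a list with one DataFrame per table it finds, so you pick the one you need by index.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the HTML string held in `data`
data = """
<html><body>
<table>
  <thead><tr><th>name</th><th>score</th></tr></thead>
  <tbody>
    <tr><td>alice</td><td>10</td></tr>
    <tr><td>bob</td><td>7</td></tr>
  </tbody>
</table>
</body></html>
"""

# Wrap the string in StringIO so it is treated as file-like HTML content
tables = pd.read_html(StringIO(data))

# One DataFrame per <table> found; take the first one here
df = tables[0]
print(df)
```

If the page holds several tables, the match argument of read_html can narrow the result to tables containing a given piece of text.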

Upvotes: 1
