Rbe

Reputation: 11

Parse Table data from a public google doc using Python

I have a URL to a public google doc which is published (It says published using Google Docs at the top). It has a URL in the form of https://docs.google.com/document/d/e/<Some long random string, I think the ID of the document>/pub

Please note that this is not a spreadsheet (Google sheet), but a doc. This doc contains some explanatory text at the beginning and then a table I need to read. How do I accomplish this using Python and only the URL? I don't have much knowledge of Google APIs, etc. I don't want the text at the beginning, but only the table data in some popular format like a Pandas dataframe, etc. The table data could also contain Unicode characters.

I tried following the steps in the Docs API quickstart guide (https://developers.google.com/docs/api/quickstart/python). After I followed the instructions, the sample code (copy-pasted as-is) worked, though it involved several steps: creating a new Google Cloud project, enabling the API, configuring the OAuth consent screen, and then authorizing credentials for a desktop application. However, when I replaced the example document ID (the string inside the quotes

DOCUMENT_ID = "195j9eDD3ccgjQRttHhJPymLJUCOUjs-jmwTrekvdjFE")

with the ID of the document I need to access, I got this error:

<HttpError 404 when requesting https://docs.googleapis.com/v1/documents/<MY_GIVEN_DOCUMENT_ID>?alt=json returned "Requested entity was not found.". Details: "Requested entity was not found.">

I just want a simple solution which uses only the published doc's URL, since the doc is already public. I don't want to go through some authentication steps. I need that even if I send the code to someone else, they can also run the same code and get the same results without any authentication issues. Please help me with this.

Upvotes: 1

Views: 1211

Answers (1)

Sam

Reputation: 33

I ran into this exact same problem. I'm going to guess you and I were probably doing the same application challenge!

Using requests, I was able to pull down the raw HTML of the published page, and with BeautifulSoup I turned it into a workable, parse-able object:

import requests
from bs4 import BeautifulSoup

def fetch_first_table(url):
    # Make the request
    html_response = requests.get(url=url)

    # Parse the HTML into a BeautifulSoup object
    soup = BeautifulSoup(html_response.text, 'html.parser')

    # Collect and return the first table (assuming the first table is what you want)
    return soup.find('table')

From there, you can parse the table more precisely to pull out the data you want: iterating over the table's tr and td tags is usually all it takes.
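Since the question asks for the table data as a pandas DataFrame, here is a minimal sketch of that last step. The HTML string below is a stand-in for the published doc's markup (the real page would come from requests as above), and it includes a Unicode cell since the question mentions Unicode characters:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in HTML; in practice this would be the <table> found in the
# published doc's page, as returned by soup.find('table') above.
html = """
<table>
  <tr><td>x</td><td>y</td><td>char</td></tr>
  <tr><td>0</td><td>0</td><td>█</td></tr>
  <tr><td>1</td><td>0</td><td>░</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# Published Google Docs tables often put the header in a first row of
# plain <td> cells rather than <th>, so read every row the same way.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    for tr in table.find_all("tr")
]

# Treat the first row as the header and the rest as data.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

Reading every row uniformly and splitting the header off afterwards keeps the code robust whether the doc's table uses th cells or not.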

I'm refraining from copy-pasting my exact solution because I know others will use this to fill out the same job application challenge, but this gets you everything you need as long as you have a Python foundation.

Upvotes: 0
