Pandasonsleds

Reputation: 15

How to read a public Google Doc using the Google Docs API?

I am attempting to read a Google Doc using the Google Docs API. However, for published Google Docs files I don't have access to the Document ID, and I cannot get the ID from the author. In particular, I am attempting to read this document: https://docs.google.com/document/u/0/d/e/2PACX-1vRMx5YQlZNa3ra8dYYxmv-QIQ3YJe8tbI3kqcuC7lQiZm-CSEznKfN_HYNSpoXcZIV3Y_O3YoUB1ecq/pub?pli=1 . Is there a way to read this file using the Google Docs API, or should I look into a different method such as Beautiful Soup?

The code I am using comes from https://developers.google.com/docs/api/quickstart/python.

import os.path

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# If modifying these scopes, delete the file token.json.
SCOPES = ["https://www.googleapis.com/auth/documents.readonly"]

# The ID of a sample document.
DOCUMENT_ID = "195j9eDD3ccgjQRttHhJPymLJUCOUjs-jmwTrekvdjFE"


def main():
  """Shows basic usage of the Docs API.
  Prints the title of a sample document.
  """
  creds = None
  # The file token.json stores the user's access and refresh tokens, and is
  # created automatically when the authorization flow completes for the first
  # time.
  if os.path.exists("token.json"):
    creds = Credentials.from_authorized_user_file("token.json", SCOPES)
  # If there are no (valid) credentials available, let the user log in.
  if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
      creds.refresh(Request())
    else:
      flow = InstalledAppFlow.from_client_secrets_file(
          "credentials.json", SCOPES
      )
      creds = flow.run_local_server(port=0)
    # Save the credentials for the next run
    with open("token.json", "w") as token:
      token.write(creds.to_json())

  try:
    service = build("docs", "v1", credentials=creds)

    # Retrieve the documents contents from the Docs service.
    document = service.documents().get(documentId=DOCUMENT_ID).execute()

    print(f"The content of the document is: {document.get('body').get('content')}")
  except HttpError as err:
    print(err)


if __name__ == "__main__":
  main()

Upvotes: 0

Views: 483

Answers (1)

Tanaike

Reputation: 201358

From your question, I understand your situation and expected result as follows.

  • You have no Google Document ID. You have only a web-published URL https://docs.google.com/document/u/0/d/e/2PACX-1vRMx5YQlZNa3ra8dYYxmv-QIQ3YJe8tbI3kqcuC7lQiZm-CSEznKfN_HYNSpoXcZIV3Y_O3YoUB1ecq/pub?pli=1.
  • You want to retrieve an object from the above URL using document = service.documents().get(documentId=DOCUMENT_ID).execute() of Google Docs API.
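The difference between the two kinds of URL can be illustrated with a small sketch. It assumes that a regular Docs URL carries the Document ID right after `/d/`, while a web-published URL carries an opaque `2PACX-...` token after `/d/e/`. The function name `classify_docs_url` and both regular expressions are hypothetical helpers written for this illustration, not part of any Google library.

```python
import re

# Assumed URL shapes (not an official specification):
#   regular:   https://docs.google.com/document/d/<DOCUMENT_ID>/edit
#   published: https://docs.google.com/document/d/e/2PACX-<token>/pub
PUBLISHED_RE = re.compile(r"/document/(?:u/\d+/)?d/e/(2PACX-[\w-]+)/pub")
DOCUMENT_RE = re.compile(r"/document/(?:u/\d+/)?d/([\w-]+)(?:/|$)")


def classify_docs_url(url):
    """Return ("published", token), ("document", document_id) or ("unknown", None)."""
    match = PUBLISHED_RE.search(url)
    if match:
        # The 2PACX token is NOT a Document ID and cannot be passed to documents().get().
        return "published", match.group(1)
    match = DOCUMENT_RE.search(url)
    if match and match.group(1) != "e":
        return "document", match.group(1)
    return "unknown", None
```

With the URL from the question, this returns `"published"`, which is the case where the workaround is needed.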

To retrieve the object with the Google Docs API, the Google Document ID is required. Unfortunately, at present there is no method for directly retrieving the Document ID from a web-published URL such as https://docs.google.com/document/u/0/d/e/2PACX-###/pub?pli=1. So, in this case, a workaround is needed. The steps of this workaround are as follows.

  1. Download the data from https://docs.google.com/document/u/0/d/e/2PACX-###/pub?pli=1 as HTML.
  2. Upload the HTML to Google Drive as a Google Document.
  3. Retrieve an object using your script document = service.documents().get(documentId=DOCUMENT_ID).execute().

To upload the downloaded HTML to Google Drive as a Google Document, the Drive API is required.

When these steps are reflected in your script, it becomes as follows.

Modified script:

In this modified script, the scopes are changed. So, please delete your current token.json and run the script again. A new token.json that includes the new scopes is then created.
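If you prefer not to delete token.json by hand, a check like the following sketch could run before the authorization code. It assumes that token.json contains a `scopes` list, which is what files written by `creds.to_json()` normally include. The helper name `remove_stale_token` is hypothetical.

```python
import json
import os

REQUIRED_SCOPES = [
    "https://www.googleapis.com/auth/documents.readonly",
    "https://www.googleapis.com/auth/drive.file",
]


def remove_stale_token(path="token.json", required=REQUIRED_SCOPES):
    """Delete the token file if it lacks any required scope.

    Returns True if the file was deleted, False otherwise.
    """
    if not os.path.exists(path):
        return False
    try:
        with open(path) as f:
            # Assumption: the granted scopes are stored under the "scopes" key.
            granted = set(json.load(f).get("scopes") or [])
    except (ValueError, AttributeError):
        granted = set()  # Unreadable or unexpected file: treat it as stale.
    if set(required) <= granted:
        return False
    os.remove(path)
    return True
```

Calling `remove_stale_token()` once at the start of `main()` forces the authorization flow to run again only when the stored scopes are insufficient.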

import io
import os.path

import requests
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from googleapiclient.http import MediaIoBaseUpload

SCOPES = ["https://www.googleapis.com/auth/documents.readonly",
          "https://www.googleapis.com/auth/drive.file"]


def main():
    """Converts a web-published Google Document to a new Google Document.
    Downloads the published HTML, uploads it to Drive, and prints the body content.
    """
    creds = None
    # The file token.json stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists("token.json"):
        creds = Credentials.from_authorized_user_file("token.json", SCOPES)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                "credentials.json", SCOPES
            )
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open("token.json", "w") as token:
            token.write(creds.to_json())

    try:
        url = "https://docs.google.com/document/u/0/d/e/2PACX-1vRMx5YQlZNa3ra8dYYxmv-QIQ3YJe8tbI3kqcuC7lQiZm-CSEznKfN_HYNSpoXcZIV3Y_O3YoUB1ecq/pub?pli=1" # This is from your question.
        filename = "tempDocument"

        # Download HTML from URL.
        res1 = requests.get(url)
        res1.raise_for_status()  # Stop here if the published page could not be fetched.
        print("Done: download HTML.")

        # Upload HTML as a Google Document
        drive = build("drive", "v3", credentials=creds)
        media = MediaIoBaseUpload(io.BytesIO(res1.content), mimetype='text/html', resumable=True)
        request = drive.files().create(
            media_body=media,
            body={'name': filename, 'mimeType': "application/vnd.google-apps.document"}
        )
        res2 = None
        while res2 is None:
            status, res2 = request.next_chunk()
            if status:
                print("Uploaded %d%%." % int(status.progress() * 100))
        DOCUMENT_ID = res2['id']
        print("Done: upload HTML as a Google Document.")

        # Retrieve object from a Google Document.
        service = build("docs", "v1", credentials=creds)

        # Retrieve the documents contents from the Docs service.
        document = service.documents().get(documentId=DOCUMENT_ID).execute()

        # To skip the header added by the published page, drop the leading elements.
        # The offset 6 is specific to this converted document; adjust it if needed.
        contentWithoutHeader = document.get('body').get('content')[6:]
        print(contentWithoutHeader)

        # print(f"The title of the document is: {document.get('body').get('content')}")
    except HttpError as err:
        print(err)


if __name__ == "__main__":
    main()

  • I ran this script and confirmed that the object can be obtained with document = service.documents().get(documentId=DOCUMENT_ID).execute().
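The retrieved object is a nested structure, so printing it directly is hard to read. As a minimal sketch, the text can be collected like this, assuming the usual Docs API layout where `body.content` is a list of structural elements and the text sits in `paragraph.elements[].textRun.content` (tables nest further content inside `tableRows[].tableCells[].content`). The function name is chosen only for this illustration.

```python
def read_structural_elements(elements):
    """Recursively collect the text of paragraphs and table cells."""
    parts = []
    for element in elements:
        if "paragraph" in element:
            for run in element["paragraph"].get("elements", []):
                # Elements without a textRun (e.g. inline objects) contribute nothing.
                parts.append(run.get("textRun", {}).get("content", ""))
        elif "table" in element:
            for row in element["table"].get("tableRows", []):
                for cell in row.get("tableCells", []):
                    parts.append(read_structural_elements(cell.get("content", [])))
        # Other element types (e.g. sectionBreak) are skipped.
    return "".join(parts)
```

In the script above, this would be used as `text = read_structural_elements(document.get('body').get('content'))`.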

Note:

  • In this workaround, the whole page at the URL is converted to a Google Document. So, the following header is included.

    Published using Google Docs
    Report abuseLearn more
    Coding assessment input data example
    Updated automatically every 5 minutes
    
    
  • In this workaround, the converted Document might not be exactly the same as the original Document, because it is converted from the downloaded HTML. Please be careful about this.
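Regarding the header, a fixed offset such as `[6:]` depends on how the header happens to be converted. As a hedged alternative, the header could be skipped by searching for its last line. This sketch assumes that the line "Updated automatically every 5 minutes" shown in the header above ends up inside a single paragraph of the converted Document; the helper names are hypothetical, and blank paragraphs that follow the header are not removed.

```python
HEADER_END_MARKER = "Updated automatically every 5 minutes"  # Last header line shown above.


def paragraph_text(element):
    """Return the text of one structural element if it is a paragraph, else ''."""
    runs = element.get("paragraph", {}).get("elements", [])
    return "".join(run.get("textRun", {}).get("content", "") for run in runs)


def skip_published_header(content, marker=HEADER_END_MARKER):
    """Return the elements after the one containing `marker`.

    If the marker is not found, the content is returned unchanged.
    """
    for index, element in enumerate(content):
        if marker in paragraph_text(element):
            return content[index + 1:]
    return content
```

In the script, `skip_published_header(document.get('body').get('content'))` would take the place of the `[6:]` slice.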

Upvotes: 1
