sdave
sdave

Reputation: 568

Bigquery table to df (dataframe) in a cloud Function

I have a bigquery table - I would like to extraxt it into a pandas dataframe inside cloud function and then do some changes in the header file and later save it into Cloud storage. Unfortunately my function is not working, can anyone see what could be the issue. Do i need to use big query extract job or my idea is also valid?

import base64
import pandas as pd
from google.cloud import bigquery


def extract_partial_return(event, context):
    client = bigquery.Client()
    bucket_name = "abc_test"
    project = "bq_project"
    dataset_id = "bq_dataset"
    table_id = "Partial_Return_Table"

    sql = """
    SELECT * FROM `bq_project.bq_dataset.Partial_Return_Table`
    """
    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]

    destination_uri = (
        "gs://abc_test/Exports/Partial_Return_Table.csv"
    )
    df.to_csv(destination_uri)

My requirement.txt looks like this

# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
pyarrow

Upvotes: 1

Views: 1305

Answers (2)

sdave
sdave

Reputation: 568

pyarrow library is the key here

import base64
import pandas as pd
from google.cloud import bigquery

def extract_partial_return(event, context):
    client = bigquery.Client()
    sql = """
    SELECT * FROM `bq_project.bq_dataset.Partial_Return_Table`
    """
    # Running the query and putting the results directly into a df
    df = client.query(sql).to_dataframe()
    df.columns = ["ga:t_Id", "ga:product", "ga:quantity"]

    destination_uri = ("gs://abc_test/Exports/Partial_Return_Table.csv")
    df.to_csv(destination_uri)

requirement.txt

pandas
fsspec
gcsfs
google-cloud-bigquery
google-cloud-storage
pyarrow

Upvotes: 2

Malaman
Malaman

Reputation: 314

First of all, the idea to do this is okay, however you may want to use other product such as Dataflow or Dataproc which are designed for these purposes.

On the other hand, in order to complete the idea that you have now, you should take care of the way you are constructing the SQL command because you are not using the variables created for the project, dataset, etc. The same issue happens on the bucket. Moreover, I think that you are lacking a couple of dependencies (fsspec and gcsfs).

Manuel

Upvotes: 1

Related Questions