jack gell
jack gell

Reputation: 185

pandas big frames manipulating multiple columns in googles bigquery

I need a way to manipulate a table in Google BigQuery which can use multiple columns to set the values in another column using pandas. From what I understand bigframes can't operate on more than one column at a time?

I am looking for something that can do the following (not using SQL) and all the processing is done by BigQuery.

import bigframes.pandas as bpd

@bpd.remote_function([bool],
                     bool,
                     bigquery_connection=client.project,
                     cloud_function_service_account=credentials.signer_email)
def calculate_time_since_trades(self):
    
    table['time_sum'] = table['time1'] + table['time2']

Upvotes: 1

Views: 498

Answers (1)

Tim Swena
Tim Swena

Reputation: 14776

BigQuery DataFrames (bigframes) can combine columns in several ways.

First, I assume you have a DataFrame object like the following:

import bigframes

bigframes.__version__
# 1.7.0

import bigframes.pandas as bpd

bpd.options.bigquery.project = "my-project-id"

df = bpd.read_gbq_table(
    "my-project-id.my_dataset.my_table",
)

In order of preference / efficiency:

  1. Just like pandas, BigQuery DataFrames can align objects when used directly with an operator.

    In your example, you could do the following:

    df['time_sum'] = table['time1'] + table['time2']
    
  2. If you are doing something that doesn't have a built-in operation and do need to use a remote function, two columns are supported via the Series.combine method.

    @bpd.remote_function([int, int], int)
    def calculate_time_since_trades(time1, time2):
        if time2 is None:
            return time1
        if time1 is None:
            return time2
        return time1 + time2
    
    df['time_sum'] = df['time1'].combine(table['time2'], calculate_time_since_trades)
    
  3. If you need to combine more than 2 columns in a way that's not supported by an existing operator, you can use the DataFrame.apply method with axis=1 and a remote function with input type of Series. Note: This is currently a Preview feature and the wire format used for requests/responses may change in future, so don't use this with reuse=True and name parameters.

    @bpd.remote_function(bpd.Series, int)
    def calculate_time_since_trades(df):
        return df["time1"] + df["time2"]
    

    It's recommended to select just the necessary columns before running apply because data type support is limited and it lowers the amount of data needed to be serialized.

    df["time_sum"] = df[
        ["time1", "time2"]
    ].apply(calculate_time_since_trades, axis=1)
    

Upvotes: 1

Related Questions