pandas big frames manipulating multiple columns in googles bigquery

I need a way to manipulate a table in Google BigQuery which can use multiple columns to set the values in another column using pandas. From what I understand bigframes can't operate on more than one column at a time?

I am looking for something that can do the following (not using SQL) and all the processing is done by BigQuery.

import bigframes.pandas as bpd

@bpd.remote_function([bool],
                     bool,
                     bigquery_connection=client.project,
                     cloud_function_service_account=credentials.signer_email)
def calculate_time_since_trades(self):
    
    table['time_sum'] = table['time1'] + table['time2']

Upvotes: 1

Answers (1)

Tim Swena

Reputation: 14776

BigQuery DataFrames (bigframes) can combine columns in several ways.

First, I assume you have a DataFrame object like the following:

import bigframes

bigframes.__version__
# 1.7.0

import bigframes.pandas as bpd

bpd.options.bigquery.project = "my-project-id"

df = bpd.read_gbq_table(
    "my-project-id.my_dataset.my_table",
)

In order of preference / efficiency:

Just like pandas, BigQuery DataFrames can align objects when used directly with an operator.

In your example, you could do the following:
```
df['time_sum'] = table['time1'] + table['time2']
```

If you are doing something that doesn't have a built-in operation and do need to use a remote function, two columns are supported via the Series.combine method.

@bpd.remote_function([int, int], int)
def calculate_time_since_trades(time1, time2):
    if time2 is None:
        return time1
    if time1 is None:
        return time2
    return time1 + time2

df['time_sum'] = df['time1'].combine(table['time2'], calculate_time_since_trades)

If you need to combine more than 2 columns in a way that's not supported by an existing operator, you can use the DataFrame.apply method with axis=1 and a remote function with input type of Series. Note: This is currently a Preview feature and the wire format used for requests/responses may change in future, so don't use this with reuse=True and name parameters.
```
@bpd.remote_function(bpd.Series, int)
def calculate_time_since_trades(df):
    return df["time1"] + df["time2"]
```
It's recommended to select just the necessary columns before running apply because data type support is limited and it lowers the amount of data needed to be serialized.
```
df["time_sum"] = df[
    ["time1", "time2"]
].apply(calculate_time_since_trades, axis=1)
```

Upvotes: 1

pandas big frames manipulating multiple columns in googles bigquery

Answers (1)

Related Questions