Reputation: 185
I need a way to manipulate a table in Google BigQuery which can use multiple columns to set the values in another column using pandas. From what I understand bigframes can't operate on more than one column at a time?
I am looking for something that can do the following (not using SQL) and all the processing is done by BigQuery.
import bigframes.pandas as bpd
@bpd.remote_function([bool],
bool,
bigquery_connection=client.project,
cloud_function_service_account=credentials.signer_email)
def calculate_time_since_trades(self):
table['time_sum'] = table['time1'] + table['time2']
Upvotes: 1
Views: 498
Reputation: 14776
BigQuery DataFrames (bigframes) can combine columns in several ways.
First, I assume you have a DataFrame object like the following:
import bigframes
bigframes.__version__
# 1.7.0
import bigframes.pandas as bpd
bpd.options.bigquery.project = "my-project-id"
df = bpd.read_gbq_table(
"my-project-id.my_dataset.my_table",
)
In order of preference / efficiency:
Just like pandas, BigQuery DataFrames can align objects when used directly with an operator.
In your example, you could do the following:
df['time_sum'] = table['time1'] + table['time2']
If you are doing something that doesn't have a built-in operation and do need to use a remote function, two columns are supported via the Series.combine method.
@bpd.remote_function([int, int], int)
def calculate_time_since_trades(time1, time2):
if time2 is None:
return time1
if time1 is None:
return time2
return time1 + time2
df['time_sum'] = df['time1'].combine(table['time2'], calculate_time_since_trades)
If you need to combine more than 2 columns in a way that's not supported by an existing operator, you can use the DataFrame.apply method with axis=1
and a remote function with input type of Series
. Note: This is currently a Preview feature and the wire format used for requests/responses may change in future, so don't use this with reuse=True
and name
parameters.
@bpd.remote_function(bpd.Series, int)
def calculate_time_since_trades(df):
return df["time1"] + df["time2"]
It's recommended to select just the necessary columns before running apply
because data type support is limited and it lowers the amount of data needed to be serialized.
df["time_sum"] = df[
["time1", "time2"]
].apply(calculate_time_since_trades, axis=1)
Upvotes: 1