elvainch

Reputation: 1407

Testing the Great Expectations datasource YAML with BigQuery

I am having trouble testing the YAML of a Great Expectations datasource against BigQuery. I followed the official documentation and arrived at this code:

import great_expectations as ge

datasource_yaml = """
name: my_bigquery_datasource
class_name: Datasource
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  connection_string: bigquery://<GCP_PROJECT_NAME>/<BIGQUERY_DATASET>
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
  default_inferred_data_connector_name:
    class_name: InferredAssetSqlDataConnector
    include_schema_name: true
"""
context = ge.get_context()

context.test_yaml_config(datasource_yaml)

The code works, but it takes a very long time. After some deep debugging I found the problem: it tries to retrieve all the datasets in the BigQuery project and all the tables from every dataset. We have over 200 datasets and thousands of tables. I haven't found a way to filter down to the one dataset I need, or more specifically to the one table. I thought the connection_string would do it, but it doesn't.

While debugging I got to the inferred_asset_sql_data_connector.py module. I saw that it should filter by schema_name; the problem is that schema_name always comes through as None, and I don't know how to pass it the dataset I want.


I followed the introspection guide as well, but got other errors.

If I use SimpleSqlalchemyDatasource as the class_name I get a different error, and I don't know how to initialize the SQLAlchemy engine for BigQuery in the context of Great Expectations.


Upvotes: 2

Views: 1010

Answers (1)

Abhinay

Reputation: 474

The default_inferred_data_connector_name connector tries to fetch all dataset and table info from BigQuery and creates an asset for every table. You can remove default_inferred_data_connector_name and instead use a RuntimeBatchRequest with a query to validate the data.

Regarding the authentication issue, you can change the

connection_string: bigquery://<GCP_PROJECT_NAME>/<BIGQUERY_DATASET>

to

connection_string: bigquery://<GCP_PROJECT_NAME>/<BIGQUERY_DATASET>?credentials_path=<path_to_credential file >

More info on the SQLAlchemy configuration can be found at https://github.com/googleapis/python-bigquery-sqlalchemy
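Putting both suggestions together, the datasource YAML would look something like the following sketch: the inferred connector is removed and the credentials path is appended to the connection string (the project, dataset, and path placeholders are yours to fill in):

```yaml
name: my_bigquery_datasource
class_name: Datasource
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  connection_string: bigquery://<GCP_PROJECT_NAME>/<BIGQUERY_DATASET>?credentials_path=<path_to_credential_file>
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
```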

Upvotes: 2

Related Questions