WouldLiketoKnowMore

Reputation: 21

BigQuery, is there a way to search the whole db for a string?

I have a datum that I need to find within the database, for example 'dsfsdfsads', but there are 100+ tables and views to search through. I have found numerous queries written for other databases that can find a specific string within the database; example postings are below. However, I don't see the same for BigQuery. I found this question: Is it possible to do a full text search in all tables in BigQuery?, but that post feels incomplete and the two links provided in the answer do not answer my question.

Examples of other database queries capable of finding specific string:

Find a string by searching all tables in SQL Server Management Studio 2008

Search all tables, all columns for a specific value SQL Server

How do I search an SQL Server database for a string?

Upvotes: 1

Views: 2506

Answers (1)

Ksign

Reputation: 817

I am not sure why it doesn't suit you to search through your database using a wildcard table like in the post you mentioned, because I have run this sample query against a public dataset and it works just fine:

SELECT *
FROM `bigquery-public-data.baseball.*` b
WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r'Cubs')

I guess it is because one of the limitations of wildcard tables is that they do not support views.
Do you have a lot of them?

In that case you can use the wildcard only for your tables and filter out the views with _TABLE_SUFFIX or with a less general wildcard (it depends on the names of your views).
In general, with wildcard tables, using _TABLE_SUFFIX can greatly reduce the number of bytes scanned, which reduces the cost of running your queries. So use it also if you suspect that some tables are more likely to contain the string than others.
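To illustrate both ideas, here is a rough sketch; the tbl_ and v_ name prefixes, the project ID and the dataset name are placeholder assumptions about your own naming, not anything BigQuery imposes:

-- Option A: a less general wildcard that only matches tables sharing a prefix
SELECT *
FROM `YOUR_PROJECT_ID.YOUR_DATASET.tbl_*` b
WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r'dsfsdfsads')

-- Option B: scan everything but use _TABLE_SUFFIX to skip names that belong to views
-- (the same filter also prunes the tables scanned, which is what reduces cost)
SELECT *
FROM `YOUR_PROJECT_ID.YOUR_DATASET.*` b
WHERE NOT STARTS_WITH(_TABLE_SUFFIX, 'v_')
  AND REGEXP_CONTAINS(TO_JSON_STRING(b), r'dsfsdfsads')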

For the views (or the whole dataset), you could:
• Iterate by calling the BigQuery API through one of the client libraries, optionally parallelized with a module like multiprocessing in Python.
• Iterate by calling the REST API from a bash script (a curl sketch follows below).
• Iterate by using the bq command from a bash script.
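For the second option, a rough curl sketch against the jobs.query endpoint of the REST API could look like this; it assumes gcloud is installed and authenticated, and the project, dataset and view names are placeholders you would loop over:

# Run one query synchronously against a single view and return the result as JSON
ACCESS_TOKEN=$(gcloud auth print-access-token)
curl -s -X POST \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Content-Type: application/json" \
  "https://bigquery.googleapis.com/bigquery/v2/projects/YOUR_PROJECT_ID/queries" \
  -d '{"query": "SELECT * FROM `YOUR_PROJECT_ID.YOUR_DATASET.YOUR_VIEW` b WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r\"dsfsdfsads\") LIMIT 1", "useLegacySql": false}'

The response is JSON; a matching row, if any, comes back in the rows field.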

If you get stuck with the programmatic part, post a new question and add the link here.

EDIT:

Here are two examples for you (Python and bash). I tried them both and they work, but any comments to help improve them are of course welcome.

Python:

  1. Install the BigQuery client library (the multiprocessing module used below is part of the Python standard library, so it does not need to be installed separately)
pip install --upgrade google-cloud-bigquery
  2. Create filename.py. Change YOUR_PROJECT_ID and YOUR_DATASET.

from google.cloud import bigquery
import multiprocessing

def search(dataset_id):
    """
    Lists and filters your dataset to keep only views
    """
    client = bigquery.Client()
    
    tables = client.list_tables(dataset_id)
    views = []
    for table in tables:
        if table.table_type == 'VIEW':
            views.append(table.table_id)
    return views

def query(dataset_id, view):
    """
    Searches for the string in one view and prints the first row it finds.
    Replace "true" below with the string you are looking for;
    you can change or remove 'LIMIT 1' if needed.
    """
    client = bigquery.Client()
    query_job = client.query(
        """
        SELECT *
        FROM `{}.{}` b
        WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r"true")
        LIMIT 1
        """.format(dataset_id, view)
    )

    results = query_job.result()  # Waits for job to complete.
    for row in results:
        print(row)

if __name__ == '__main__':
    # TODO: Set dataset_id to the ID of the dataset that contains the tables you are listing.
    dataset_id = 'YOUR_PROJECT_ID.YOUR_DATASET'
    views = search(dataset_id)

    processes = []
    for i in views:
        p = multiprocessing.Process(target=query, args=(dataset_id, i))
        p.start()
        processes.append(p)
    for process in processes:
        process.join()

Run python filename.py


Bash:

  1. Install jq (json parser) and test it
sudo apt-get install jq

Test

echo '{ "name":"John", "age":31, "city":"New York" }' | jq .

Output:

{
  "name": "John",
  "age": 31,
  "city": "New York"
}


  2. Create filename.sh. Change YOUR_PROJECT_ID and YOUR_DATASET.
#!/bin/bash
# List every object in the dataset and flatten the JSON to one object per line
FILES="bq ls --format prettyjson YOUR_DATASET"
RESULTS=$(eval $FILES)
DETAILS=$(echo "${RESULTS}" | jq -c '.[]')

for d in $DETAILS
do
    # Extract the object name and its type (TABLE or VIEW)
    ID=$(echo $d | jq -r '.tableReference.tableId')
    table_type=$(echo $d | jq -r '.type')
    if [[ $table_type == "VIEW" ]]
    then
        # Search the view; replace "true" with the string you are looking for
        bq query --use_legacy_sql=false \
        'SELECT *
        FROM
        `YOUR_PROJECT_ID`.YOUR_DATASET.'$ID' b
        WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r"true")
        LIMIT 1'
    fi
done

Run bash filename.sh

Upvotes: 2
