Reputation: 81
I've created (with help from StackOverflow!) a reasonably complicated SQL query to perform cohort analysis on how purchasing a particular product affects revenue.
I created it to be parameterized, so that the query can be passed ${PRODUCT}
as the product to evaluate and it'll generate the report.
SELECT
DATE_DIFF(PARSE_DATE("%Y-%m", month), PARSE_DATE("%Y-%m", earliest_use), MONTH) as nplus,
CASE WHEN ProductName = '${PRODUCT}' THEN '${PRODUCT}' ELSE 'Other' END as product,
CustomerInfo,
AVG(net_revenue) as avg_net_revenue,
COUNT(DISTINCT CustomerID) as customers
FROM (
SELECT
month,
rev.CustomerID,
ProductName,
customers.CustomerInfo,
SUM(Revenue) AS net_revenue,
MIN(MIN(CASE WHEN ProductName = '${PRODUCT}' THEN month END)) OVER (PARTITION BY rev.CustomerID) AS earliest_use
FROM
RevenueTable as rev,
CustomerTable AS customers
WHERE
customers.CustomerID = rev.CustomerID
AND rev.CustomerID IN (
SELECT DISTINCT CustomerID
FROM RevenueTable
WHERE ProductName = '${PRODUCT}'
)
GROUP BY 1,2,3,4
)
GROUP BY nplus, product, CustomerInfo
HAVING nplus >= -7
ORDER BY nplus
However, I now want to create a generic version that simple iterates this query over ALL products.
Any help would be much appreciated.
Upvotes: 0
Views: 82
Reputation: 400
If you are using CLI to run your parameterized query, you could use an script like:
bq query --format=csv --use_legacy_sql=false 'SELECT product_desc FROM `PROJECT.DATASET.products_list`' | xargs -I {} bq query --use_legacy_sql=false --parameter=product::{} 'SELECT fact_id, product_desc FROM `PROJECT.DATASET.product_fact_table` WHERE product_desc = @product'
In this script, I'm querying the name of the products from a table named "products_list" and then passing each one to the query that get all the operations related to a specific product in the table called "product_fact_table". The result looks like:
On the other hand, if you are using the API or some of the client libraries (Go, Java, Node.js or Python), I think you only need to build a loop according with the programming language you are using.
The BigQuery UI currently doesn't support parameterized queries.
In addition, I would like to share with you an additional alternative using Scripting:
DECLARE count INT64 default 0;
DECLARE target_product STRING;
DECLARE products_array ARRAY<STRING>;
DECLARE number_of_products INT64;
SET number_of_products = (
WITH products AS
(SELECT ARRAY (select product_desc from `PROJECT_ID.DATASET.products_list`) as products_result)
SELECT ARRAY_LENGTH(products_result) AS len
FROM products);
SET products_array = (
SELECT ARRAY_AGG(product_desc)
FROM `PROJECT_ID.DATASET.products_list`
);
LOOP
SET target_product = products_array[OFFSET(count)];
SELECT * FROM `PROJECT_ID.DATASET.product_fact_table` WHERE product_desc=target_product;
SET count = count + 1;
IF count >= number_of_products THEN
LEAVE;
END IF;
END LOOP;
Here, I'm declaring the "count" variable and using it to count the number of iterations processed, the "target_product" variable contains the name of the product on each iteration, the "products_array" is set with all the products from a query and, finally, the "number_of_products" contains the total of products and it is used to limit the number of loop iterations.
Upvotes: 1