nrad

Reputation: 93

Joining data from two sources using bigquery

Can anyone please check whether the code below is correct? In cte_1, I'm taking all dimensions and metrics from t1 except value1, value2 and value3. In cte_2, I'm generating a unique row number for t2. In cte_3, I'm taking all distinct dimensions and metrics using a join on two keys, Date and Ad. In cte_4, I'm keeping only the rows with row number 1. I'm getting sum(value1), sum(value2) and sum(value3) correct, but sum(value4) is incorrect.

WITH cte_1 AS
(SELECT *except(value1, value2, value3) FROM t1 where Date >"2020-02-16" and Publisher ="fb")
-- Find unique row number from t2
,cte_2 as (
SELECT ROW_NUMBER() OVER(ORDER BY Date) distinct_row_number, * FROM t2)

,cte_3 as
(SELECT cte_2.*, cte_1.* except(Date) FROM cte_2 join cte_1
on cte_2.Date = cte_1.Date
and cte_2.Ad = cte_1.Ad)

,cte_4 AS (
(SELECT *
FROM
(
SELECT *,
row_number() OVER (PARTITION BY distinct_row_number ORDER BY Date) as rn 
FROM cte_3 ) T
where rn = 1 ))

select  sum(value1),sum(value2),sum(value3),sum(value4) from cte_4

Please see the sample table below: [image: sample table]

Upvotes: 0

Views: 328

Answers (1)

Alexandre Moraes

Reputation: 4032

Your sample data does not quite match the query you shared: it has no field named Ad, and some fields are named differently (Date in the query, ReportDate in the data). Even so, I was able to identify some issues and propose improvements.

First, your temp table cte_1 only applies a filter in its WHERE clause. You could apply that filter in a subquery inside the FROM clause of your last step instead, such as:

SELECT * FROM (SELECT field1,field2,field3 FROM t1 WHERE Date > DATE(2020,02,16) )

Second, in cte_2, you need to select every column you will need from the table t2. Otherwise the result contains only the row number and cannot be joined with other tables, since it carries no other information. So if you need the row number, select it together with the other columns, which must include your key columns if you plan to perform a join later. The syntax would be as follows:

SELECT field1, field2, ROW_NUMBER() OVER(ORDER BY Date) FROM t2 
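To try this pattern without a BigQuery project, here is a minimal sketch that runs the same kind of statement against an in-memory SQLite database from Python. SQLite is only a stand-in here (window functions need SQLite 3.25 or newer), and the rows and the alias distinct_row_number are made-up illustration values, not your data.

```python
import sqlite3

# Hypothetical toy rows standing in for t2; column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (Date TEXT, Ad TEXT, value1 INTEGER)")
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?, ?)",
    [
        ("2020-02-18", "ad_b", 5),
        ("2020-02-17", "ad_a", 10),
        ("2020-02-17", "ad_b", 7),
    ],
)

# Keep the join keys (Date, Ad) next to the generated row number.
# Ordering by both keys makes the numbering deterministic for this data.
numbered = conn.execute(
    """
    SELECT Date, Ad, value1,
           ROW_NUMBER() OVER (ORDER BY Date, Ad) AS distinct_row_number
    FROM t2
    ORDER BY distinct_row_number
    """
).fetchall()

for row in numbered:
    print(row)
```

Each output row still carries Date and Ad, so it can be joined to another table on those keys afterwards.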

Third, in cte_3, I assume you want to perform an INNER JOIN. You therefore need to make sure that the join keys, in your case Date and Ad, are present in both tables; I could not find them in your data. Furthermore, you cannot have duplicate column names when joining two tables and selecting all the columns. For example, in your case Brand, value1, value2 and value3 exist in both tables, which will cause an error. You need to specify which table each of these fields should come from, either by selecting them one by one or by using an EXCEPT clause.
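As a rough illustration of selecting the fields one by one, the sketch below joins two toy tables on Date and Ad and names each output column exactly once. It uses Python's built-in sqlite3 module purely as a stand-in for BigQuery, and the table contents are invented for the example.

```python
import sqlite3

# Hypothetical single-row tables; Brand exists in both, as in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (Date TEXT, Ad TEXT, Brand TEXT, value4 INTEGER)")
conn.execute("CREATE TABLE t2 (Date TEXT, Ad TEXT, Brand TEXT, value1 INTEGER)")
conn.execute("INSERT INTO t1 VALUES ('2020-02-17', 'ad_a', 'acme', 4)")
conn.execute("INSERT INTO t2 VALUES ('2020-02-17', 'ad_a', 'acme', 10)")

# Name each output column once and state which table it comes from,
# so the shared columns (Date, Ad, Brand) do not appear twice.
cur = conn.execute(
    """
    SELECT t2.Date   AS Date,
           t2.Ad     AS Ad,
           t2.Brand  AS Brand,
           t2.value1 AS value1,
           t1.value4 AS value4
    FROM t2
    INNER JOIN t1
      ON t2.Date = t1.Date AND t2.Ad = t1.Ad
    """
)
column_names = [d[0] for d in cur.description]
rows = cur.fetchall()

print(column_names)
print(rows)
```

The same idea carries over to `SELECT t2.*, t1.* EXCEPT(...)` in BigQuery: every column shared by both tables has to be listed in the EXCEPT clause, not only Date.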

Finally, cte_4 and your final SELECT could be combined into one step. Basically, you keep only one row per partition, ordered by Date, and then sum the fields value1, value2 and value3 individually. Moreover, you are not selecting any identifier alongside the sums, which means the result will contain only the final totals. In general, when performing an aggregation such as SUM(), the key column(s) are selected as well. This could have been done in one step as follows, using only the data from t2:

SELECT ReportDate, Brand, SUM(value1) AS sum_1, SUM(value2) AS sum_2, SUM(value3) AS sum_3, SUM(value4) AS sum_4
FROM (SELECT t2.*, ROW_NUMBER() OVER(PARTITION BY ReportDate ORDER BY ReportDate) AS rn FROM t2)
WHERE rn = 1
GROUP BY ReportDate, Brand
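To make the point about identifiers concrete, here is a small sketch comparing a bare SUM() with the same SUM() grouped by its keys. It again uses an in-memory SQLite database from Python as a stand-in for BigQuery, with invented rows.

```python
import sqlite3

# Hypothetical rows standing in for t2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (ReportDate TEXT, Brand TEXT, value1 INTEGER)")
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?, ?)",
    [
        ("2020-02-17", "acme", 10),
        ("2020-02-17", "acme", 5),
        ("2020-02-18", "acme", 7),
        ("2020-02-18", "zeta", 1),
    ],
)

# Without grouping keys the query collapses everything into one total.
grand_total = conn.execute("SELECT SUM(value1) FROM t2").fetchall()

# With the keys selected and grouped, each total is labelled by its key.
per_key = conn.execute(
    """
    SELECT ReportDate, Brand, SUM(value1) AS sum_1
    FROM t2
    GROUP BY ReportDate, Brand
    ORDER BY ReportDate, Brand
    """
).fetchall()

print(grand_total)  # [(23,)]
print(per_key)
```

The grouped result makes it much easier to see which date or brand a wrong total comes from.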

UPDATE: With your explanation in the comment section, I was able to create a more specific query. The fields ReportDate, Brand, Portfolio, Campaign and value1, value2, value3 come from t2, whilst value4 comes from t1. Only rows whose row number equals 1 are summed. For this reason, the tables t1 and t2 are joined before ROW_NUMBER() is applied. Finally, the last SELECT statement does not select rn, and the data is aggregated by ReportDate, Brand, Portfolio and Campaign.

WITH cte_1 AS (
-- value1..value3 come from t2, value4 comes from t1
SELECT t2.ReportDate, t2.Brand, t2.Portfolio, t2.Campaign,
t2.value1, t2.value2, t2.value3, t1.value4
FROM t2 LEFT JOIN t1 ON t2.ReportDate = t1.ReportDate AND t1.placement = t2.Ad
),
cte_2 AS (
-- cte_1 exposes ReportDate (there is no Date column), so partition on ReportDate
SELECT *, ROW_NUMBER() OVER(PARTITION BY ReportDate ORDER BY ReportDate) AS rn FROM cte_1
)
SELECT ReportDate, Brand, Portfolio, Campaign, SUM(value1) AS sum1, SUM(value2) AS sum2, SUM(value3) AS sum3,
SUM(value4) AS sum4
FROM cte_2
WHERE rn = 1
GROUP BY 1, 2, 3, 4
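To see what the join and the rn = 1 filter do to the sums, here is a toy sketch that can be run locally. It makes several assumptions: SQLite (3.25 or newer) via Python stands in for BigQuery, the rows are invented, and the window is partitioned by (ReportDate, Ad) so that one joined row survives per t2 row, which is my choice for this toy data and not necessarily the right partition for yours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (ReportDate TEXT, Ad TEXT, value1 INTEGER)")
conn.execute("CREATE TABLE t1 (ReportDate TEXT, placement TEXT, value4 INTEGER)")
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?, ?)",
    [("2020-02-17", "ad_a", 10), ("2020-02-18", "ad_a", 20)],
)
# Two t1 rows match the first t2 row, so the join repeats that t2 row.
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [("2020-02-17", "ad_a", 1), ("2020-02-17", "ad_a", 2), ("2020-02-18", "ad_a", 3)],
)

JOINED = """
    SELECT t2.ReportDate AS ReportDate, t2.Ad AS Ad,
           t2.value1 AS value1, t1.value4 AS value4
    FROM t2
    LEFT JOIN t1
      ON t2.ReportDate = t1.ReportDate AND t1.placement = t2.Ad
"""

# Summing straight after the join counts value1 once per matching t1 row.
naive = conn.execute(
    f"SELECT SUM(value1), SUM(value4) FROM ({JOINED})"
).fetchone()

# Numbering the joined rows and keeping rn = 1 leaves one row per (ReportDate, Ad).
deduped = conn.execute(
    f"""
    SELECT SUM(value1), SUM(value4)
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY ReportDate, Ad ORDER BY value4
        ) AS rn
        FROM ({JOINED})
    )
    WHERE rn = 1
    """
).fetchone()

print(naive)    # (40, 6)
print(deduped)  # (30, 4)
```

In this toy data the rn = 1 filter brings SUM(value1) back to the t2 total of 30, but SUM(value4) then only sees the t1 row that survived in each partition. Which row that is depends on the PARTITION BY and ORDER BY you choose, so it is worth checking those against how value4 should be counted.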

Upvotes: 1
