Reputation: 1823
I'd like to extraction the mean, max, min and sd extraction inside 5-95 quantiles for the variables B2
,
B3
, B4
, B8
, NDVI
, SAVI
, SIPI
, SR
, RGI
, TVI
, MSR
, PRI
, GNDVI
, PSRI
, GCI
aggregate by AGE
and ESPAC
variables inside a CMPC table:
My CMPC SQL table ([PROJECT_ID].spectra_calibration.CMPC
) create inside BigQuery:
Rows: 55.310
Columns: 27
Database: BigQueryConnection
$ x <dbl> -52.5502, -52.5501, -52.5501, -52.5501, -52.5501, -52.5500, -52.5500, -52.5500, -52.5500, -52.5500,~
$ y <dbl> -30.8295, -30.8297, -30.8296, -30.8295, -30.8294, -30.8298, -30.8297, -30.8296, -30.8295, -30.8294,~
$ stand <chr> "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO0~
$ date <chr> "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "~
$ B2 <dbl> 213, 205, 181, 207, 216, 205, 165, 161, 173, 182, 181, 259, 227, 190, 153, 147, 160, 164, 194, 210,~
$ B3 <dbl> 361.0, 362.0, 346.0, 352.0, 369.0, 330.0, 290.0, 326.0, 334.0, 332.0, 325.0, 375.0, 352.0, 307.0, 2~
$ B4 <dbl> 227.0, 233.0, 198.0, 207.0, 209.0, 227.0, 178.0, 164.0, 180.0, 207.0, 209.0, 267.0, 269.0, 194.0, 1~
$ B8 <dbl> 3033.0, 3307.0, 3322.0, 3232.0, 3241.0, 3065.0, 3306.0, 3422.0, 3427.0, 3392.0, 3165.0, 3206.0, 298~
$ NDVI <dbl> 0.86074, 0.86836, 0.88750, 0.87962, 0.87884, 0.86209, 0.89782, 0.90853, 0.90019, 0.88497, 0.87611, ~
$ SAVI <dbl> 4549.379, 4960.386, 4982.905, 4847.897, 4861.397, 4597.380, 4958.915, 5132.925, 5140.417, 5087.903,~
$ SIPI <dbl> 1.00499, 1.00911, 1.00544, 1.00000, 0.99769, 1.00775, 1.00416, 1.00092, 1.00216, 1.00785, 1.00947, ~
$ SR <dbl> 13.36123, 14.19313, 16.77778, 15.61353, 15.50718, 13.50220, 18.57303, 20.86585, 19.03889, 16.38647,~
$ RGI <dbl> 0.62881, 0.64365, 0.57225, 0.58807, 0.56640, 0.68788, 0.61379, 0.50307, 0.53892, 0.62349, 0.64308, ~
$ TVI <int> 173720, 189600, 193360, 187300, 188320, 174400, 192160, 201960, 200980, 196100, 182000, 180660, 166~
$ MSR <dbl> 3.65530, 3.76738, 4.09607, 3.95140, 3.93792, 3.67453, 4.30964, 4.56792, 4.36336, 4.04802, 3.89147, ~
$ PRI <dbl> -0.25784, -0.27690, -0.31309, -0.25939, -0.26154, -0.23364, -0.27473, -0.33881, -0.31755, -0.29183,~
$ GNDVI <dbl> 0.78727, 0.80267, 0.81134, 0.80357, 0.79557, 0.80560, 0.83871, 0.82604, 0.82239, 0.82170, 0.81375, ~
$ PSRI <dbl> -0.04418, -0.03901, -0.04455, -0.04486, -0.04937, -0.03361, -0.03388, -0.04734, -0.04494, -0.03685,~
$ GCI <dbl> 7.40166, 8.13536, 8.60116, 8.18182, 7.78320, 8.28788, 10.40000, 9.49693, 9.26048, 9.21687, 8.73846,~
$ ID_PROJETO <int> 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245,~
$ PROJETO <chr> "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", ~
$ CD_TALHAO <chr> "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "00~
$ DATA_PLANT <chr> "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "~
$ ESPECIE <chr> "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", ~
$ ESPAC <chr> "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.~
$ AGE_1 <dbl> 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.~
$ AGE <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,~
Reference to a sample table in CSV: https://raw.githubusercontent.com/Leprechault/trash/main/my_ds_CSV.csv
My file schema is:
x FLOAT NULLABLE
y FLOAT NULLABLE
stand STRING NULLABLE
date STRING NULLABLE
B2 FLOAT NULLABLE
B3 FLOAT NULLABLE
B4 FLOAT NULLABLE
B8 FLOAT NULLABLE
NDVI FLOAT NULLABLE
SAVI FLOAT NULLABLE
SIPI FLOAT NULLABLE
SR FLOAT NULLABLE
RGI FLOAT NULLABLE
TVI INTEGER NULLABLE
MSR FLOAT NULLABLE
PRI FLOAT NULLABLE
GNDVI FLOAT NULLABLE
PSRI FLOAT NULLABLE
GCI FLOAT NULLABLE
ID_PROJETO INTEGER NULLABLE
PROJETO STRING NULLABLE
CD_TALHAO STRING NULLABLE
DATA_PLANT STRING NULLABLE
ESPECIE STRING NULLABLE
ESPAC STRING NULLABLE
AGE_1 FLOAT NULLABLE
AGE INTEGER NULLABLE
I try to do the query for just one variable (B2
) for the test and the desirable query is something like:
SELECT DISTINCT AGE, ESPAC
,PERCENTILE_DISC(B2,0.05) OVER(PARTITION BY AGE, ESPAC) AS P05_B2
,PERCENTILE_DISC(B2,0.95) OVER(PARTITION BY AGE, ESPAC) AS P95_B2
,MIN(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_min
,AVG(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_mean
,MAX(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_max
,stddev(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_sd
FROM `[PROJECT_ID].spectra_calibration.CMPC`
ORDER BY AGE, ESPAC
The basic idea is a final result calculated (MAX, MIN, SD and AVG) with the B2 values only with values > P05_B2 and < P95_B2. I'd like something with B2 > P05_B2 & B2 < P95_B2 conditions. My desirable output is:
# AGE ESPAC B2_mean B2_max B2_min B2_sd
# 1 -2 4X1.85 125. 175 75 14.2
# 2 -1 4X1.85 153. 300 67 34.0
# 3 0 4X1.85 419. 928. 71 274.
# 4 1 4X1.85 344. 683 129 83.4
# 5 11 3.5x2.14 137. 259 70 29.8
# 6 12 3.5x2.14 150. 298 67.5 23.6
# 7 13 3.5x2.14 130. 302 70 35.3
# ...
Please any help with this query construction?
Upvotes: 0
Views: 1346
Reputation: 1552
The expressions (
MIN(B2 > P05_B2 & B2 < P95_B2) ,
AVG(B2 > P05_B2 & B2 < P95_B2),
MAX(B2 > P05_B2 & B2 < P95_B2)
)
with operators like < , > will return a Binary value i.e True/False rather than providing the min, max and avg for the expression as per this doc.
For the functions like MIN(), MAX(),AVG(), STDDEV(), you need to specify a column so that the functions will scan the column and will provide the output.
For example :
Select MIN(AGE) as minmum_age from my-project.dataset2.tab1;
Here the MIN() function will scan through the AGE column and will provide the minimum value in that column i.e -2
The sample Query that you have provided throws errors when running in BigQuery.
Error:
For reference you can check the below modified query and output also.
Query :
select DISTINCT AGE, ESPAC,P05_B2 ,
MIN(B2 > P05_B2 AND B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS MIN_B2,
MAX(B2 > P05_B2 AND B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS MAX_B2
FROM(
SELECT DISTINCT AGE, ESPAC,B2
,PERCENTILE_DISC(B2,0.05) OVER(PARTITION BY AGE, ESPAC) AS P05_B2
,PERCENTILE_DISC(B2,0.95) OVER(PARTITION BY AGE, ESPAC) AS P95_B2
FROM `my-project.dataset2.tab1`
ORDER BY AGE, ESPAC
)
Output :
As per your requirement, I tried writing a query with your dataset and I am getting results similar to your output.
Could you try the same on your end and let me know if this works for you.
Query :
SELECT AGE,ESPAC,AVG (B2) AS B2_mean,
MAX(B2) AS B2_max,
MIN(B2) AS B2_min,
STDDEV(B2) AS B2_sd FROM (
SELECT *,
FROM (
SELECT *,
PERCENTILE_DISC(B2, 0.05) OVER () AS P05_B2,
PERCENTILE_DISC(B2, 0.95) OVER () AS P95_B2,
FROM `my-project.dataset2.tab1`
ORDER BY AGE,ESPAC ) WHERE B2 > P05_B2 AND B2 < P95_B2
) GROUP BY AGE,ESPAC ORDER BY AGE,ESPAC
Output :
Please note that the output presented above is based on the sample CSV file that is provided. The aggregated values might change based on the dataset.
Upvotes: 1