Leprechault
Leprechault

Reputation: 1823

Standard deviation, mean, max and min extraction just inside 5-95 quantiles values

I'd like to extraction the mean, max, min and sd extraction inside 5-95 quantiles for the variables B2, B3, B4, B8, NDVI, SAVI, SIPI, SR, RGI, TVI, MSR, PRI, GNDVI, PSRI, GCI aggregate by AGE and ESPAC variables inside a CMPC table:

My CMPC SQL table ([PROJECT_ID].spectra_calibration.CMPC) create inside BigQuery:

Rows: 55.310
Columns: 27
Database: BigQueryConnection
$ x          <dbl> -52.5502, -52.5501, -52.5501, -52.5501, -52.5501, -52.5500, -52.5500, -52.5500, -52.5500, -52.5500,~
$ y          <dbl> -30.8295, -30.8297, -30.8296, -30.8295, -30.8294, -30.8298, -30.8297, -30.8296, -30.8295, -30.8294,~
$ stand      <chr> "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO0~
$ date       <chr> "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "~
$ B2         <dbl> 213, 205, 181, 207, 216, 205, 165, 161, 173, 182, 181, 259, 227, 190, 153, 147, 160, 164, 194, 210,~
$ B3         <dbl> 361.0, 362.0, 346.0, 352.0, 369.0, 330.0, 290.0, 326.0, 334.0, 332.0, 325.0, 375.0, 352.0, 307.0, 2~
$ B4         <dbl> 227.0, 233.0, 198.0, 207.0, 209.0, 227.0, 178.0, 164.0, 180.0, 207.0, 209.0, 267.0, 269.0, 194.0, 1~
$ B8         <dbl> 3033.0, 3307.0, 3322.0, 3232.0, 3241.0, 3065.0, 3306.0, 3422.0, 3427.0, 3392.0, 3165.0, 3206.0, 298~
$ NDVI       <dbl> 0.86074, 0.86836, 0.88750, 0.87962, 0.87884, 0.86209, 0.89782, 0.90853, 0.90019, 0.88497, 0.87611, ~
$ SAVI       <dbl> 4549.379, 4960.386, 4982.905, 4847.897, 4861.397, 4597.380, 4958.915, 5132.925, 5140.417, 5087.903,~
$ SIPI       <dbl> 1.00499, 1.00911, 1.00544, 1.00000, 0.99769, 1.00775, 1.00416, 1.00092, 1.00216, 1.00785, 1.00947, ~
$ SR         <dbl> 13.36123, 14.19313, 16.77778, 15.61353, 15.50718, 13.50220, 18.57303, 20.86585, 19.03889, 16.38647,~
$ RGI        <dbl> 0.62881, 0.64365, 0.57225, 0.58807, 0.56640, 0.68788, 0.61379, 0.50307, 0.53892, 0.62349, 0.64308, ~
$ TVI        <int> 173720, 189600, 193360, 187300, 188320, 174400, 192160, 201960, 200980, 196100, 182000, 180660, 166~
$ MSR        <dbl> 3.65530, 3.76738, 4.09607, 3.95140, 3.93792, 3.67453, 4.30964, 4.56792, 4.36336, 4.04802, 3.89147, ~
$ PRI        <dbl> -0.25784, -0.27690, -0.31309, -0.25939, -0.26154, -0.23364, -0.27473, -0.33881, -0.31755, -0.29183,~
$ GNDVI      <dbl> 0.78727, 0.80267, 0.81134, 0.80357, 0.79557, 0.80560, 0.83871, 0.82604, 0.82239, 0.82170, 0.81375, ~
$ PSRI       <dbl> -0.04418, -0.03901, -0.04455, -0.04486, -0.04937, -0.03361, -0.03388, -0.04734, -0.04494, -0.03685,~
$ GCI        <dbl> 7.40166, 8.13536, 8.60116, 8.18182, 7.78320, 8.28788, 10.40000, 9.49693, 9.26048, 9.21687, 8.73846,~
$ ID_PROJETO <int> 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245,~
$ PROJETO    <chr> "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", ~
$ CD_TALHAO  <chr> "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "00~
$ DATA_PLANT <chr> "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "~
$ ESPECIE    <chr> "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", ~
$ ESPAC      <chr> "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.~
$ AGE_1      <dbl> 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.~
$ AGE        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,~

Reference to a sample table in CSV: https://raw.githubusercontent.com/Leprechault/trash/main/my_ds_CSV.csv

My file schema is:

x   FLOAT   NULLABLE    
y   FLOAT   NULLABLE    
stand   STRING  NULLABLE    
date    STRING  NULLABLE    
B2  FLOAT   NULLABLE    
B3  FLOAT   NULLABLE    
B4  FLOAT   NULLABLE    
B8  FLOAT   NULLABLE    
NDVI    FLOAT   NULLABLE    
SAVI    FLOAT   NULLABLE    
SIPI    FLOAT   NULLABLE    
SR  FLOAT   NULLABLE    
RGI FLOAT   NULLABLE    
TVI INTEGER NULLABLE    
MSR FLOAT   NULLABLE    
PRI FLOAT   NULLABLE    
GNDVI   FLOAT   NULLABLE    
PSRI    FLOAT   NULLABLE    
GCI FLOAT   NULLABLE    
ID_PROJETO  INTEGER NULLABLE    
PROJETO STRING  NULLABLE    
CD_TALHAO   STRING  NULLABLE    
DATA_PLANT  STRING  NULLABLE    
ESPECIE STRING  NULLABLE    
ESPAC   STRING  NULLABLE    
AGE_1   FLOAT   NULLABLE    
AGE INTEGER NULLABLE 

I try to do the query for just one variable (B2) for the test and the desirable query is something like:

SELECT DISTINCT AGE, ESPAC
,PERCENTILE_DISC(B2,0.05) OVER(PARTITION BY AGE, ESPAC) AS P05_B2
,PERCENTILE_DISC(B2,0.95) OVER(PARTITION BY AGE, ESPAC) AS P95_B2
,MIN(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_min
,AVG(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_mean
,MAX(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_max
,stddev(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_sd
FROM `[PROJECT_ID].spectra_calibration.CMPC`
ORDER BY AGE, ESPAC

The basic idea is a final result calculated (MAX, MIN, SD and AVG) with the B2 values only with values > P05_B2 and < P95_B2. I'd like something with B2 > P05_B2 & B2 < P95_B2 conditions. My desirable output is:

#     AGE ESPAC    B2_mean B2_max B2_min B2_sd 
# 1    -2 4X1.85      125.   175    75    14.2    
# 2    -1 4X1.85      153.   300    67    34.0   
# 3     0 4X1.85      419.   928.   71   274.     
# 4     1 4X1.85      344.   683   129    83.4    
# 5    11 3.5x2.14    137.   259    70    29.8    
# 6    12 3.5x2.14    150.   298    67.5  23.6    
# 7    13 3.5x2.14    130.   302    70    35.3    
# ...

Please any help with this query construction?

Upvotes: 0

Views: 1346

Answers (1)

Sandeep Mohanty
Sandeep Mohanty

Reputation: 1552

The expressions (

MIN(B2 > P05_B2 & B2 < P95_B2) ,

AVG(B2 > P05_B2 & B2 < P95_B2),
 
MAX(B2 > P05_B2 & B2 < P95_B2)

)

with operators like < , > will return a Binary value i.e True/False rather than providing the min, max and avg for the expression as per this doc.

For the functions like MIN(), MAX(),AVG(), STDDEV(), you need to specify a column so that the functions will scan the column and will provide the output.

For example :

Select MIN(AGE) as minmum_age from my-project.dataset2.tab1;

Here the MIN() function will scan through the AGE column and will provide the minimum value in that column i.e -2

The sample Query that you have provided throws errors when running in BigQuery.

Error:

enter image description here

For reference you can check the below modified query and output also.

Query :

select DISTINCT AGE, ESPAC,P05_B2 ,
MIN(B2 > P05_B2 AND B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS MIN_B2,
MAX(B2 > P05_B2 AND B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS MAX_B2
FROM(
SELECT DISTINCT AGE, ESPAC,B2
,PERCENTILE_DISC(B2,0.05) OVER(PARTITION BY AGE, ESPAC) AS P05_B2
,PERCENTILE_DISC(B2,0.95) OVER(PARTITION BY AGE, ESPAC) AS P95_B2
FROM `my-project.dataset2.tab1`
ORDER BY AGE, ESPAC
)

Output :

enter image description here

As per your requirement, I tried writing a query with your dataset and I am getting results similar to your output.

Could you try the same on your end and let me know if this works for you.

Query :

SELECT AGE,ESPAC,AVG (B2) AS B2_mean,

MAX(B2)  AS B2_max,

MIN(B2)  AS B2_min,

STDDEV(B2)  AS B2_sd  FROM (

SELECT *,

FROM (

SELECT *,

PERCENTILE_DISC(B2, 0.05)  OVER () AS P05_B2,

PERCENTILE_DISC(B2, 0.95)  OVER () AS P95_B2,

FROM `my-project.dataset2.tab1` 
ORDER BY AGE,ESPAC ) WHERE B2 > P05_B2 AND B2 < P95_B2

) GROUP BY AGE,ESPAC ORDER BY AGE,ESPAC

Output :

enter image description here

Please note that the output presented above is based on the sample CSV file that is provided. The aggregated values might change based on the dataset.

Upvotes: 1

Related Questions