Chris
Chris

Reputation: 3715

Historical weather data BigQuery

I try to get weather data for 7 days prior to a given date and close to some coordinates (lat,lon). Something like 20km radius. If there are multiple stations I'd probably want to average the data grouped by day.

Is there a ways to compute all this directly with BigQuery? For testing I have calculated min and max coordinates and created the following query

SELECT
  *
FROM
  [bigquery-public-data:noaa_gsod.gsod2016] a
JOIN
  [bigquery-public-data:noaa_gsod.stations] b
ON
  a.stn=b.usaf
  AND a.wban=b.wban
WHERE
  (b.lat >= 46.248332
    AND b.lat <= 47.147654)
  AND (b.lon >= 5.689853
    AND b.lon <= 7.001115)
  AND a.mo='03'

I am not very happy with the query yet

This is how I compute the min max coordinates:

maxLat = lat + math.degrees(searchRadius / earthRadius)
minLat = lat - math.degrees(searchRadius / earthRadius)
maxLon = lon + math.degrees(searchRadius / earthRadius) / math.cos(math.radians(lat))
minLon = lon - math.degrees(searchRadius / earthRadius) / math.cos(math.radians(lat))

Upvotes: 1

Views: 373

Answers (2)

Willian Fuks
Willian Fuks

Reputation: 11787

This is the best solution I could come up with:

#standardSQL
CREATE TEMP FUNCTION distance(lat1 FLOAT64, lat2 FLOAT64, lon1 FLOAT64, lon2 FLOAT64) AS((
WITH data AS(
SELECT POW(SIN((ACOS(-1) / 180 * (lat1 -lat2)) / 2), 2) + COS(ACOS(-1) / 180 * (lat1)) * COS(ACOS(-1) / 180 * (lat2)) * POW(SIN((ACOS(-1) / 180 * (lon1 -lon2)) / 2), 2) a
)
SELECT 6371 * 2 * ATAN2(SQRT((SELECT a FROM data)), SQRT(1 - (SELECT a FROM data)))
));

WITH temperature_data AS(
SELECT
  CONCAT(year, mo, da) date,
  temp,
  b.lat lat,
  b.lon lon
FROM `bigquery-public-data.noaa_gsod.gsod2016` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.stn = b.usaf AND a.wban = b.wban
WHERE concat(year, mo, da) BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', '20160725'), INTERVAL 7 DAY)) AND '20160725'
)

SELECT
  date,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 20, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 20, temp, NULL)) AS std_temp) data_20km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 50, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 50, temp, NULL)) AS std_temp) data_50km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 100, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 100, temp, NULL)) AS std_temp) data_100km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 200, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 200, temp, NULL)) AS std_temp) data_200km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 500, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 500, temp, NULL)) AS std_temp) data_500km
FROM temperature_data t
WHERE
distance(t.lat, 10.1, t.lon, 10.2) < 2000
GROUP BY date
ORDER BY date

I'll try to explain along with your questions:

How can I get past 7 days of a specific date?

Inside the query temperature_data, notice there WHERE clause has the condition:

WHERE concat(year, mo, da) BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', '20160725'), INTERVAL 7 DAY)) AND '20160725'

This is where the last 7 days are selected from a given date. You can choose the date you want to analyze just by changing the value '20160725'.

Can max and min lat/lon be calculated directly through the query?

Yes. I imagine you mean by that is if it's possible to select spatial points within a given range (say 20km for instance). One way of doing so is defining a temporary function to compute distances between a desired point and the station points, which is expressed in the query by:

CREATE TEMP FUNCTION distance(lat1 FLOAT64, lat2 FLOAT64, lon1 FLOAT64, lon2 FLOAT64) AS((
WITH data AS(
SELECT POW(SIN((ACOS(-1) / 180 * (lat1 -lat2)) / 2), 2) + COS(ACOS(-1) / 180 * (lat1)) * COS(ACOS(-1) / 180 * (lat2)) * POW(SIN((ACOS(-1) / 180 * (lon1 -lon2)) / 2), 2) a
)
SELECT 6371 * 2 * ATAN2(SQRT((SELECT a FROM data)), SQRT(1 - (SELECT a FROM data)))
));

You can play around and test this function, for instance:

SELECT distance(50, 60, 30, 10) # result is ~ 1680km

This function is used here:

WHERE
distance(t.lat, 10.1, t.lon, 10.2) < 2000

to filter out points more than 2000km farther away from (10.1°, 10.2°). In your query, you can choose a different input value instead of (10.1°, 10.2°).

Very often it does not find any data because most likely the radius of 20km is too small to find a station. How can I modify the query to find the closest stations if it cannot find it within 20km radius?

One possible solution is querying for several different distances at once:

SELECT
  date,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 20, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 20, temp, NULL)) AS std_temp) data_20km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 50, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 50, temp, NULL)) AS std_temp) data_50km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 100, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 100, temp, NULL)) AS std_temp) data_100km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 200, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 200, temp, NULL)) AS std_temp) data_200km,
  STRUCT(AVG(IF(distance(t.lat, 10.1, t.lon, 10.2) < 500, temp, NULL)) AS avg_temp, STDDEV_SAMP(IF(distance(t.lat, 10.1, t.lon, 10.2) < 500, temp, NULL)) AS std_temp) data_500km
FROM temperature_data t
WHERE
distance(t.lat, 10.1, t.lon, 10.2) < 2000
GROUP BY date

Notice that this query is extracting station points ranging from the input point (10.1°, 10.2°) up to 2000km away. And then a filter is applied to select points within the ranges 20km, 50km, 100km, 200km and 500km.

You can change those values as you see fit. If you want to get the average temperature from another point, say (40°, 30°), just change the values (10.1, 10.2) to (40, 30) and you are good to go. Also, if you want different distances from this point, you can change the expressions IF(distance(t.lat, 10.1, t.lon, 10.2) < 200 for instance to a range that better suits your needs.

Notice that the WHERE clause has the condition:

distance(t.lat, 10.1, t.lon, 10.2) < 2000

So this is filtering out all stations farther away from the point (10.1, 10.2) by more than 2000km. You could also change this value as you see fit.

Final note on this one: I also brought the STDDEV_SAMP which is the standard deviation of a sampling. This might be of some value to you as well as it gives you an idea of how much the average is spreading around the mean (corrected by the sampling data size effect). The average by itself is not that valuable if we don't know how close we truly are to the correct value.

Is there better, free, historical weather data I can get?

Don't know. Hopefully this public dataset will turn out to be good enough for you.

Upvotes: 2

Lalaith7
Lalaith7

Reputation: 29

With the information you've given I'm not sure if you can compute the max/min data in query. Working in Legacy SQL I would probably try nesting multiple queries, or joining to a query that calculates them, or both.

You might also be able to write something that adjusts the search query when necessary but I'm just not getting the structure of what you're doing already enough to write a suggestion.

For the other questions:

Getting averages - rather than using * to call everything you're going to have to individually call which columns to average and which to ignore or group by.

Selecting the past 7 days of a specific date - it's very unfortunate there doesn't appear to be a timestamp column so you'll have to force one.

In LegacySQL I would write something like this:

SELECT dte, avg_temp, avg_cnt_temp
FROM 
(SELECT CAST(CONCAT(a.year, '-', a.mo, '-', a.da) AS timestamp) AS dte,
/* This is calling the separate year, month, and day strings as a 
datetime funtion so I can use date_add later */ 
AVG(a.temp) AS avg_temp, AVG(a.count_temp) AS avg_cnt_temp /* You'll 
want to include all of the data you're wanting to call here, I 
only tested with these two */
FROM [bigquery-public-data:noaa_gsod.gsod2016] AS a
JOIN [bigquery-public-data:noaa_gsod.stations] AS b
ON a.stn=b.usaf AND a.wban=b.wban
GROUP BY dte, mo, da)
WHERE dte >= (DATE_ADD('2016-12-31 00:00:00', -7, "DAY")) AND dte <= 
TIMESTAMP('2016-12-31 00:00:00') /* replace with your date */

I think in standard SQL you don't nest the same way.

Don't call a station identifier if you want to combine data across stations, etc.

Upvotes: 0

Related Questions