Reputation: 582
I'm working with a database called international_education
from the world_bank_intl_education
dataset of bigquery-public-data
.
FIELDS
country_name
country_code
indicator_name
indicator_code
value
year
My aim is to plot a line graph with countries who have had the biggest and smallest change in Population growth (annual %) (one of the indicator_name
values).
I have done this below using two partitions finding the first and last value of the year by each country but I'm rough on my SQL and am wondering if there is a way to optimize this formula.
query = """
WITH differences AS
(
SELECT country_name, year, value,
FIRST_VALUE(value)
OVER (
PARTITION BY country_name
ORDER BY year
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS small_value,
LAST_VALUE(value)
OVER (
PARTITION BY country_name
ORDER BY year
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS large_value
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)'
ORDER BY year
)
SELECT country_name, year, (large_value-small_value) AS total_range, value
FROM differences
ORDER BY total_range
"""
Convert to pandas dataframe.
df= wbed.query_to_pandas_safe(query)
df.head(10)
Resulting table.
country_name year total_range value
0 United Arab Emirates 1970 -13.195183 14.446942
1 United Arab Emirates 1971 -13.195183 16.881671
2 United Arab Emirates 1972 -13.195183 17.689814
3 United Arab Emirates 1973 -13.195183 17.695296
4 United Arab Emirates 1974 -13.195183 17.125615
5 United Arab Emirates 1975 -13.195183 16.211873
6 United Arab Emirates 1976 -13.195183 15.450884
7 United Arab Emirates 1977 -13.195183 14.530119
8 United Arab Emirates 1978 -13.195183 13.033461
9 United Arab Emirates 1979 -13.195183 11.071306
I would then plot this with python as follows.
all_countries = df.groupby('country_name', as_index=False).max().sort_values(by='total_range').country_name.values
countries = np.concatenate((all_countries[:3], all_countries[-4:]))
plt.figure(figsize=(16, 8))
sns.lineplot(x='year',y='value', data=df[df.country_name.isin(countries)], hue='country_name')
Upvotes: 0
Views: 127
Reputation: 655
If I understand correctly what you are trying to calculate, I wrote a query that do everything in BigQuery without the need to do anything in pandas. This query returns all the rows for each country that rank top 3 or bottom 3 in change in Population growth.
WITH differences AS
(
SELECT
country_name,
year,
value,
LAST_VALUE(value) OVER (PARTITION BY country_name ORDER BY year) - FIRST_VALUE(value) OVER (PARTITION BY country_name ORDER BY year) AS total_range,
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)'
ORDER BY year
)
, differences_with_ranks as (
SELECT
country_name,
year,
value,
total_range,
row_number() OVER (PARTITION BY country_name ORDER BY total_range) as rank,
FROM differences
)
, top_bottom as (
SELECT
country_name
FROM (
SELECT
country_name,
FROM differences_with_ranks
WHERE rank = 1
ORDER BY total_range DESC
LIMIT 3
)
UNION DISTINCT
SELECT
country_name
FROM (
SELECT
country_name,
FROM differences_with_ranks
WHERE rank = 1
ORDER BY total_range ASC
LIMIT 3
)
)
SELECT
*
FROM differences
WHERE country_name in (SELECT country_name FROM top_bottom)
I don't really understand what do you mean with "optimize", this query run very fast (1.5 seconds) if you need a solution with lower latency BigQuery is not the right solution.
Upvotes: 0
Reputation: 173190
thought there might have been a quicker way to write this query as seemed a bit long winded
I think below is the least verbose version (BigQuery Standard SQL)
#standardSQL
SELECT
country_name,
year,
(LAST_VALUE(value) OVER(win) - FIRST_VALUE(value) OVER(win)) AS total_range,
value
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)'
WINDOW win AS (PARTITION BY country_name ORDER BY YEAR RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY total_range
... producing same result as your original query
Upvotes: 2
Reputation: 1271003
You don't need the CTE and you don't need the window frame definitions. So this should be equivalent:
SELECT country_name, year, value,
(first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR DESC) -
first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR)
) as total_range
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)';
Note that LAST_VALUE()
is finicky with window frame definitions. So I routinely just use FIRST_VALUE()
with the order by reversed.
If you want just one row per country, then you need aggregation. BigQuery doesn't have "first" and "last" aggregation functions, but they are very easy to do with arrays:
SELECT country_name,
((array_agg(value ORDER BY year DESC LIMIT 1))[ordinal(1)] -
(array_agg(value ORDER BY year LIMIT 1))[ordinal(1)]
) as total_range
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)'
GROUP BY country_name
ORDER BY total_range;
Upvotes: 2