Reputation: 2907
So I have read https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html
But I was wondering if there was a #standardSQL
way of doing it. So far, I have a lot of challenge converting PARSE_IP and NTH() since the suggested changes in the migration docs have limitations.
Going from PARSE_IP(contributor_ip)
to NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip))
does not work for IPv6 IP addresses.
Going from NTH(1, latitude) lat
to latitude[SAFE_ORDINAL(1)]
does not work since latitude is considered a string.
And there might be more migration problems that I have yet to encounter. Does anyone know how to transform IP addresses into geolocation in BigQuery standard SQL?
P.S. How would I go from geolocation to determining timezone?
edit: So what is the difference between this
#legacySQL
SELECT
COUNT(*) c,
city,
countryLabel,
NTH(1, latitude) lat,
NTH(1, longitude) lng
FROM (
SELECT
INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
FROM
[publicdata:samples.wikipedia]
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN EACH
[fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
and
SELECT
COUNT(*) c,
city,
countryLabel,
ANY_VALUE(latitude) lat,
ANY_VALUE(longitude) lng
FROM (
SELECT
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) AS INT64)
ELSE NULL
END AS clientIpNum,
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) / (256*256) AS INT64)
ELSE NULL
END AS classB
FROM
`publicdata.samples.wikipedia`
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
edit2: Seems like I manage to figure out the problem via not casting a float correctly. Right now, the standard SQL returns 41815 rows instead the 56347 rows from the legacy SQL which may be due to the lack of conversion from IPv6 to int for standard SQL, but it might be due to something else. Also the legacy SQL query performs much better, running at about 10 seconds instead of the full minute from the standard SQL.
Upvotes: 3
Views: 3341
Reputation: 353
The answer to this question is not valid for ipv6 addresses.
Following the approach described here https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 I came up with this solution:
WITH test_data AS (
SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL
SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
)
JOIN `demo_bq_dataset.geoip_city_v4`
USING (network_bin, mask)
), ipv6 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
)
JOIN `demo_bq_dataset.geoip_city_v6`
USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL
SELECT * FROM ipv6d
In order to get the geoip_city_v4
and geoip_city_v6
you need to download the geoip database from https://maxmind.com/
You can follow this tutorial in to update and prepare you dataset hodo.dev/posts/post-37-gcp-bigquery-geoip.
Upvotes: 0
Reputation: 705
According to https://gist.github.com/matsukaz/a145c2553a0faa59e32ad7c25e6a92f7
#standardSQL
SELECT
id,
IFNULL(city, 'Other') AS city,
IFNULL(countryLabel, 'Other') AS countryLabel,
latitude,
longitude
FROM (
SELECT
id,
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip)) AS clientIpNum,
TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip))/(256*256)) AS classB
FROM
`<project>.<dataset>.log` ) AS a
LEFT OUTER JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
ORDER BY
id ASC
Upvotes: 3