klex52s

Reputation: 437

Generate Random Coordinates in United States?

I want to generate a random set of latitude and longitude coordinates in the US (including Hawaii and Alaska). I tried using a shapefile from the National Weather Service (https://www.weather.gov/gis/USstates) but it was generating points in the middle of the ocean. What is the best way of doing this? I thought about defining my own polygon in the interior US but that would exclude some states. I’ve also seen other similar questions where they used a CSV list of US cities, but I’d rather it be completely random.

Upvotes: 2

Views: 1560

Answers (1)

Gene Burinsky

Reputation: 10203

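This one requires geopandas, but it's a quick and standard solution for sampling within odd shapes (rejection sampling, a Monte Carlo method). The idea is to draw points uniformly from the shape's bounding box and keep only those that fall inside the shape. Most of the comments below the question outline the same concept.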

Solution

# grab shape within which to sample
url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_nation_20m.zip"
us = gpd.read_file(url).explode()
## filter out parts of the US that are far away from mainland, I have no idea what they are (Guam islands?)
us = us.loc[us.geometry.apply(lambda x: x.exterior.bounds[2])<-60]

# grab bounding box within which to generate random numbers
x_min,y_min,x_max,y_max = us.geometry.unary_union.bounds 

# the sampling
np.random.seed(2) # set seed (needed for reproducible results)
N = 10000
rndn_sample = pd.DataFrame({'x':np.random.uniform(x_min,x_max,N),'y':np.random.uniform(y_min,y_max,N)}) # actual generation
# re-save results in a geodataframe
rndn_sample = gpd.GeoDataFrame(rndn_sample, geometry = gpd.points_from_xy(x=rndn_sample.x, y=rndn_sample.y),crs = us.crs)

# filtering
us_shape = us.geometry.unary_union # merge the parts once, rather than once per point
inUS = rndn_sample.within(us_shape) # check if within the U.S. bounds
rndn_sample.loc[inUS,:].plot() # plot for visual inspection of results

Explanation

Grab US outline within which we want to randomly sample

# grab shapefile of the US from an official source
url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_nation_20m.zip"
us = gpd.read_file(url).explode() 

Note that explode() expands the multi-part polygon into separate rows. This makes it easier to filter the area we are interested in, because we can grab the bounds of each part of the multi-part polygon, as below. The value -60 is just an approximate cutoff slightly east of the easternmost part of the US that is kept (Puerto Rico). Feel free to decrease it to exclude PR.

## filter out parts of the US that are far away from mainland, I have no idea what they are (Guam islands?)
us = us.loc[us.geometry.apply(lambda x: x.exterior.bounds[2])<-60]

Generate bounds for box within which to generate random coordinates

# grab bounding box within which to generate random numbers
x_min,y_min,x_max,y_max = us.geometry.unary_union.bounds # save min and max x/y coords

Note that unary_union re-combines the individual rows into a single multi-part polygon, and bounds grabs the min/max of the x and y coordinates on the filtered subset of the U.S. (i.e., without the far-away islands).

Sample within the box

np.random.seed(2) # set seed (needed for reproducible results)
N = 10000
rndn_sample = pd.DataFrame({'x':np.random.uniform(x_min,x_max,N),'y':np.random.uniform(y_min,y_max,N)}) # actual generation
# re-save results in a geodataframe
rndn_sample = gpd.GeoDataFrame(rndn_sample, geometry = gpd.points_from_xy(x=rndn_sample.x, y=rndn_sample.y),crs = us.crs)
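
One caveat on "uniform": drawing latitude uniformly in degrees is uniform in the lon/lat plane, not per unit of ground area, so high latitudes (Alaska) get somewhat more points per square kilometre than low ones. If area-uniform points are wanted, one option is to draw the sine of the latitude uniformly instead. The sketch below is dependency-free and assumes a spherical Earth; the function name and the bounding box values are hypothetical stand-ins, not values read from the shapefile.

```python
import math
import random

def sample_lonlat_equal_area(n, x_min, y_min, x_max, y_max, seed=None):
    """Draw n (lon, lat) pairs in degrees, uniform per unit area on a sphere."""
    rng = random.Random(seed)
    # On a sphere, surface area is proportional to the difference in sin(latitude),
    # so sampling sin(latitude) uniformly gives area-uniform latitudes.
    s_min = math.sin(math.radians(y_min))
    s_max = math.sin(math.radians(y_max))
    points = []
    for _ in range(n):
        lon = rng.uniform(x_min, x_max)
        lat = math.degrees(math.asin(rng.uniform(s_min, s_max)))
        points.append((lon, lat))
    return points

# Hypothetical bounding box (assumed values for illustration only)
pts = sample_lonlat_equal_area(1000, -180.0, 18.0, -65.0, 72.0, seed=2)
```

The resulting x/y values could replace the two np.random.uniform draws above; the within-the-US filter that follows stays the same.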

Check which random points are within US bounds

us_shape = us.geometry.unary_union # merge the parts once, rather than once per point
inUS = rndn_sample.within(us_shape) # check if within the U.S. bounds
rndn_sample.loc[inUS,:].plot() # plot for visual inspection of results
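
Because points outside the shape are discarded, the filter leaves fewer than N points. If exactly N points are needed, the draw-and-filter step can be repeated until enough have been accepted. Here is a minimal, dependency-free sketch of that loop on a toy shape. The names sample_exactly and in_unit_disk are hypothetical, and the unit disk stands in for the US outline (with geopandas, the contains check would be the within test against the merged US shape).

```python
import random

def sample_exactly(n, bounds, contains, seed=None):
    """Rejection-sample until exactly n points satisfy contains(x, y)."""
    x_min, y_min, x_max, y_max = bounds
    rng = random.Random(seed)
    accepted = []
    drawn = 0  # total number of candidate points generated
    while len(accepted) < n:
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        drawn += 1
        if contains(x, y):  # keep only candidates inside the shape
            accepted.append((x, y))
    return accepted, drawn

def in_unit_disk(x, y):
    """Toy stand-in for the point-in-US test."""
    return x * x + y * y <= 1.0

points, drawn = sample_exactly(2000, (-1.0, -1.0, 1.0, 1.0), in_unit_disk, seed=2)
acceptance_rate = len(points) / drawn  # fraction of candidates that were kept
```

The acceptance rate approximates the ratio of the shape's area to the bounding box's area, so the expected number of draws is about N divided by that ratio.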

Side note

Btw, here are the needed libraries in case it's ambiguous

# load libraries
import pandas as pd
import geopandas as gpd
import numpy as np

Notes On Sampling Efficiency

  1. The smaller the box within which we are sampling, the more efficient the sampler (only about 20% of sampled points land within U.S. territories in the current answer).
  2. A more efficient solution would be to create a separate box for each of the areas (i.e., continental US, Hawaii, Alaska, etc.), decide randomly which one to sample from (i.e., draw Hawaii, Alaska, or continental US at random based on, say, their land size relative to the total US size), then sample within that particular area. It's more work upfront but it could be significantly faster.
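
The second note can be sketched as follows. This is a dependency-free illustration under stated assumptions, not a drop-in replacement: the regions are hypothetical, non-overlapping toy disks (center x, center y, radius) standing in for the real polygons, and sample_by_region is a made-up name. Each point first picks a region with probability proportional to its area, then is rejection-sampled inside that region's own small box.

```python
import math
import random

# Hypothetical toy regions: disjoint disks given as (center_x, center_y, radius)
REGIONS = {
    "mainland": (0.0, 0.0, 3.0),
    "alaska": (-10.0, 8.0, 2.0),
    "hawaii": (-12.0, -4.0, 0.5),
}

def sample_by_region(n, regions, seed=None):
    """Pick a region weighted by area, then rejection-sample inside its box."""
    rng = random.Random(seed)
    names = list(regions)
    # Weighting by area keeps the combined sample uniform over all regions.
    weights = [math.pi * regions[name][2] ** 2 for name in names]
    samples = []
    for name in rng.choices(names, weights=weights, k=n):
        cx, cy, r = regions[name]
        while True:  # retry until a candidate lands inside this region
            x = rng.uniform(cx - r, cx + r)
            y = rng.uniform(cy - r, cy + r)
            if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2:
                samples.append((name, x, y))
                break
    return samples

samples = sample_by_region(5000, REGIONS, seed=2)
```

With real data, the weights would come from each part's polygon area and the inner check from a within test against that part, so the empty ocean between regions is never sampled.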

Upvotes: 7
