AntonioGL

Reputation: 19

select dataframe values based on a list

I have a dataframe holding the coordinates and elevation of 1260 points (index 0 to 1259).

df_elevation

      Longitud    Latitud     Elevación
0     -5.879263   42.579535   937
1     -5.879303   42.579535   937
2     -5.879342   42.579535   937
3     -5.879382   42.579535   937
4     -5.879422   42.579535   937
...   ...         ...         ...
1255  -5.880498   42.582213   933
1256  -5.880538   42.582213   933
1257  -5.880578   42.582213   933
1258  -5.880618   42.582213   933
1259  -5.880657   42.582213   933
1260 rows × 3 columns

I also have two lists, one of latitudes and one of longitudes, that together give the vertices of a polygon.

lat_list = [42.582213356031694, 42.57966169458114, 42.57945629314298, 42.582142258520136, 42.582213356031694]

lon_list = [-5.880088806152344, -5.880657434463501, -5.879863500595092, -5.879262685775757, -5.880088806152344]

I want to select only the rows of the dataframe that fall inside this polygon or, equivalently, delete the rows that fall outside it.

Upvotes: 0

Views: 160

Answers (3)

Matthew Borish

Reputation: 3096

If you have a large dataset, I suggest using a spatial index, as it will greatly reduce processing time. GeoPandas has a slick implementation of the R-tree spatial index, which is explained very nicely by Geoff Boeing.

Here is an example that expands upon @RJ's answer. First we'll build the df again.

from shapely.geometry import Point, Polygon
import geopandas as gpd  # needed for GeoDataFrame and points_from_xy below
import pandas as pd

data = [ { "ID": 0, "Longitud": -5.879263, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 1, "Longitud": -5.879303, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 2, "Longitud": -5.879342, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 3, "Longitud": -5.879382, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 4, "Longitud": -5.879422, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 1255, "Longitud": -5.880498, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1256, "Longitud": -5.880538, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1257, "Longitud": -5.880578, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1258, "Longitud": -5.880618, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1259, "Longitud": -5.880657, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1260, "Longitud": -5.879323515030888, "Latitud": 42.58192907018969, "Elevación": 933 }, { "ID": 1261, "Longitud": -5.879799662054768, "Latitud": 42.58143025825665, "Elevación": 933 }, { "ID": 1262, "Longitud": -5.880003215470649, "Latitud": 42.58117728748368, "Elevación": 933 } ]
df = pd.DataFrame(data)
df = df.set_index('ID')

lat_list = [42.582213356031694, 42.57966169458114, 42.57945629314298, 42.582142258520136, 42.582213356031694]
lon_list = [-5.880088806152344, -5.880657434463501, -5.879863500595092, -5.879262685775757, -5.880088806152344]
polygon = Polygon(zip(lon_list, lat_list))

Next, we will create a geodataframe using gpd's points_from_xy().

gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['Longitud'], df['Latitud']))

To demonstrate the time savings a spatial index gives, we can inflate our geodataframe artificially. We repeat gdf in a list and then use pd.concat() to build a much larger geodataframe. This gives us 130,000 rows rather than 13.

gdf_list = [gdf] * 10000
gdf_cat = pd.concat(gdf_list)

Finally, we create the spatial index and then use it to return the rows with points inside the polygon. Note that the %%timeit cell magic in Jupyter does not keep variables assigned inside the cell, so rerun the code without it if you need the results.

%%timeit
spatial_index = gdf_cat.sindex
possible_matches_index = list(spatial_index.intersection(polygon.bounds))
possible_matches = gdf_cat.iloc[possible_matches_index]
precise_matches = possible_matches[possible_matches.intersects(polygon)]

Using the %%timeit magic command in Jupyter, we can see that the spatial index gives about a 3x speedup over the apply method.

1.95 s ± 48.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here is the check_polygon function used with apply() and an expansion of the original df.

def check_polygon(row):
    return Point(row['Longitud'], row['Latitud']).within(polygon)


df_list = [df] * 10000
df_cat = pd.concat(df_list)

After expanding we demonstrate speed without a spatial index.

%%timeit
df_cat['inpolygon'] = df_cat.apply(check_polygon, axis=1)
df_cat_slice = df_cat[df_cat['inpolygon']]

We can see it's quite a bit slower.

6.23 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
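
For intuition on why the index helps, the two-stage pattern above (coarse candidates from the polygon's bounds, then an exact geometric test) can be imitated with plain pandas. This is only a sketch of the logic: it does a linear bounding-box comparison rather than an R-tree lookup, so it does not reproduce the index's performance. Row 9999 is a hypothetical far-away point that is not in the original data; it is added only so the coarse filter has something to discard.

```python
import pandas as pd
from shapely.geometry import Point, Polygon

# Polygon vertices from the question (x = longitude, y = latitude).
lat_list = [42.582213356031694, 42.57966169458114, 42.57945629314298,
            42.582142258520136, 42.582213356031694]
lon_list = [-5.880088806152344, -5.880657434463501, -5.879863500595092,
            -5.879262685775757, -5.880088806152344]
polygon = Polygon(list(zip(lon_list, lat_list)))

# Rows 0 and 1260-1262 come from the sample data above.
# Row 9999 is HYPOTHETICAL: a far-away point added for illustration only.
df = pd.DataFrame(
    {
        "Longitud": [-5.879263, -5.879323515030888, -5.879799662054768,
                     -5.880003215470649, -5.0],
        "Latitud": [42.579535, 42.58192907018969, 42.58143025825665,
                    42.58117728748368, 42.0],
    },
    index=pd.Index([0, 1260, 1261, 1262, 9999], name="ID"),
)

# Stage 1 (coarse): keep rows inside the polygon's bounding box.
# Cheap, but it can still include points that are outside the polygon.
minx, miny, maxx, maxy = polygon.bounds
in_bbox = df["Longitud"].between(minx, maxx) & df["Latitud"].between(miny, maxy)
candidates = df[in_bbox]

# Stage 2 (exact): run the expensive geometric test on the candidates only.
exact = candidates.apply(
    lambda row: Point(row["Longitud"], row["Latitud"]).intersects(polygon),
    axis=1,
).astype(bool)
precise_matches = candidates[exact]

print(list(candidates.index))
print(list(precise_matches.index))
```

Row 0 passes the bounding-box stage but fails the exact test, which is why both stages are needed.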

Upvotes: 1

RJ Adriaansen

Reputation: 9649

You can use shapely to create points and a polygon, and then check whether each point is in the polygon with within. In this example I run each row through a function that creates an extra column indicating whether the point is in the polygon or not. You can then filter the df on that column. Note that I added some sample data, because none of the points in your sample df are actually inside the polygon:

from shapely.geometry import Point, Polygon
import pandas as pd

data = [ { "ID": 0, "Longitud": -5.879263, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 1, "Longitud": -5.879303, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 2, "Longitud": -5.879342, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 3, "Longitud": -5.879382, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 4, "Longitud": -5.879422, "Latitud": 42.579535, "Elevación": 937 }, { "ID": 1255, "Longitud": -5.880498, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1256, "Longitud": -5.880538, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1257, "Longitud": -5.880578, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1258, "Longitud": -5.880618, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1259, "Longitud": -5.880657, "Latitud": 42.582213, "Elevación": 933 }, { "ID": 1260, "Longitud": -5.879323515030888, "Latitud": 42.58192907018969, "Elevación": 933 }, { "ID": 1261, "Longitud": -5.879799662054768, "Latitud": 42.58143025825665, "Elevación": 933 }, { "ID": 1262, "Longitud": -5.880003215470649, "Latitud": 42.58117728748368, "Elevación": 933 } ]
df = pd.DataFrame(data)
df = df.set_index('ID')

lat_list = [42.582213356031694, 42.57966169458114, 42.57945629314298, 42.582142258520136, 42.582213356031694]
lon_list = [-5.880088806152344, -5.880657434463501, -5.879863500595092, -5.879262685775757, -5.880088806152344]
polygon = Polygon(zip(lon_list, lat_list))

def check_polygon(row):
    return Point(row['Longitud'], row['Latitud']).within(polygon)

df['inpolygon'] = df.apply(check_polygon, axis=1)
df = df[df['inpolygon']]

Output:

ID    Longitud   Latitud   Elevación  inpolygon
1260  -5.87932   42.5819   933        True
1261  -5.8798    42.5814   933        True
1262  -5.88      42.5812   933        True

Upvotes: 2

Cheta

Reputation: 1

I would first create a shapely polygon from the coordinates using shapely.geometry.Polygon, then convert all the coordinates into shapely.geometry.Point objects and use the polygon's contains() method to see which points are inside it. Then you simply index out the rest. You can also do this using geopandas, but that's optional.

You can see how the contains() method works here
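
A minimal sketch of the steps described above, assuming the column names from the question (Longitud, Latitud, Elevación). The handful of rows is borrowed from the sample data in the other answers, and the names points, mask and df_inside are just illustrative choices.

```python
import pandas as pd
from shapely.geometry import Point, Polygon

# Polygon vertices from the question (x = longitude, y = latitude).
lat_list = [42.582213356031694, 42.57966169458114, 42.57945629314298,
            42.582142258520136, 42.582213356031694]
lon_list = [-5.880088806152344, -5.880657434463501, -5.879863500595092,
            -5.879262685775757, -5.880088806152344]
polygon = Polygon(list(zip(lon_list, lat_list)))

# A few rows borrowed from the sample data used in the other answers.
df = pd.DataFrame(
    {
        "Longitud": [-5.879263, -5.880657, -5.879323515030888,
                     -5.879799662054768, -5.880003215470649],
        "Latitud": [42.579535, 42.582213, 42.58192907018969,
                    42.58143025825665, 42.58117728748368],
        "Elevación": [937, 933, 933, 933, 933],
    },
    index=pd.Index([0, 1259, 1260, 1261, 1262], name="ID"),
)

# Convert every coordinate pair to a Point and test it against the polygon.
points = [Point(lon, lat) for lon, lat in zip(df["Longitud"], df["Latitud"])]
mask = pd.Series([polygon.contains(p) for p in points],
                 index=df.index, dtype=bool)

# Boolean indexing keeps only the rows whose point lies inside the polygon.
df_inside = df[mask]
print(df_inside)
```

Because polygon.contains(p) is the mirror of p.within(polygon), this should select the same rows as the within-based answer.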

Upvotes: 0
