TvCasteren

Reputation: 425

Pandas: Most resource efficient way to apply function

I have two dataframes: one contains a column of points, the other a column of polygons. The data looks like this:

>>> df1
   Index            Point
0      1  POINT (100 400)
1      2  POINT (920 400)
2      3  POINT (111 222)

>>> df2
   Index    Area-ID                                            Polygon
0      1   New York  POLYGON ((226000 619000, 226000 619500, 226500...
1      2  Amsterdam  POLYGON ((226000 619000, 226000 619500, 226500...
2      3     Berlin  POLYGON ((226000 619000, 226000 619500, 226500...

Reproducible example:

import pandas as pd
import shapely.wkt

data = {'Index': [1, 2, 3],
        'Point': ['POINT (100 400)', 'POINT (920 400)', 'POINT (111 222)']}
df1 = pd.DataFrame(data)
df1['Point'] = df1['Point'].apply(shapely.wkt.loads)

data = {'Index': [1, 2, 3],
        'Area-ID': ['New York', 'Amsterdam', 'Berlin'],
        'Polygon': ['POLYGON ((90 390, 110 390, 110 410, 90 410, 90 390))',
                    'POLYGON ((890 390, 930 390, 930 410, 890 410, 890 390))',
                    'POLYGON ((110 220, 112 220, 112 225, 110 225, 110 220))']}
df2 = pd.DataFrame(data)
df2['Polygon'] = df2['Polygon'].apply(shapely.wkt.loads)

With shapely's function 'polygon.contains' I can check whether a polygon contains a certain point. The goal is to find the corresponding polygon for every point in dataframe 1.
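
As a quick illustration of how 'contains' behaves, here is a minimal sketch. The square reuses the coordinates of the first polygon in the reproducible example, and the variable names are only illustrative. Note that a point lying exactly on the boundary is not reported as contained, whereas 'covers' does include the boundary:

```python
from shapely.geometry import Point, Polygon

# Same coordinates as the 'New York' polygon in the example above
square = Polygon([(90, 390), (110, 390), (110, 410), (90, 410)])

inside = square.contains(Point(100, 400))     # point in the interior
on_edge = square.contains(Point(90, 400))     # point exactly on the boundary
outside = square.contains(Point(0, 0))        # point far away from the square

# covers() also accepts points that lie on the boundary
edge_covered = square.covers(Point(90, 400))

print(inside, on_edge, outside, edge_covered)
```

If boundary points should count as matches for this use case, 'covers' may be the better predicate.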

The following approach works, but it is far too slow because the datasets are very large:

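# Compare every point in df1 against every polygon in df2
for idx1, row1 in df1.iterrows():
    print(idx1)
    for _, row2 in df2.iterrows():
        if row2['Polygon'].contains(row1['Point']):
            # .loc writes to df1 itself; chained indexing would write to a copy
            df1.loc[idx1, 'Area-ID'] = row2['Area-ID']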

Is there a more time-efficient way to achieve this goal?

Upvotes: 1

Views: 154

Answers (1)

zabop

Reputation: 7912

If every point is contained in exactly one polygon (as is the case in the current form of the question), you can do:

df1 = df1.assign(
    cities=df1.Point.apply(
        lambda point: df2['Area-ID'].iloc[
            # position of the first polygon that contains the point
            [i for i, polygon in enumerate(df2.Polygon)
             if polygon.contains(point)][0]
        ]
    )
)

You'll get:

   Index            Point     cities
0      1  POINT (100 400)   New York
1      2  POINT (920 400)  Amsterdam
2      3  POINT (111 222)     Berlin
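
The list comprehension above raises an IndexError as soon as a point falls inside no polygon. Below is a minimal sketch of a variant that tolerates that case and prepares each polygon once with shapely's `prep`, which is intended to speed up repeated `contains` checks. It assumes the first matching polygon is the one you want. `find_area` is a hypothetical helper name, and the fourth point `POINT (0 0)` is added purely to illustrate the no-match case:

```python
import pandas as pd
import shapely.wkt
from shapely.prepared import prep

# Data from the question, plus one extra point (0 0) that lies in no polygon
df1 = pd.DataFrame({'Index': [1, 2, 3, 4],
                    'Point': ['POINT (100 400)', 'POINT (920 400)',
                              'POINT (111 222)', 'POINT (0 0)']})
df1['Point'] = df1['Point'].apply(shapely.wkt.loads)

df2 = pd.DataFrame({
    'Index': [1, 2, 3],
    'Area-ID': ['New York', 'Amsterdam', 'Berlin'],
    'Polygon': ['POLYGON ((90 390, 110 390, 110 410, 90 410, 90 390))',
                'POLYGON ((890 390, 930 390, 930 410, 890 410, 890 390))',
                'POLYGON ((110 220, 112 220, 112 225, 110 225, 110 220))']})
df2['Polygon'] = df2['Polygon'].apply(shapely.wkt.loads)

# Prepare every polygon once, paired with its Area-ID
prepared = [(prep(polygon), area)
            for polygon, area in zip(df2['Polygon'], df2['Area-ID'])]


def find_area(point):
    """Return the Area-ID of the first polygon containing point, else None."""
    for prepared_polygon, area in prepared:
        if prepared_polygon.contains(point):
            return area
    return None


df1['cities'] = df1['Point'].apply(find_area)
print(df1)
```

This still compares each point against the polygons one by one, so the cost grows with points × polygons. For very large inputs, a spatial index (for example a spatial join in GeoPandas) is the usual next step, which is not shown here.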

Upvotes: 1
