Meruem
Meruem

Reputation: 33

Quickest way to find partial string match between two pandas dataframes

I have two location-based pandas DataFrames.

df1: Which has a column that consists of a full address, such as "Avon Road, Ealing, London, UK". The address varies in format.

df1.address[0] --> "Avon Road, Ealing, London, UK"

df2: Which just has cities of UK, such as "London".

df2.city[5] --> "London"

I want to locate the city of the first dataframe, given the full address. This would go on my first dataframe as such.

df1.city[0] --> "London"

Approach 1: For each city in df2, check if df1 has those cities and stores the indexs of df1 and the city of df2 in a list.

I am not certain how i would go about doing this, but I assume i would use this code to figure out if there is a partial string match and locate the index's:

df1['address'].str.contains("London",na=False).index.values  

Approach 2: For each df1 address, check if any of the words match the cities in df2 and store the value of df2 in a list.

I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.

Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)

Upvotes: 2

Views: 629

Answers (1)

Jérôme Richard
Jérôme Richard

Reputation: 50806

Approach 2 is indeed a good start. However, using a Python dictionary rather than a list should be much faster. Here is an example code:

cityIndex = set(df2.city)

addressLocations = []
for address in df1.address:
    location = None
    # Warning: ignore characters like '-' in the cities
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)
df1['city'] = addressLocations

Upvotes: 2

Related Questions