Reputation: 33
I have two location-based pandas DataFrames.
df1: has a column containing a full address, such as "Avon Road, Ealing, London, UK". The address format varies.
df1.address[0] --> "Avon Road, Ealing, London, UK"
df2: has just the cities of the UK, such as "London".
df2.city[5] --> "London"
I want to extract the city for each full address in the first DataFrame and store it in a new column, like this:
df1.city[0] --> "London"
Approach 1: For each city in df2, check whether any address in df1 contains that city, and store the matching df1 indices together with the df2 city in a list.
I am not certain how I would go about doing this, but I assume I would use something like this code to find partial string matches and locate the indices:
df1['address'].str.contains("London",na=False).index.values
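Put together, I imagine the whole of Approach 1 would look roughly like this (an untested sketch; city_matches is just a name I made up):

city_matches = []
for city in df2.city:
    # indices of df1 rows whose address contains this city as a substring
    idx = df1.index[df1['address'].str.contains(city, na=False, regex=False)]
    city_matches.append((city, idx.tolist()))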
Approach 2: For each df1 address, check whether any of its words match a city in df2 and store the matching df2 value in a list.
I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.
Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)
Upvotes: 2
Views: 629
Reputation: 50806
Approach 2 is indeed a good start. However, using a Python set (constant-time membership tests) rather than scanning a list should be much faster. Here is an example:
import re

# Put the cities in a set for O(1) membership tests
cityIndex = set(df2.city)
addressLocations = []
for address in df1.address:
    location = None
    # Note: this tokenization drops characters like '-' inside city names
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)
df1['city'] = addressLocations
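If you would rather stay vectorized in pandas, another option is to build a single regex alternation of the city names and use str.extract. This is only a sketch and assumes you want the first city name that appears in each address:

import re

# Sketch: one alternation pattern over all (escaped) city names, matched as
# whole words; str.extract returns the first match per row, or NaN if none.
pattern = r'\b(' + '|'.join(re.escape(c) for c in df2.city) + r')\b'
df1['city'] = df1['address'].str.extract(pattern, expand=False)

Unlike the word-by-word loop, this variant can also match multi-word city names (e.g. "Milton Keynes"), but with many cities the alternation grows large, so whether it beats the explicit loop over a set depends on your data.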
Upvotes: 2