Reputation: 2122
I posted a question along the same lines yesterday. This is a slightly modified version of it. previous question here.
I have 2 dataframes as follows:
data1
looks like this:
id address
1 11123451
2 78947591
data2
looks like the following:
lowerbound_address upperbound_address place
78392888 89000000 X
10000000 20000000 Y
I want to create another column in data1 called "place" which contains the place the id is from. There will be many ids coming from the same place. And some ids don't have a match.
The addresses here are float values.
What I am actually looking for in Python
is an equivalent of this in R
. It's easier to code the following in R
. But I am unsure of how to code this in Python
. Can someone help me with this?
data_place = rep(NA, nrow(data1))
for (i in (1:nrow(data1)){
tmp = as.character(data2[data1$address[i] >= data2$lowerbound_address & data1$address[i] <= data2$upperbound_address, "place"])
if(length(tmp)==1) {data_place[i] = tmp}
}
data$place = data_place
Upvotes: 0
Views: 421
Reputation: 61967
Do a merge_asof and then replace all those times that the address is out of bounds with nan.
data1.sort_values('address', inplace = True)
data2.sort_values('lowerbound_address', inplace=True)
data3 = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
data3['place'] = data3['place'].where(data3.address <= data3.upperbound_address)
data3.drop(['lowerbound_address', 'upperbound_address'], axis=1)
Output
id address place
0 1 11123451 Y
1 3 50000000 NaN
2 2 78947591 X
Upvotes: 0
Reputation: 15953
Something like this would work.
import pandas as pd
import numpy as np
# The below section is only used to import data
from io import StringIO
data = """
id address
1 11123451
2 78947591
3 50000000
"""
data2 = """
lowerbound_address upperbound_address place
78392888 89000000 X
10000000 20000000 Y
"""
# The above section is only used to import data
df = pd.read_csv(StringIO(data), delimiter='\s+')
df2 = pd.read_csv(StringIO(data2), delimiter='\s+')
df['new']=np.nan
df['new'][(df['address'] > df2['lowerbound_address'][0]) & (df['address'] < df2['upperbound_address'][0])] = 'X'
df['new'][(df['address'] > df2['lowerbound_address'][1]) & (df['address'] < df2['upperbound_address'][1])] = 'Y'
In addition to pandas
, we used numpy
for np.nan
.
All I have done was create a new column and assign NaN
to it. Then created two criteria to assign either X
or 'Y' based on the upper and lower boundaries in the second data (last two lines).
Final results:
id address new
0 1 11123451 Y
1 2 78947591 X
2 3 50000000 NaN
Upvotes: 2