rbae
rbae

Reputation: 73

Comparing a value from one dataframe with values from columns in another dataframe and getting the data from third column

The title is bit confusing but I'll do my best to explain my problem here. I have 2 pandas dataframes, a and b:

>> print a

id | value
 1 | 250
 2 | 150
 3 | 350
 4 | 550
 5 | 450

>> print b

low | high | class
100 | 200  | 'A' 
200 | 300  | 'B' 
300 | 500  | 'A' 
500 | 600  | 'C' 

I want to create a new column called class in table a that contains the class of the value in accordance with table b. Here's the result I want:

>> print a

id | value | class
 1 | 250   | 'B'
 2 | 150   | 'A'
 3 | 350   | 'A'
 4 | 550   | 'C'
 5 | 450   | 'A'

I have the following code written that sort of does what I want:

a['class'] = pd.Series()
for i in range(len(a)):
    val = a['value'][i]
    cl = (b['class'][ (b['low'] <= val) \
                      (b['high'] >= val) ].iat[0])
    a['class'].set_value(i,cl)

Problem is, this is quick for tables length of 10 or so, but I am trying to do this with a table size of 100,000+ for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?

Upvotes: 7

Views: 2099

Answers (2)

Bow
Bow

Reputation: 1027

Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!

I pull data out from the pandas DataFrames and convert them to lists and then use np.where to populate a variable called "aclass" where the conditions are satified (in brute force for loops). Then I write "aclass" to the original data frame a.

The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!

# create 200,000 fake a data points
avalue = 100+600*np.random.random(200000) # assuming you extracted this from a with avalue = np.array(a['value'])

blow = [100,200,300,500] # assuming you extracted this from b with list(b['low'])
bhigh = [200,300,500,600] # assuming you extracted this from b with list(b['high'])
bclass = ['A','B','A','C'] # assuming you extracted this from b with list(b['class'])

aclass = [[]]*len(avalue) # initialize aclass

start_time = time.time() # this is just for timing the execution
for i in range(len(blow)):
    for j in np.where((avalue>=blow[i]) & (avalue<=bhigh[i]))[0]:
        aclass[j]=bclass[i]

# add the class column to the original a DataFrame
a['class'] = aclass

print("--- %s seconds ---" % np.round(time.time() - start_time,decimals = 8))

Upvotes: 2

Scott Boston
Scott Boston

Reputation: 153460

Here is a way to do a range join inspired by @piRSquared's solution:

A = a['value'].values
bh = b.high.values
bl = b.low.values

i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))

pd.DataFrame(
    np.column_stack([a.values[i], b.values[j]]),
    columns=a.columns.append(b.columns)
)

Output:

  id value  low high  class
0  1   250  200  300   'B' 
1  2   150  100  200   'A' 
2  3   350  300  500   'A' 
3  4   550  500  600   'C' 
4  5   450  300  500   'A'

Upvotes: 6

Related Questions