Matt W.

Reputation: 3722

Fast index search in numpy

I'm looking to grab values from one dataframe and append them to another based on index, and to do it fast. Let's say we have the following two dataframes:

df has a bunch of random numbers with cities as the column headers. df2 is a dataframe full of city names with (unimportant) letters as the headers.

In df2 at position df2.loc[0, 'c'] we have Los Angeles. I want to then go to df, find the Los Angeles column, take the value at index 0, and append it to df2 so that the cell reads Los Angeles : 544 (see the single-cell sketch after the sample frames below).

import pandas as pd
import numpy as np

np.random.seed(42)
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
cities = ['Minneapolis', 'Boston', 'Chicago', 'San Francisco', 'Los Angeles', 'Bozeman', 'New York', 'Austin',
          'Dallas', 'New Orleans']
# df: one column per city, filled with random numbers (cast to str so they can be concatenated later)
df = pd.DataFrame({city: np.random.randint(1, 1000, 10000) for city in cities})
df = df.astype(str)
# df2: letter-named columns filled with random city names
df2 = pd.DataFrame({k: [np.random.choice(cities) for i in range(10000)] for k in letters})

df
      Minneapolis  Boston  Chicago  San Francisco  Los Angeles
0             103     720      444            335          544
1             436     833      813            252          418
2             861     140      428            919          339
3             271     792       83            804          801
4             107     546      765            825          320

df2
                  a              b              c            d              e
0     San Francisco    Minneapolis    Los Angeles      Chicago    Los Angeles
1       Minneapolis  San Francisco    Los Angeles      Chicago  San Francisco
2       Minneapolis    Los Angeles         Boston      Chicago         Boston
3           Chicago        Chicago        Chicago       Boston    Los Angeles
4     San Francisco  San Francisco  San Francisco       Boston    Los Angeles
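
For a single cell, the mapping I'm after looks like this (a minimal sketch using the sample frames above):

city = df2.loc[0, 'c']        # 'Los Angeles'
value = df.loc[0, city]       # '544' (df was cast to str)
cell = city + ' : ' + value   # 'Los Angeles : 544'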

Expected Result:

df3

                        a                    b                    c  \
0     San Francisco : 335    Minneapolis : 103    Los Angeles : 544   
1       Minneapolis : 436  San Francisco : 252    Los Angeles : 418   
2       Minneapolis : 861    Los Angeles : 339         Boston : 140   
3            Chicago : 83         Chicago : 83         Chicago : 83   
4     San Francisco : 825  San Francisco : 825  San Francisco : 825  

I solved this problem with a double for loop, but it is taking too long since my real dataframe is (280k, 260) and I have to run this function twice. I'm looking for a way to do this in numpy. Ideally I could create a new numpy array that has the numbers as strings in the correct index locations, and then use np.char.add to add the words and numbers together matrix-wise. I'm just trying to figure out a quick way to index those values; a rough sketch of the indexing I have in mind follows the timing code below.

import time
s = time.time()
# original double loop: look up each cell's city in df one at a time
# (the bare except silently skips any cell where the lookup fails)
for col in df2.columns:
    for i in range(0, len(df2)):
        try:
            df2.loc[i, col] += ' : ' + str(np.round(df.loc[i, df2.loc[i, col]], 2))
        except:
            pass
print(time.time()-s)
16.95679211616516
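
Roughly, the vectorized indexing I have in mind looks something like this (just a sketch; it assumes every city in df2 appears as a column of df and that both frames keep the default 0..n-1 index):

rows = np.repeat(np.arange(len(df2)), df2.shape[1])     # row position for every cell of df2 (row-major)
cols = df.columns.get_indexer(df2.to_numpy().ravel())   # column position in df for each city name
nums = df.to_numpy()[rows, cols].reshape(df2.shape)     # the numbers (already strings) in the right spots
df3 = pd.DataFrame(np.char.add(np.char.add(df2.to_numpy().astype(str), ' : '), nums.astype(str)),
                   columns=df2.columns)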

Upvotes: 0

Views: 57

Answers (1)

Quang Hoang

Reputation: 150785

You can use lookup:

s = df2.stack().reset_index(name='val')   # one row per (row, letter) cell; 'val' holds the city name
df2 + ' : ' + df.lookup(s['level_0'], s['val']).reshape(df2.shape)

whose execution time on the 10000-row data is:

44 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Output for the truncated data:

                     a                    b                    c  \
0  San Francisco : 335    Minneapolis : 103    Los Angeles : 544   
1    Minneapolis : 436  San Francisco : 252    Los Angeles : 418   
2    Minneapolis : 861    Los Angeles : 339         Boston : 140   
3         Chicago : 83         Chicago : 83         Chicago : 83   
4  San Francisco : 825  San Francisco : 825  San Francisco : 825   

               d                    e  
0  Chicago : 444    Los Angeles : 544  
1  Chicago : 813  San Francisco : 252  
2  Chicago : 428         Boston : 140  
3   Boston : 792    Los Angeles : 801  
4   Boston : 546    Los Angeles : 320  
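
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on newer versions the same lookup can be written with plain numpy indexing, for example (a sketch, assuming df keeps its default 0..n-1 index):

rows = s['level_0'].to_numpy()              # row positions from the stacked frame
cols = df.columns.get_indexer(s['val'])     # column positions of the city names in df
df2 + ' : ' + df.to_numpy()[rows, cols].reshape(df2.shape)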

Upvotes: 2
