KLNG
KLNG

Reputation: 23

Merging two DataFrames in Pandas, based on conditions

I have 2 DataFrames:

      siccd   date    retp  
0     2892  31135 -0.036296   
1     2892  31226  0.144768   
2     2892  31320  0.063924   
3     1650  31412 -0.009190   
4     1299  31502  0.063326   

and

    start   end  ind indname
0     100   999    1   Agric
1    1000  1299    2   Mines
2    1300  1399    3     Oil
3    1400  1499    4   Stone
4    1500  1799    5   Cnstr
5    2000  2099    6    Food
6    2100  2199    7   Smoke
7    2200  2299    8   Txtls
8    2300  2399    9   Apprl
9    2400  2499   10    Wood
10   2500  2599   11   Chair
11   2600  2661   12   Paper
12   2700  2799   13   Print
13   2800  2899   14   Chems
14   2900  2999   15   Ptrlm
15   3000  3099   16   Rubbr
16   3100  3199   17   Lethr
17   3200  3299   18   Glass
18   3300  3399   19  Metal 

The task is to take the df1['siccd'] column, compare it to the df2['start'] and df2['end'] column. If (start <= siccd <= end), assign the ind and indname values of that respective index in the second DataFrame to the first DataFrame. The output would look something like:

      siccd   date    retp   ind  indname
0     2892  31135 -0.036296  14   Chems
1     2892  31226  0.144768  14   Chems
2     2892  31320  0.063924  14   Chems
3     1650  31412 -0.009190  5    Cnstr
4     1299  31502  0.063326  2    Mines

I've tried doing this with crude nested for loops, and it provides me with the correct lists that I can append to the end of the DataFrame, however this is extremely inefficient and given the data set's length it is inadequate.

siccd_lst = list(tmp['siccd'])
ind_lst = []
indname_lst = []
def categorize(siccd, df, index):
    if (siccd >= df.iloc[index]['start']) and (siccd <= df.iloc[index]['end']):
        ind_lst.append(df.iloc[index]['ind'])
        indname_lst.append(df.iloc[index]['indname'])
    else:
        pass

for i in range(0, len(ff38.index)-1):
    [categorize(x, ff38, i) for x in siccd_lst]

I have also attempted to vectorize the problem, however, I could not figure out how to iterate through the entire df2 when "searching" for the correct ind and indname to assign to the first DataFrame.

Upvotes: 2

Views: 71

Answers (1)

ALollz
ALollz

Reputation: 59549

Intervals

We'll create a DataFrame where the index is the Interval and the columns are the values we'll want to map, then we can use .loc with that DataFrame to bring over the data.

If any of your 'siccd' values lie outside of all intervals you will get a KeyError, so this method won't work.

dfi = pd.DataFrame({'indname': df2['indname'].to_numpy(), 'ind': df2['ind'].to_numpy()}, 
                   index=pd.IntervalIndex.from_arrays(left=df2.start, right=df2.end, closed='both'))

df1[['indname', 'ind']] = dfi.loc[df1.siccd].to_numpy()

Merge

You can perform the full merge (all rows in df1 with all rows in df2) using a temporary column ('t') and then filter the result where it's in between the values.

Since your second DataFrame seems to have a small number of non-overlapping ranges, the result of the merge shouldn't be prohibitively large, in terms of memory, and the non-overlapping ranges ensure the filtering will result in at most one row remaining for each original row in df1.

If any of your 'siccd' values lie outside of all intervals the row from the original DataFrame will get dropped.

res = (df1.assign(t=1)
          .merge(df2.assign(t=1), on='t', how='left')
          .query('siccd >= start & siccd <= end')
          .drop(columns=['t', 'start', 'end']))

#     siccd   date      retp  ind indname
#13   2892  31135 -0.036296   14   Chems
#32   2892  31226  0.144768   14   Chems
#51   2892  31320  0.063924   14   Chems
#61   1650  31412 -0.009190    5   Cnstr
#77   1299  31502  0.063326    2   Mines

If you expect values to lie outside of some of the intervals, modify the merge. Bring along the original index, subset, which drops those rows and use combine_first to add them back after the merge. I added a row at the end with a 'siccd' of 252525 as a 6th row to your original df1:

res = (df1.reset_index().assign(t=1)
          .merge(df2.assign(t=1), on='t', how='left')
          .query('siccd >= start & siccd <= end')
          .drop(columns=['t', 'start', 'end'])
          .set_index('index')
          .combine_first(df1)  # Adds back rows, based on index,
      )                        # that were outside any Interval 

#      date   ind indname      retp     siccd
#0  31135.0  14.0   Chems -0.036296    2892.0
#1  31226.0  14.0   Chems  0.144768    2892.0
#2  31320.0  14.0   Chems  0.063924    2892.0
#3  31412.0   5.0   Cnstr -0.009190    1650.0
#4  31502.0   2.0   Mines  0.063326    1299.0
#5  31511.0   NaN     NaN  0.151341  252525.0

Upvotes: 3

Related Questions