Reputation: 23
I have 2 DataFrames:
siccd date retp
0 2892 31135 -0.036296
1 2892 31226 0.144768
2 2892 31320 0.063924
3 1650 31412 -0.009190
4 1299 31502 0.063326
and
start end ind indname
0 100 999 1 Agric
1 1000 1299 2 Mines
2 1300 1399 3 Oil
3 1400 1499 4 Stone
4 1500 1799 5 Cnstr
5 2000 2099 6 Food
6 2100 2199 7 Smoke
7 2200 2299 8 Txtls
8 2300 2399 9 Apprl
9 2400 2499 10 Wood
10 2500 2599 11 Chair
11 2600 2661 12 Paper
12 2700 2799 13 Print
13 2800 2899 14 Chems
14 2900 2999 15 Ptrlm
15 3000 3099 16 Rubbr
16 3100 3199 17 Lethr
17 3200 3299 18 Glass
18 3300 3399 19 Metal
The task is to take the df1['siccd']
column, compare it to the df2['start']
and df2['end']
column. If (start <= siccd <= end), assign the ind and indname values of that respective index in the second DataFrame to the first DataFrame. The output would look something like:
siccd date retp ind indname
0 2892 31135 -0.036296 14 Chems
1 2892 31226 0.144768 14 Chems
2 2892 31320 0.063924 14 Chems
3 1650 31412 -0.009190 5 Cnstr
4 1299 31502 0.063326 2 Mines
I've tried doing this with crude nested for loops, and it provides me with the correct lists that I can append to the end of the DataFrame, however this is extremely inefficient and given the data set's length it is inadequate.
siccd_lst = list(tmp['siccd'])
ind_lst = []
indname_lst = []
def categorize(siccd, df, index):
if (siccd >= df.iloc[index]['start']) and (siccd <= df.iloc[index]['end']):
ind_lst.append(df.iloc[index]['ind'])
indname_lst.append(df.iloc[index]['indname'])
else:
pass
for i in range(0, len(ff38.index)-1):
[categorize(x, ff38, i) for x in siccd_lst]
I have also attempted to vectorize the problem, however, I could not figure out how to iterate through the entire df2 when "searching" for the correct ind and indname to assign to the first DataFrame.
Upvotes: 2
Views: 71
Reputation: 59549
We'll create a DataFrame where the index is the Interval and the columns are the values we'll want to map, then we can use .loc
with that DataFrame to bring over the data.
If any of your 'siccd'
values lie outside of all intervals you will get a KeyError
, so this method won't work.
dfi = pd.DataFrame({'indname': df2['indname'].to_numpy(), 'ind': df2['ind'].to_numpy()},
index=pd.IntervalIndex.from_arrays(left=df2.start, right=df2.end, closed='both'))
df1[['indname', 'ind']] = dfi.loc[df1.siccd].to_numpy()
You can perform the full merge (all rows in df1 with all rows in df2) using a temporary column ('t'
) and then filter the result where it's in between the values.
Since your second DataFrame seems to have a small number of non-overlapping ranges, the result of the merge shouldn't be prohibitively large, in terms of memory, and the non-overlapping ranges ensure the filtering will result in at most one row remaining for each original row in df1.
If any of your 'siccd'
values lie outside of all intervals the row from the original DataFrame will get dropped.
res = (df1.assign(t=1)
.merge(df2.assign(t=1), on='t', how='left')
.query('siccd >= start & siccd <= end')
.drop(columns=['t', 'start', 'end']))
# siccd date retp ind indname
#13 2892 31135 -0.036296 14 Chems
#32 2892 31226 0.144768 14 Chems
#51 2892 31320 0.063924 14 Chems
#61 1650 31412 -0.009190 5 Cnstr
#77 1299 31502 0.063326 2 Mines
If you expect values to lie outside of some of the intervals, modify the merge. Bring along the original index, subset, which drops those rows and use combine_first
to add them back after the merge. I added a row at the end with a 'siccd'
of 252525
as a 6th row to your original df1
:
res = (df1.reset_index().assign(t=1)
.merge(df2.assign(t=1), on='t', how='left')
.query('siccd >= start & siccd <= end')
.drop(columns=['t', 'start', 'end'])
.set_index('index')
.combine_first(df1) # Adds back rows, based on index,
) # that were outside any Interval
# date ind indname retp siccd
#0 31135.0 14.0 Chems -0.036296 2892.0
#1 31226.0 14.0 Chems 0.144768 2892.0
#2 31320.0 14.0 Chems 0.063924 2892.0
#3 31412.0 5.0 Cnstr -0.009190 1650.0
#4 31502.0 2.0 Mines 0.063326 1299.0
#5 31511.0 NaN NaN 0.151341 252525.0
Upvotes: 3