Reputation: 279
I'm looking the way to run function that take two dataframes as arguments df1 and df2.
What I want is to create a new column in df1 from the information in df2 without using a loop because my entire df1 is 3M rows and df2 700k rows.
For that I compare if the data the value of X
of df1 is included in the from
and the to
of df2
I tried with apply of pandas library but I got errors like:
ValueError: Can only compare identically-labeled Series objects
Here is the sample of my code.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})
df2 = pd.DataFrame({
'from': [1.0, 1.5, 1.8, 2.2, 2.6],
'to': [1.5, 1.8, 2.2, 2.6, 2.9],
'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo']
})
def asign(df1, df2):
if df1['Z'] == df2['Z']:
idx = np.where((df1[X] >= df2['from']) & (df1[X]<= df2['to']))[0]
df1['Y'] = df2['Y'][idx]
return df1
df1.groupby('Z').apply(asign, df2)
The output must be like:
>>> df1
out[0] :
X Z Y
0 1.0 F1 foo
1 1.1 F2 bar
2 1.2 F2 foobar
3 1.3 F1 foo
4 1.4 F1 foobar
5 1.5 F2 bar
6 1.6 F1 foo
7 1.7 F2 bar
The value of the column Y to be created in df1 is conditioned by the fact that the row belongs to the group Z either F1 or F2 and that the value of X is greater or equal to from and less than to Please can you help me to manage this? Thank you
Upvotes: 1
Views: 305
Reputation: 86
pd.cut()
The old solution below works well, but it might not be very efficient as it first creates a large data frame and then selects a subset of rows from it. This solution instead creates bins using pd.cut
and then merges the dataframes, directly creating the desired output.
In addition, this gives additional flexibility on how to make the merge.
df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})
df2 = pd.DataFrame({
'bmin': [1.0, 1.5, 1.8, 2.2, 2.6],
'bmax': [1.5, 1.8, 2.2, 2.6, 2.9],
'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo']
})
# Adding new column to the dataframes
bins = sorted(df2.bmin.unique()) + [df2.bmax.max()]
df1.loc[:, 'bin'] = pd.cut(
df1.X,
bins=bins,
labels=False, # Makes cut return int indices for the bins
include_lowest=True, # Otherwise 1.0 would be NaN
)
df2.loc[:, 'bin'] = pd.cut(
0.5 * (df2.bmin + df2.bmax),
bins=bins,
labels=False,
include_lowest=True,
)
# Merge on all relevant columns. Change how to 'inner' for an inner join
merged = pd.merge(df1, df2, on=["Z", "bin"], how='outer')
Sample of the output
X Z bin bmin bmax Y
0 1.0 F1 0 1.0 1.5 foo
1 1.3 F1 0 1.0 1.5 foo
2 1.4 F1 0 1.0 1.5 foo
3 1.1 F2 0 NaN NaN NaN
4 1.2 F2 0 NaN NaN NaN
merge
followed by query
Perhaps you'd be interested in DataFrame.query()
?
In the code below, I use query
on a merge of the dataframes on Z
. Note that the output data from this code differs from the one you write, but I don't see how
1 1.1 F2 bar
could result from your input data since you want both the bin and Z
to match? What I can see, there are no bins encapsulating 1.1 in df2
that also has Z=F2
. Apologies if I didn't understand you question.
Note that I renamed the columns for the bin limits in df2
as you can't use Python keywords in numexpr query.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})
df2 = pd.DataFrame({
'bmin': [1.0, 1.5, 1.8, 2.2, 2.6],
'bmax': [1.5, 1.8, 2.2, 2.6, 2.9],
'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo']
})
merged = pd.merge(
df1,
df2,
on='Z',
)
merged = merged.query('bmin <= X < bmax')
merged = merged.sort_values(by="X")[['X', 'Z', 'Y']]
Gives the output
X Z Y
0 1.0 F1 foo
3 1.3 F1 foo
6 1.4 F1 foo
10 1.7 F1 bar
50 1.8 F2 foobar
52 1.9 F2 foobar
20 2.2 F1 foo
23 2.3 F1 foo
26 2.4 F1 foo
29 2.5 F1 foo
Upvotes: 1