Marc
Marc

Reputation: 279

Apply a function with two dataframe as argument

I'm looking the way to run function that take two dataframes as arguments df1 and df2.

What I want is to create a new column in df1 from the information in df2 without using a loop because my entire df1 is 3M rows and df2 700k rows. For that I compare if the data the value of X of df1 is included in the from and the to of df2

I tried with apply of pandas library but I got errors like:

ValueError: Can only compare identically-labeled Series objects

Here is the sample of my code.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
                         2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
                    'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
                         'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})

df2 = pd.DataFrame({
           'from': [1.0, 1.5, 1.8, 2.2, 2.6],
           'to': [1.5, 1.8, 2.2, 2.6, 2.9],
           'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
           'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo']
})
def asign(df1, df2):
    if df1['Z'] == df2['Z']:
        idx = np.where((df1[X] >= df2['from']) & (df1[X]<= df2['to']))[0]
        df1['Y'] = df2['Y'][idx]
        return df1

df1.groupby('Z').apply(asign, df2)

The output must be like:

>>> df1
out[0] : 
    X    Z   Y
0   1.0  F1  foo
1   1.1  F2  bar
2   1.2  F2  foobar
3   1.3  F1  foo
4   1.4  F1  foobar
5   1.5  F2  bar
6   1.6  F1  foo
7   1.7  F2  bar

The value of the column Y to be created in df1 is conditioned by the fact that the row belongs to the group Z either F1 or F2 and that the value of X is greater or equal to from and less than to Please can you help me to manage this? Thank you

Upvotes: 1

Views: 305

Answers (1)

emarcus
emarcus

Reputation: 86

Better solution using pd.cut()

The old solution below works well, but it might not be very efficient as it first creates a large data frame and then selects a subset of rows from it. This solution instead creates bins using pd.cut and then merges the dataframes, directly creating the desired output.

In addition, this gives additional flexibility on how to make the merge.

df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
                         2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
                    'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
                         'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})

df2 = pd.DataFrame({
           'bmin': [1.0,   1.5,   1.8,      2.2,   2.6],
           'bmax': [1.5,   1.8,   2.2,      2.6,   2.9],
           'Z':    ['F1',  'F1',  'F2',     'F1',  'F2'],
           'Y':    ['foo', 'bar', 'foobar', 'foo', 'zoo']
})


# Adding new column to the dataframes
bins = sorted(df2.bmin.unique()) + [df2.bmax.max()]

df1.loc[:, 'bin'] = pd.cut(
    df1.X,
    bins=bins,
    labels=False,        # Makes cut return int indices for the bins
    include_lowest=True, # Otherwise 1.0 would be NaN
)
df2.loc[:, 'bin'] = pd.cut(
    0.5 * (df2.bmin + df2.bmax),
    bins=bins,
    labels=False,
    include_lowest=True,
)

# Merge on all relevant columns. Change how to 'inner' for an inner join
merged = pd.merge(df1, df2, on=["Z", "bin"], how='outer')

Sample of the output

      X   Z  bin  bmin  bmax       Y
0   1.0  F1    0   1.0   1.5     foo
1   1.3  F1    0   1.0   1.5     foo
2   1.4  F1    0   1.0   1.5     foo
3   1.1  F2    0   NaN   NaN     NaN
4   1.2  F2    0   NaN   NaN     NaN

Old solution using merge followed by query

Perhaps you'd be interested in DataFrame.query()?

In the code below, I use query on a merge of the dataframes on Z. Note that the output data from this code differs from the one you write, but I don't see how

1   1.1  F2  bar

could result from your input data since you want both the bin and Z to match? What I can see, there are no bins encapsulating 1.1 in df2 that also has Z=F2. Apologies if I didn't understand you question.

Note that I renamed the columns for the bin limits in df2 as you can't use Python keywords in numexpr query.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
                         2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
                    'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
                         'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})

df2 = pd.DataFrame({
           'bmin': [1.0,   1.5,   1.8,      2.2,   2.6],
           'bmax': [1.5,   1.8,   2.2,      2.6,   2.9],
           'Z':    ['F1',  'F1',  'F2',     'F1',  'F2'],
           'Y':    ['foo', 'bar', 'foobar', 'foo', 'zoo']
})

merged = pd.merge(
    df1, 
    df2,
    on='Z',
)
merged = merged.query('bmin <= X < bmax')
merged = merged.sort_values(by="X")[['X', 'Z', 'Y']]

Gives the output

      X   Z       Y
0   1.0  F1     foo
3   1.3  F1     foo
6   1.4  F1     foo
10  1.7  F1     bar
50  1.8  F2  foobar
52  1.9  F2  foobar
20  2.2  F1     foo
23  2.3  F1     foo
26  2.4  F1     foo
29  2.5  F1     foo

Upvotes: 1

Related Questions