zillur rahman
zillur rahman

Reputation: 395

np.where, multiple and or statement, two data frames

I have two different data frames (nv1 and db1). I want to create a new column in nv1 named Novel_in_Database with value "Yes" or "No" based on multiple conditions.

The conditions are: The value of column MinBP of nv1 is greater than the value of column min_bp of db1 and smaller than max_bp of db1. Or The value of column MaxBP of nv1 is greater than the value of min_bp of db1 and smaller than max_bp of db1. Did I clarify the conditions? Here are my data frames.

db1
min_bp  max_bp
11  22
20  30
38  52

And

nv1
    MinBP   MaxBP
    10  20
    15  25
    36  50
    60  80
    85  96

The new nv1 would be:

MinBP   MaxBP   Novel_in_Database
10  20  No
15  25  No
36  50  Yes
60  80  Yes
85  96  Yes

So far, I have tried as follows,

nv1['Novel_in_Database']=np.where((((nv1.MinBP >= db1.min_bp) & (nv1.MinBP <= db1.max_bp)) |
                                   ((nv1.MaxBP <= db1.max_bp) & (nv1.MaxBP >= db1.min_bp))), 'No', 'Yes')

But it is giving me error: ValueError: Can only compare identically-labeled Series objects. These two data frames are of different shapes. Any help?

Upvotes: 0

Views: 607

Answers (2)

Code Different
Code Different

Reputation: 93171

You can use numpy's array broadcasting to all elements in an array against all elements in another array:

a = nv1['MinBP'].to_numpy()
b = nv1['MaxBP'].to_numpy()

# Raise the columns in db1 by a dimension to enable numpy's broadcasting
c = db1['min_bp'].to_numpy()[:, None]
d = db1['max_bp'].to_numpy()[:, None]

result = ((a > c) & (a < d)) | ((b > c) & (b < d))
nv1['Novel_in_Database'] = np.where(result.any(axis=0), 'No', 'Yes')

Upvotes: 1

Corralien
Corralien

Reputation: 120439

Use merge with how='cross' to create a new dataframe that contains all combinations of (nv1, db1):

df = pd.merge(nv1.reset_index(), db1, how='cross')

Here is the output:

>>> df
    index  MinBP  MaxBP  min_bp  max_bp
0       0     10     20      11      22
1       0     10     20      20      30
2       0     10     20      38      52
3       1     15     25      11      22
4       1     15     25      20      30
5       1     15     25      38      52
6       2     36     50      11      22
7       2     36     50      20      30
8       2     36     50      38      52
9       3     60     80      11      22
10      3     60     80      20      30
11      3     60     80      38      52
12      4     85     96      11      22
13      4     85     96      20      30
14      4     85     96      38      52

Now check your conditions:

df['Novel_in_Database'] = df['MinBP'].between(df['min_bp'], df['max_bp']) \
                        | df['MaxBP'].between(df['min_bp'], df['max_bp'])

For each group from nv1, find all new rows:

nv1['Novel_in_Database'] = (~df.groupby('index')['Novel_in_Database'].any()) \
                           .replace({True: 'yes', False: 'no'})
>>> nv1
   MinBP  MaxBP Novel_in_Database
0     10     20                no
1     15     25                no
2     36     50                no
3     60     80               yes
4     85     96               yes

Upvotes: 3

Related Questions