Reputation: 395
I have two different data frames (nv1 and db1
). I want to create a new column in nv1
named Novel_in_Database
with value "Yes"
or "No"
based on multiple conditions.
The conditions are: The value of column MinBP
of nv1
is greater than the value of column min_bp
of db1
and smaller than max_bp
of db1
. Or The value of column MaxBP
of nv1
is greater than the value of min_bp
of db1
and smaller than max_bp
of db1
. Did I clarify the conditions?
Here are my data frames.
db1
min_bp max_bp
11 22
20 30
38 52
And
nv1
MinBP MaxBP
10 20
15 25
36 50
60 80
85 96
The new nv1 would be:
MinBP MaxBP Novel_in_Database
10 20 No
15 25 No
36 50 Yes
60 80 Yes
85 96 Yes
So far, I have tried as follows,
nv1['Novel_in_Database']=np.where((((nv1.MinBP >= db1.min_bp) & (nv1.MinBP <= db1.max_bp)) |
((nv1.MaxBP <= db1.max_bp) & (nv1.MaxBP >= db1.min_bp))), 'No', 'Yes')
But it is giving me error: ValueError: Can only compare identically-labeled Series objects
. These two data frames are of different shapes. Any help?
Upvotes: 0
Views: 607
Reputation: 93171
You can use numpy's array broadcasting to all elements in an array against all elements in another array:
a = nv1['MinBP'].to_numpy()
b = nv1['MaxBP'].to_numpy()
# Raise the columns in db1 by a dimension to enable numpy's broadcasting
c = db1['min_bp'].to_numpy()[:, None]
d = db1['max_bp'].to_numpy()[:, None]
result = ((a > c) & (a < d)) | ((b > c) & (b < d))
nv1['Novel_in_Database'] = np.where(result.any(axis=0), 'No', 'Yes')
Upvotes: 1
Reputation: 120439
Use merge
with how='cross'
to create a new dataframe that contains all combinations of (nv1, db1)
:
df = pd.merge(nv1.reset_index(), db1, how='cross')
Here is the output:
>>> df
index MinBP MaxBP min_bp max_bp
0 0 10 20 11 22
1 0 10 20 20 30
2 0 10 20 38 52
3 1 15 25 11 22
4 1 15 25 20 30
5 1 15 25 38 52
6 2 36 50 11 22
7 2 36 50 20 30
8 2 36 50 38 52
9 3 60 80 11 22
10 3 60 80 20 30
11 3 60 80 38 52
12 4 85 96 11 22
13 4 85 96 20 30
14 4 85 96 38 52
Now check your conditions:
df['Novel_in_Database'] = df['MinBP'].between(df['min_bp'], df['max_bp']) \
| df['MaxBP'].between(df['min_bp'], df['max_bp'])
For each group from nv1
, find all new rows:
nv1['Novel_in_Database'] = (~df.groupby('index')['Novel_in_Database'].any()) \
.replace({True: 'yes', False: 'no'})
>>> nv1
MinBP MaxBP Novel_in_Database
0 10 20 no
1 15 25 no
2 36 50 no
3 60 80 yes
4 85 96 yes
Upvotes: 3