Reputation: 951
data = {
'org_id' :[4,73,6,77,21,36,40,22,21,30,31],
'flag': [['4', '73'],['73'],['6', '77'],['77'],['21'],['36', '36'],['40'],['22', '41'],['21'],['22', '30'],['31', '31']],
'r_id' : [4,4,6,6,20,20,20,22,28,28,28]
}
df = pd.DataFrame.from_dict(data)
df
required data frame to be like below,
data = {
'org_id' :[4,73,6,77,21,36,40,22,21,30,31],
'flag': [['4', '73'],['73'],['6', '77'],['77'],['21'],['36', '36'],['40'],['22', '41'],['21'],['22', '30'],['31', '31']],
'r_id' : [4,4,6,6,20,20,20,22,28,28,28],
'is_foundin_org_id': ['yes','yes','yes','yes','NO','NO','NO','yes','NO','NO','NO']
}
df2 = pd.DataFrame.from_dict(data)
df2
output data frame
Out[115]:
org_id flag r_id is_foundin_org_id
0 4 [4, 73] 4 yes
1 73 [73] 4 yes
2 6 [6, 77] 6 yes
3 77 [77] 6 yes
4 21 [21] 20 NO
5 36 [36, 36] 20 NO
6 40 [40] 20 NO
7 22 [22, 41] 22 yes
8 21 [21] 28 NO
9 30 [22, 30] 28 NO
10 31 [31, 31] 28 NO
Need to identify after grouping by r_id whether r_id is present in the grouped rows of r_id, eg. when I group by 4 is found in one of the rows of org_id, hence i mark yes for the group 4, similarly for 20 is not found in the org_id column, hence i mark No for all the 20s group. Thanking you.
Upvotes: 2
Views: 1001
Reputation: 294488
Numpy
and pandas.factorize
This may seem convoluted. But I'm using Numpy
and keeping everything O(n)
Get arrays because I'll use them more than once
a = df.r_id.to_numpy()
b = df.org_id.to_numpy()
Factorizing something identifies each unique value with an integer starting with zero. pandas.factorize
will return a tuple of (factorized_integer_representation, unique_values)
. What's great about the factorization is that I can use those integers as positions within the unique values array to reproduce the original array. Namely r[i]
using the r
and i
below.
I could've also used numpy.unique
with the argument return_inverse
to get the same arrays, BUT pandas.factorize
doesn't sort the unique values and that is an order of O(log(n)) that we can save by not using. For larger data, pandas.factorize
is the winner.
I'll create a holding array that will house the boolean values on whether our any condition is satisfied for each unique value. numpy.logical_or.at
is the function we use to see if any values from a == b
is True
within the specified indices in i
.
I'll demonstrate after the code below.
i, r = pd.factorize(a)
o = np.zeros(len(r), bool)
np.logical_or.at(o, i, a == b)
df.assign(is_found=np.where(o, 'Yes', 'No')[i])
org_id flag r_id is_found
0 4 [4, 73] 4 Yes
1 73 [73] 4 Yes
2 6 [6, 77] 6 Yes
3 77 [77] 6 Yes
4 21 [21] 20 No
5 36 [36, 36] 20 No
6 40 [40] 20 No
7 22 [22, 41] 22 Yes
8 21 [21] 28 No
9 30 [22, 30] 28 No
10 31 [31, 31] 28 No
a == b
array([True, False, True, False, False, False, False, True, False, False, False])
r
are the unique values
r
array([ 4, 6, 20, 22, 28])
i
are the indices
i
array([0, 0, 1, 1, 2, 2, 2, 3, 4, 4, 4])
So r[i]
reproduces a
r[i]
array([ 4, 4, 6, 6, 20, 20, 20, 22, 28, 28, 28])
Now we start with a base array o
of all False, one for each unique value
array([False, False, False, False, False])
And for each position in i
, we check if the corresponding value in a == b
is True
.
# i, a == b -> 0, True <4 == 4>
# 0, False <4 != 73>
# ↓ 1, True <6 == 6>
# ↓ 1, False <6 != 77>
# ↓ ↓ 2, False <20 != 21>
# ↓ ↓ 2, False <20 != 36>
# ↓ ↓ 2, False <20 != 40>
# ↓ ↓ ↓ 3, True <22 == 22>
# ↓ ↓ ↓ ↓ 4, False <28 != 21>
# ↓ ↓ ↓ ↓ 4, False <28 != 30>
# ↓ ↓ ↓ ↓ 4, False <28 != 31>
# At least 1 True ↓ ↓ ↓ ↓ ↓
# o -> [ True, True, False, True, False]
Swap Yes
|No
instead of True
|False
# o -> [ True, True, False, True, False]
# np.where(o, 'Yes', 'No') -> [ 'Yes', 'Yes', 'No', 'Yes', 'No']
And slice it with i
to produce an array of the same length as original with the appropriate value for each corresponding value in the unique values array.
np.where(o, 'Yes', 'No')[i]
['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No']
Upvotes: 1
Reputation: 25259
Try this
d = {True: 'Yes', False: 'No'}
df['is_foundin_org_id'] = (df.org_id.eq(df.r_id).groupby(df.r_id)
.transform('max').map(d))
Out[1549]:
org_id flag r_id is_foundin_org_id
0 4 [4, 73] 4 Yes
1 73 [73] 4 Yes
2 6 [6, 77] 6 Yes
3 77 [77] 6 Yes
4 21 [21] 20 No
5 36 [36, 36] 20 No
6 40 [40] 20 No
7 22 [22, 41] 22 Yes
8 21 [21] 28 No
9 30 [22, 30] 28 No
10 31 [31, 31] 28 No
Upvotes: 3
Reputation: 150785
IIUC,
df['is_found'] = np.where(df['org_id'].eq(df['r_id']) # check if the ids are equal
.groupby(df['r_id']) # group by r_id
.transform('any'), # if True occurs within the groups
'yes', 'no')
Output:
org_id flag r_id is_found
0 4 [4, 73] 4 yes
1 73 [73] 4 yes
2 6 [6, 77] 6 yes
3 77 [77] 6 yes
4 21 [21] 20 no
5 36 [36, 36] 20 no
6 40 [40] 20 no
7 22 [22, 41] 22 yes
8 21 [21] 28 no
9 30 [22, 30] 28 no
10 31 [31, 31] 28 no
Upvotes: 3