vinsent paramanantham
vinsent paramanantham

Reputation: 951

check if the group id or element is found the column list in pandas data frame

data = {

    'org_id' :[4,73,6,77,21,36,40,22,21,30,31],
    'flag': [['4', '73'],['73'],['6', '77'],['77'],['21'],['36', '36'],['40'],['22', '41'],['21'],['22', '30'],['31', '31']],
    'r_id' : [4,4,6,6,20,20,20,22,28,28,28]


    }

df = pd.DataFrame.from_dict(data)
df

required data frame to be like below,

data = {

    'org_id' :[4,73,6,77,21,36,40,22,21,30,31],
    'flag': [['4', '73'],['73'],['6', '77'],['77'],['21'],['36', '36'],['40'],['22', '41'],['21'],['22', '30'],['31', '31']],
    'r_id' : [4,4,6,6,20,20,20,22,28,28,28],
    'is_foundin_org_id': ['yes','yes','yes','yes','NO','NO','NO','yes','NO','NO','NO']

    }

df2 = pd.DataFrame.from_dict(data)
df2

output data frame

Out[115]: 
    org_id      flag  r_id is_foundin_org_id
0        4   [4, 73]     4               yes
1       73      [73]     4               yes
2        6   [6, 77]     6               yes
3       77      [77]     6               yes
4       21      [21]    20                NO
5       36  [36, 36]    20                NO
6       40      [40]    20                NO
7       22  [22, 41]    22               yes
8       21      [21]    28                NO
9       30  [22, 30]    28                NO
10      31  [31, 31]    28                NO

Need to identify after grouping by r_id whether r_id is present in the grouped rows of r_id, eg. when I group by 4 is found in one of the rows of org_id, hence i mark yes for the group 4, similarly for 20 is not found in the org_id column, hence i mark No for all the 20s group. Thanking you.

Upvotes: 2

Views: 1001

Answers (3)

piRSquared
piRSquared

Reputation: 294488

Numpy and pandas.factorize

This may seem convoluted. But I'm using Numpy and keeping everything O(n)

Get arrays because I'll use them more than once

a = df.r_id.to_numpy()
b = df.org_id.to_numpy()

Factorizing something identifies each unique value with an integer starting with zero. pandas.factorize will return a tuple of (factorized_integer_representation, unique_values). What's great about the factorization is that I can use those integers as positions within the unique values array to reproduce the original array. Namely r[i] using the r and i below.

I could've also used numpy.unique with the argument return_inverse to get the same arrays, BUT pandas.factorize doesn't sort the unique values and that is an order of O(log(n)) that we can save by not using. For larger data, pandas.factorize is the winner.

I'll create a holding array that will house the boolean values on whether our any condition is satisfied for each unique value. numpy.logical_or.at is the function we use to see if any values from a == b is True within the specified indices in i.

I'll demonstrate after the code below.

i, r = pd.factorize(a)
o = np.zeros(len(r), bool)
np.logical_or.at(o, i, a == b)

df.assign(is_found=np.where(o, 'Yes', 'No')[i])

    org_id      flag  r_id is_found
0        4   [4, 73]     4      Yes
1       73      [73]     4      Yes
2        6   [6, 77]     6      Yes
3       77      [77]     6      Yes
4       21      [21]    20       No
5       36  [36, 36]    20       No
6       40      [40]    20       No
7       22  [22, 41]    22      Yes
8       21      [21]    28       No
9       30  [22, 30]    28       No
10      31  [31, 31]    28       No

Details

a == b

array([True, False, True, False, False, False, False, True, False, False, False])

r are the unique values

r

array([ 4,  6, 20, 22, 28])

i are the indices

i

array([0, 0, 1, 1, 2, 2, 2, 3, 4, 4, 4])

So r[i] reproduces a

r[i]

array([ 4,  4,  6,  6, 20, 20, 20, 22, 28, 28, 28])

Now we start with a base array o of all False, one for each unique value

array([False, False, False, False, False])

And for each position in i, we check if the corresponding value in a == b is True.

#  i, a == b ->  0,  True <4 == 4>
#                0, False <4 != 73>
#                     ↓    1,  True <6 == 6>
#                     ↓    1, False <6 != 77>
#                     ↓         ↓    2, False <20 != 21>
#                     ↓         ↓    2, False <20 != 36>
#                     ↓         ↓    2, False <20 != 40>
#                     ↓         ↓         ↓    3,  True <22 == 22>
#                     ↓         ↓         ↓         ↓    4, False <28 != 21>
#                     ↓         ↓         ↓         ↓    4, False <28 != 30>
#                     ↓         ↓         ↓         ↓    4, False <28 != 31>
#  At least 1 True    ↓         ↓         ↓         ↓         ↓
#  o         -> [    True,     True,    False,     True,    False]

Swap Yes|No instead of True|False

#  o                         -> [    True,     True,    False,     True,    False]
#  np.where(o, 'Yes', 'No')  -> [   'Yes',    'Yes',     'No',    'Yes',     'No']

And slice it with i to produce an array of the same length as original with the appropriate value for each corresponding value in the unique values array.

np.where(o, 'Yes', 'No')[i]

['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No']

Upvotes: 1

Andy L.
Andy L.

Reputation: 25259

Try this

d = {True: 'Yes', False: 'No'}
df['is_foundin_org_id'] = (df.org_id.eq(df.r_id).groupby(df.r_id)
                             .transform('max').map(d))

Out[1549]:
    org_id      flag  r_id is_foundin_org_id
0   4       [4, 73]   4     Yes
1   73      [73]      4     Yes
2   6       [6, 77]   6     Yes
3   77      [77]      6     Yes
4   21      [21]      20    No
5   36      [36, 36]  20    No
6   40      [40]      20    No
7   22      [22, 41]  22    Yes
8   21      [21]      28    No
9   30      [22, 30]  28    No
10  31      [31, 31]  28    No

Upvotes: 3

Quang Hoang
Quang Hoang

Reputation: 150785

IIUC,

df['is_found'] = np.where(df['org_id'].eq(df['r_id']) # check if the ids are equal
                              .groupby(df['r_id'])    # group by r_id
                              .transform('any'),      # if True occurs within the groups
                          'yes', 'no')

Output:

    org_id      flag  r_id is_found
0        4   [4, 73]     4      yes
1       73      [73]     4      yes
2        6   [6, 77]     6      yes
3       77      [77]     6      yes
4       21      [21]    20       no
5       36  [36, 36]    20       no
6       40      [40]    20       no
7       22  [22, 41]    22      yes
8       21      [21]    28       no
9       30  [22, 30]    28       no
10      31  [31, 31]    28       no

Upvotes: 3

Related Questions