jerbear
jerbear

Reputation: 391

Match columns based on substring

Suppose you have three groups (A,B,C), and their components are outlined below:

 +-------+-----------+
 | Group | Component |
 +-------+-----------+
 | A     |        31 |
 | A     |       322 |
 | A     |       323 |
 | B     |       321 |
 | B     |       327 |
 | B     |        33 |
 | C     |        31 |
 | C     |        32 |
 | C     |        33 |
 +-------+-----------+

This could be stored in a dataframe or dictionary called 'groups'. Notice that there is overlap between the groups.

I have the following dataframe of values called 'df' (actual table is larger):

 +--------+----------+--------+
 | Sender | Receiver | Value  |
 +--------+----------+--------+
 | 312345 |   313452 | 1022.1 |
 | 320952 |   327901 | 4921.0 |
 +--------+----------+--------+

I'd like to somehow join the 'groups' table with 'df' so that I have two more columns, each showing what group the Sender and Receiver are associated with based on if the Sender/Receiver code contains the Component. I'd only want to see the cases where both groups are the same. The target table would look like this:

 +--------+----------+--------+--------------+----------------+
 | Sender | Receiver | Value  | Sender_Group | Receiver_Group |
 +--------+----------+--------+--------------+----------------+
 | 312345 |   313452 | 1022.1 | A            | A              |
 | 312345 |   313452 | 1022.1 | C            | C              |
 | 320952 |   327901 | 4921.0 | C            | C              |
 +--------+----------+--------+--------------+----------------+

Notice how even though 327901 is associated with Group B, 320952 isn't so it's not shown. The final goal is to sum up the values by each group.

I've tried something like this:

pat = "|".join(groups.Component.astype('str'))
df.insert(0, 'Sender_Group', df['Sender'].str.extract("(" + pat + ')', expand=False))
df.insert(1, 'Receiver_Group', df['Receiver'].str.extract("(" + pat + ')', expand=False))
new_df = df.query('Sender_Group == Receiver_Group')

However, the limitation with this is that each Sender/Receiver can only be associated with one group. I need a solution that allows them to be associated with multiple. Any ideas?

Upvotes: 2

Views: 222

Answers (1)

a_guest
a_guest

Reputation: 36289

You can use a column that represents all groups a sender (receiver) is part of as a list. Then you can expand this column into multiple rows as discussed for this question.

df['sender_group'] = df['sender'].apply(
    lambda x: groups['group'][groups['component'].apply(
        lambda y: str(y) in str(x)
    )].values
)

which gives

   sender  receiver   value sender_group
0  312345    313452  1022.1       [A, C]
1  320952    327901  4921.0          [C]

For the receiver the procedure is similar.

Expansion

For ways to expand the column please refer to this question. An example is given here:

indices = np.repeat(df.index.values, df['sender_group'].str.len())
df = df.loc[indices]\
    .assign(sender_group=np.concatenate(df['sender_group'].values))\
    .reset_index(drop=True)

which gives

   sender  receiver   value sender_group
0  312345    313452  1022.1            A
1  312345    313452  1022.1            C
2  320952    327901  4921.0            C

Variables used:

groups = pd.DataFrame(
    columns=['group', 'component'],
    data=[['A', 31],
          ['A', 322],
          ['A', 323],
          ['B', 321],
          ['B', 327],
          ['B', 33],
          ['C', 31],
          ['C', 32],
          ['C', 33],]
)

df = pd.DataFrame(
    columns=['sender', 'receiver', 'value'],
    data=[[312345, 313452, 1022.1],
          [320952, 327901, 4921.0]]
)

Upvotes: 1

Related Questions