Reputation: 391
Suppose you have three groups (A,B,C), and their components are outlined below:
+-------+-----------+
| Group | Component |
+-------+-----------+
| A | 31 |
| A | 322 |
| A | 323 |
| B | 321 |
| B | 327 |
| B | 33 |
| C | 31 |
| C | 32 |
| C | 33 |
+-------+-----------+
This could be stored in a dataframe or dictionary called 'groups'. Notice that there is overlap between the groups.
I have the following dataframe of values called 'df' (actual table is larger):
+--------+----------+--------+
| Sender | Receiver | Value |
+--------+----------+--------+
| 312345 | 313452 | 1022.1 |
| 320952 | 327901 | 4921.0 |
+--------+----------+--------+
I'd like to somehow join the 'groups' table with 'df' so that I have two more columns, each showing what group the Sender and Receiver are associated with based on if the Sender/Receiver code contains the Component. I'd only want to see the cases where both groups are the same. The target table would look like this:
+--------+----------+--------+--------------+----------------+
| Sender | Receiver | Value | Sender_Group | Receiver_Group |
+--------+----------+--------+--------------+----------------+
| 312345 | 313452 | 1022.1 | A | A |
| 312345 | 313452 | 1022.1 | C | C |
| 320952 | 327901 | 4921.0 | C | C |
+--------+----------+--------+--------------+----------------+
Notice how even though 327901 is associated with Group B, 320952 isn't so it's not shown. The final goal is to sum up the values by each group.
I've tried something like this:
pat = "|".join(groups.Component.astype('str'))
df.insert(0, 'Sender_Group', df['Sender'].str.extract("(" + pat + ')', expand=False))
df.insert(1, 'Receiver_Group', df['Receiver'].str.extract("(" + pat + ')', expand=False))
new_df = df.query('Sender_Group == Receiver_Group')
However, the limitation with this is that each Sender/Receiver can only be associated with one group. I need a solution that allows them to be associated with multiple. Any ideas?
Upvotes: 2
Views: 222
Reputation: 36289
You can use a column that represents all groups a sender (receiver) is part of as a list
. Then you can expand this column into multiple rows as discussed for this question.
df['sender_group'] = df['sender'].apply(
lambda x: groups['group'][groups['component'].apply(
lambda y: str(y) in str(x)
)].values
)
which gives
sender receiver value sender_group
0 312345 313452 1022.1 [A, C]
1 320952 327901 4921.0 [C]
For the receiver the procedure is similar.
For ways to expand the column please refer to this question. An example is given here:
indices = np.repeat(df.index.values, df['sender_group'].str.len())
df = df.loc[indices]\
.assign(sender_group=np.concatenate(df['sender_group'].values))\
.reset_index(drop=True)
which gives
sender receiver value sender_group
0 312345 313452 1022.1 A
1 312345 313452 1022.1 C
2 320952 327901 4921.0 C
Variables used:
groups = pd.DataFrame(
columns=['group', 'component'],
data=[['A', 31],
['A', 322],
['A', 323],
['B', 321],
['B', 327],
['B', 33],
['C', 31],
['C', 32],
['C', 33],]
)
df = pd.DataFrame(
columns=['sender', 'receiver', 'value'],
data=[[312345, 313452, 1022.1],
[320952, 327901, 4921.0]]
)
Upvotes: 1