Reputation: 105
I have a dataset that looks like this:
| ColumnA | ColumnB |ColumnZ |
| --------| -------------- |--------|
| 1 | locationA |324 |
| 1 | n.a. |34 |
| 2 | n.a. |21 |
| 2 | locationA |n.a. |
| 2 | locationA |34 |
| 2 | n.a. |12 |
| 3 | n.a. |1 |
| 3 | locationB |134 |
| 3 | n.a. |n.a. |
| 4 | n.a. |134 |
| 4 | locationC |n.a. |
| 4 | locationD |132 |
| 4 | locationD |n.a. |
I now want to add a new ColumnC
, in which is stated "different locations", when more than 1 location is in ColumnB
that belong to the same group (i.e. same number) in ColumnA
. So my desired output is:
| ColumnA | ColumnB | ColumnZ | ColumnC |
| --------| -------------- | --------| ----------------- |
| 1 | locationA | 324 | |
| 1 | n.a. | 34 | |
| 2 | n.a. | 21 | |
| 2 | locationA | n.a. | |
| 2 | locationA | 34 | |
| 2 | n.a. | 12 | |
| 3 | n.a. | 1 | |
| 3 | locationB | 134 | |
| 3 | n.a. | n.a. | |
| 4 | n.a. | 134 | different locations |
| 4 | locationC | n.a. | different locations |
| 4 | locationD | 132 | different locations |
| 4 | locationD | n.a. | different locations |
Therefore I've started with turning all n.a.
values in ColumnB
to NaN
values:
df['ColumnB'] = df['ColumnB'].replace('n.a.', np.NaN)
and then I've tried it with this function:
def no_of_locations(group):
if df['ColumnB'].nunique() > 1:
df['ColumnC'] = 'different locations'
pass
df.groupby('ColumnA').apply(no_of_locations)
Yet, the result is that it still counts all unique values in the whole ColumnB
, not only in the group based on ColumnA
. How can I restrict it on the respective group?
Upvotes: 2
Views: 72
Reputation: 1490
If only condition is having repeating ColumnA after dropping na values you can just count the number of ColumnA values and use it to mask and filter your original dataframe
mask = df['ColumnA'].isin((df.replace({'ColumnB' : {'n.a.': np.nan}})
.dropna(subset=['ColumnB'])
.groupby('ColumnA')['ColumnB'].nunique()
.loc[lambda x: x>1].index.values
)
df.loc[mask, 'ColumnC'] = 'different_locations'
Upvotes: 1