Create new column based on number of different values in column grouped by another column

Question

I have a dataset that looks like this:

| ColumnA | ColumnB         |ColumnZ |
| --------| --------------  |--------|
| 1       | locationA       |324     |
| 1       | n.a.            |34      |
| 2       | n.a.            |21      |
| 2       | locationA       |n.a.    |
| 2       | locationA       |34      |
| 2       | n.a.            |12      |
| 3       | n.a.            |1       |
| 3       | locationB       |134     |
| 3       | n.a.            |n.a.    |
| 4       | n.a.            |134     |
| 4       | locationC       |n.a.    |
| 4       | locationD       |132     |
| 4       | locationD       |n.a.    |

I now want to add a new ColumnC, in which is stated "different locations", when more than 1 location is in ColumnB that belong to the same group (i.e. same number) in ColumnA. So my desired output is:

| ColumnA | ColumnB         | ColumnZ | ColumnC             |
| --------| --------------  | --------| -----------------   |
| 1       | locationA       | 324     |                     |
| 1       | n.a.            | 34      |                     |
| 2       | n.a.            | 21      |                     |
| 2       | locationA       | n.a.    |                     |
| 2       | locationA       | 34      |                     |
| 2       | n.a.            | 12      |                     |
| 3       | n.a.            | 1       |                     |
| 3       | locationB       | 134     |                     |
| 3       | n.a.            | n.a.    |                     |
| 4       | n.a.            | 134     | different locations |
| 4       | locationC       | n.a.    | different locations |
| 4       | locationD       | 132     | different locations |
| 4       | locationD       | n.a.    | different locations |

Therefore I've started with turning all n.a. values in ColumnB to NaN values:

df['ColumnB'] = df['ColumnB'].replace('n.a.', np.NaN)

and then I've tried it with this function:

def no_of_locations(group):
    if df['ColumnB'].nunique() > 1:
        df['ColumnC'] = 'different locations'
    pass

df.groupby('ColumnA').apply(no_of_locations)

Yet, the result is that it still counts all unique values in the whole ColumnB, not only in the group based on ColumnA. How can I restrict it on the respective group?

alparslan mimaroğlu · Accepted Answer

If only condition is having repeating ColumnA after dropping na values you can just count the number of ColumnA values and use it to mask and filter your original dataframe

mask = df['ColumnA'].isin((df.replace({'ColumnB' : {'n.a.': np.nan}})
                        .dropna(subset=['ColumnB'])
                        .groupby('ColumnA')['ColumnB'].nunique()
                        .loc[lambda x: x>1].index.values
)


df.loc[mask, 'ColumnC'] = 'different_locations'

Create new column based on number of different values in column grouped by another column

Answers (1)

Related Questions