pandas Dataframe: create new columns of labels based on other columns

Question

I have this example pandas.DataFrame with +20K rows, in the following form:

import pandas as pd
import numpy as np

data = {"first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
        "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]}

df = pd.DataFrame(data)

>>> df
   first_column  second_column
0             A              0
1             B              1
2             B              1
3             B              1
4             C              0
5             A              0
6             A              0
7             A              1
8             D              1
9             B              1
10            A              1
11            A              0
....

The column first_column has for each row A, B, C, and D. In the second column, there is a binary label denoting a group of values. All consecutive groupings of 1's are a unique "group", e.g. rows 1-3 is one group, rows 7-10 is another group.

I would like to "label" each one of these groups by being either "AB" (the group is only composed of A or B), "CD" (the group is only composed of C or D), or "mixed" (if there is a mixture, e.g. all B and one C). It would also be useful to know "how" mixed some of these groupings are with a percentage, i.e. the percentage of AB's out of total labels. So, if it is only A or B, the identity should be AB. If it is only C or D, the identity should be CD. It is a mixture of A,B,C, and/or D, then it is mixed. The percentage is (# of AB rows)/(# of total rows)

Here is how the resulting DataFrame would look:

>>> df
   first_column  second_column    identity    percent
0             A              0           0          0
1             B              1          AB        1.0
2             B              1          AB        1.0
3             B              1          AB        1.0
4             C              0           0          0
5             A              0           0          0
6             A              0           0          0
7             A              1       mixed       0.75  # 3/4, 3-AB, 4-total
8             D              1       mixed       0.75
9             B              1       mixed       0.75
10            A              1       mixed       0.75
11            A              0           0          0 
....

My initial thought would be to first try to use df.loc() with

if (df.first_column == "A" | df.first_column == "B"):
    df.loc[df.second_column == 1, "identity"] = "AB"
if (df.first_column == "C" | df.first_column == "D"):
    df.loc[df.second_column == 1, "identity"] = "CD"

but this doesn't take into account mixtures, nor does it work with isolated groupings.

Stephen Rauch · Accepted Answer

Here is one way to do that.

Code:

import pandas as pd

from collections import Counter
a_b = set('AB')
c_d = set('CD')

def get_id_percent(group):
    present = Counter(group['first_column'])
    present_set = set(present.keys())

    if group['second_column'].iloc[0] == 0:
        ret_val = 0, 0
    elif present_set.issubset(a_b) and len(present_set) == 1:
        ret_val = 'AB', 0
    elif present_set.issubset(c_d) and len(present_set) == 1:
        ret_val = 'CD', 0
    else:
        ret_val = 'mixed', \
               float(present['A'] + present['B']) / len(group)

    return pd.DataFrame(
        [ret_val] * len(group), columns=['identity', 'percent'])

Test Code:

data = {"first_column": ["A", "B", "B", "B", "C", "A", "A",
                         "A", "D", "B", "A", "A"],
        "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]}

df = pd.DataFrame(data)

groupby = df.groupby((df.second_column != df.second_column.shift()).cumsum())

results = groupby.apply(get_id_percent).reset_index()
results = results.drop(['second_column', 'level_1'], axis=1)
df = pd.concat([df, results], axis=1)
print(df)

Results:

   first_column  second_column identity  percent
0             A              0        0     0.00
1             B              1       AB     0.00
2             B              1       AB     0.00
3             B              1       AB     0.00
4             C              0        0     0.00
5             A              0        0     0.00
6             A              0        0     0.00
7             A              1    mixed     0.75
8             D              1    mixed     0.75
9             B              1    mixed     0.75
10            A              1    mixed     0.75
11            A              0        0     0.00

pandas Dataframe: create new columns of labels based on other columns

Answers (2)

Related Questions