shsh

Reputation: 727

Python Categorize Dataframe Column Conditionally Using Regular Expression

I have a dataframe:

group   id
A       009x
A       010x
B       009x
B       002x
C       002x
C       003x

How do I create a new column, new, that categorizes each row according to the following three conditions, evaluated per group?

  1. If all id values in the group are either 009x or 010x, categorize as g1
  2. If at least one id value in the group is 009x or 010x AND at least one other id value is not, categorize as g2
  3. Otherwise, use the id value itself

Desired result:

group   id      new
A       009x    g1
A       010x    g1
B       009x    g2
B       002x    g2
C       002x    002x
C       003x    003x

Code to build the dataframe:

import pandas as pd

data = {
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'id': ['009x', '010x', '009x', '002x', '002x', '003x'],
}
df = pd.DataFrame(data)
df

Upvotes: 1

Views: 53

Answers (1)

Andrej Kesely

Reputation: 195553

I hope I've understood your question right. You can use .groupby() + custom function:

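def categorize_fn(x):
    # True for each row of the group whose id is one of the two target values
    tmp = x["id"].isin(["009x", "010x"])

    if tmp.all():
        # every id in the group is a target value
        x["new"] = "g1"
    elif tmp.any():
        # the group mixes target and non-target ids
        x["new"] = "g2"
    else:
        # no target ids in the group: keep each id as-is
        x["new"] = x["id"]

    return x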


df = df.groupby("group", group_keys=False).apply(categorize_fn)
print(df)

Prints:

  group    id   new
0     A  009x    g1
1     A  010x    g1
2     B  009x    g2
3     B  002x    g2
4     C  002x  002x
5     C  003x  003x
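
As a possible alternative, the same three conditions can be sketched without a custom function by broadcasting per-group "all" and "any" flags back to the rows with .groupby().transform(). This is only a sketch under the assumption that the data looks like the sample in the question; the names is_target, all_target and any_target are illustrative and not part of the original answer. It may be worth considering on newer pandas releases, where .apply() operating on the grouping column is deprecated.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C", "C"],
    "id": ["009x", "010x", "009x", "002x", "002x", "003x"],
})

# True for rows whose id is one of the two target values
is_target = df["id"].isin(["009x", "010x"])

# Per-group flags, repeated on every row of the group
all_target = is_target.groupby(df["group"]).transform("all")
any_target = is_target.groupby(df["group"]).transform("any")

# Start from the id itself, overwrite with g2 where the group has any
# target id, then with g1 where the whole group consists of target ids
df["new"] = df["id"].mask(any_target, "g2").mask(all_target, "g1")
print(df)
```

Since the title mentions regular expressions: if the ids have to be matched by pattern rather than by a fixed list, is_target could presumably be built with df["id"].str.fullmatch(r"009x|010x") instead of .isin(); the rest of the sketch would stay the same.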

Upvotes: 1
