Reputation: 199
I have a dataframe that looks like this below.
Name M_Name Tag
John 1
Dave a 1
Mary 1
Sam 1
Chris a 1
John 2
Nola f 2
Chuck 2
Rob 2
Chris a 2
Angie 3
Joe 3
: : :
Tony n
I need to create a new column Tag_2, which is based on Tag column and M_Name column. First, I need to groupby Tag column then if M_Name column have more than 1 different letter, the whole Tag group will have a new value "Invalid" under Tag_2 column. If it only has 1 letter in the M-Name for the same Tag group then it will be "Valid". Here is what the output should look like.
Name M_Name Tag Tag_2
John 1 Valid
Dave a 1 Valid
Mary 1 Valid
Sam 1 Valid
Chris a 1 Valid
John 2 Invalid
Nola f 2 Invalid
Chuck 2 Invalid
Rob 2 Invalid
Chris a 2 Invalid
Angie 3 Valid
Joe 3 Valid
: : :
Tony n Valid
Tag 2 group is "Invalid" because the M_Name rows has more than 1 different letter (f & a) for the tag 2 group.
I am thinking to use groupby to solve this task but I do not know how to add a condition like this. df.groupby('Tag')['M_Name']
I am wondering am I on the right path using groupby or are ther any better approach that I should consider?
Thanks!
Upvotes: 1
Views: 1410
Reputation: 7627
First, we group()
, then we transform()
the M_Name
column, where in lambda
we filter out empty values x[x.ne('')]
, then use value_counts()
to get the elements frequency, and based on the number of elements, we determine the validity:
df = pd.DataFrame(
{'Name': ['John', 'Dave', 'Mary', 'Sam', 'Chris', 'John', 'Nola', 'Chuck', 'Rob', 'Chris', 'Angie', 'Joe'],
'M_Name': ['', 'a', '', '', 'a', '', 'f', '', '', 'a', '', ''], 'Tag': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3]})
df['Tag2'] = df.groupby('Tag').M_Name.transform(
lambda x: 'Invalid' if x[x.ne('')].value_counts().shape[0] > 1 else 'Valid')
print(df)
Output:
Name M_Name Tag Tag2
0 John 1 Valid
1 Dave a 1 Valid
2 Mary 1 Valid
3 Sam 1 Valid
4 Chris a 1 Valid
5 John 2 Invalid
6 Nola f 2 Invalid
7 Chuck 2 Invalid
8 Rob 2 Invalid
9 Chris a 2 Invalid
10 Angie 3 Valid
11 Joe 3 Valid
Upvotes: 2
Reputation: 8952
I'm sure there's a nicer way, but here's one solution:
def validate(s):
return "Invalid" if len(set(s)) > 2 else "Valid"
mapping = df.groupby("Tag")["M_Name"].apply(validate)
df["Tag_2"] = df.loc[:, "Tag"].replace(mapping)
Output:
Name M_Name Tag Tag_2
0 John 1 Valid
1 Dave a 1 Valid
2 Mary 1 Valid
3 Sam 1 Valid
4 Chris a 1 Valid
5 John 2 Invalid
6 Nola f 2 Invalid
7 Chuck 2 Invalid
8 Rob 2 Invalid
9 Chris a 2 Invalid
10 Angie 3 Valid
11 Joe 3 Valid
Upvotes: 3