Reputation: 83
The following code works for classifying a single message:
total_frame.loc[total_frame['Message'].str.contains('rrc', case=False), 'dummy_message'] = 'msg1'
index Message
0 rrc
1 as1
2 as1
3 a2
4 as2
5 a2
However, if I want to classify all the messages in the Message column, I want to use something like this:
total_frame['dummy_message'][total_frame['Message'].str.contains(['rrc','as1','as2','a2'],case = False)] = 'msg1','msg2','msg3','msg4'
This doesn't work because str.contains doesn't accept a list. The output should look something like this:
index Message dummy_message
0 rrc msg1
1 as1 msg2
2 as1 msg2
3 a2 msg4
4 as2 msg3
5 a2 msg4
Is there any alternative?
Upvotes: 0
Views: 36
Reputation: 402553
Initialise a mapping of substrings to categories, then use str.extract to extract the matching substring, and map to classify it:
mapping = dict(zip(
    ['rrc', 'as1', 'as2', 'a2'],
    ['msg1', 'msg2', 'msg3', 'msg4']))
df['category'] = (
    df['Message'].str.extract(r'(?i)({})'.format('|'.join(mapping)), expand=False)
      .str.lower()  # (?i) can capture e.g. 'RRC'; the mapping keys are lower-case
      .map(mapping))
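The (?i) prefix makes the regex case-insensitive; drop it if case matters. Note that a case-insensitive match returns the text as it appears in the message (for example 'RRC'), so lower-case the extracted values with .str.lower() before calling map, otherwise they will not be found among the lower-case keys and will map to NaN.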
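To illustrate the case-insensitivity point, here is a small sketch. The sample messages are made up for illustration, and re.escape is an extra precaution I am assuming you want in case a key ever contains regex metacharacters.

```python
import re
import pandas as pd

mapping = {'rrc': 'msg1', 'as1': 'msg2', 'as2': 'msg3', 'a2': 'msg4'}

# Made-up sample data with mixed case and one message that matches nothing.
messages = pd.Series(['RRC setup', 'this is As1', 'xyz a2', 'no match'])

# Escape each key so it is matched literally, then join into one alternation.
pattern = '({})'.format('|'.join(map(re.escape, mapping)))

# flags=re.IGNORECASE has the same effect as prefixing the pattern with (?i).
extracted = messages.str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Mapping the raw matches misses 'RRC' and 'As1' (keys are lower-case).
without_lower = extracted.map(mapping)

# Lower-casing first lets every matched key be looked up.
with_lower = extracted.str.lower().map(mapping)
```

Rows with no match stay NaN in both results; only the lower-cased version labels the 'RRC' and 'As1' rows.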
Minimal Code Example
import pandas as pd
df = pd.DataFrame({'Message': ['this is as1', 'abcd rrc', 'xyz as2']})
df
Message
0 this is as1
1 abcd rrc
2 xyz as2
df['category'] = (
    df['Message'].str.extract(r'({})'.format('|'.join(mapping)), expand=False)
      .map(mapping))
df
Message category
0 this is as1 msg2
1 abcd rrc msg1
2 xyz as2 msg3
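Applied to the frame from the question, the same idea could be wrapped up roughly like this. This is only a sketch: classify_messages and its default parameter are hypothetical names of my own, the keys are assumed to be lower-case, and the longest-first ordering is a precaution for keys that share a prefix rather than something this data needs.

```python
import re
import pandas as pd

def classify_messages(messages, mapping, default=None):
    """Hypothetical helper: label each message by the leftmost mapping key found in it."""
    # Try longer keys first so a short key cannot shadow a longer one
    # that starts at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = '({})'.format('|'.join(re.escape(k) for k in keys))
    found = messages.str.extract(pattern, flags=re.IGNORECASE, expand=False)
    # Keys are assumed lower-case, so normalise the match before the lookup.
    labels = found.str.lower().map(mapping)
    return labels if default is None else labels.fillna(default)

total_frame = pd.DataFrame({'Message': ['rrc', 'as1', 'as1', 'a2', 'as2', 'a2']})
mapping = {'rrc': 'msg1', 'as1': 'msg2', 'as2': 'msg3', 'a2': 'msg4'}
total_frame['dummy_message'] = classify_messages(total_frame['Message'], mapping)
```

Passing default='other' (or any label you like) fills in rows where none of the keys occur; leaving it as None keeps them as NaN.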
Upvotes: 1