Reputation: 55
I'm kinda new to python dataframes, so this might sound really easy. I have a column called 'body_text' in a dataframe and I want to see if each row of body_text contains the word "Hello". And if it does, I want to make another column that has 1 or 0 as the values.
I tried using str.contains("Hello"), but that went wrong: it only selected the rows that had "Hello" and tried to put those in another column.
I tried looking at other solutions that just ended up in more errors - for loops, and str in str.
textdf = traindf[['request_title','request_text_edit_aware']]
traindf is a huge dataframe that I'm only pulling 2 columns from
Upvotes: 1
Views: 3677
Reputation: 18647
If your match is case-sensitive, use Series.str.contains and chain on .astype to cast as int:
df['contains_hello'] = df['body_text'].str.contains('Hello').astype(int)
If it should match case-insensitively, add the case=False argument:
df['contains_hello'] = df['body_text'].str.contains('Hello', case=False).astype(int)
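One thing that may trip this up on real data: if body_text has missing values, str.contains can return missing values for those rows, and the cast to int may then fail. A minimal sketch, assuming a small made-up dataframe (the sample rows and the second column name are only for illustration), that passes na=False so missing text counts as "no match":

```python
import pandas as pd

# Hypothetical sample data; the None row stands in for a missing value.
df = pd.DataFrame({'body_text': ['Hello world', 'goodbye', None, 'hello again']})

# na=False treats missing text as "no match", so the cast to int is safe.
df['contains_hello'] = df['body_text'].str.contains('Hello', na=False).astype(int)

# Same check, ignoring case.
df['contains_hello_any_case'] = (
    df['body_text'].str.contains('Hello', case=False, na=False).astype(int)
)

print(df)
```

Note that the boolean result is assigned as a new column here. Using it inside df[...] instead would filter the rows, which sounds like what happened in the question.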
If you need to match multiple patterns, use regex with the | ('OR') character. You may also need a 'word boundary' character (\b), depending on your requirements.
Regexr is a good resource if you want to learn more about regex patterns and character classes.
import pandas as pd
df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match', 'high low - dont match', 'oh hi there - match me']})
# body_text
# 0 no matches here
# 1 Hello, this should match <-- we want to match this 'Hello'
# 2 high low - dont match <-- 'hi' exists in 'high', but we don't want to match it
# 3 oh hi there - match me <-- we want to match 'hi' here
df['contains_hello'] = df['body_text'].str.contains(r'Hello|\bhi\b', regex=True).astype(int)
                  body_text  contains_hello
0           no matches here               0
1  Hello, this should match               1
2     high low - dont match               0
3    oh hi there - match me               1
Sometimes it's useful to have a list of words you want to match, to create a regex pattern more easily with a python list comprehension. For example:
match = ['hello', 'hi']
pat = '|'.join([fr'\b{x}\b' for x in match])
# '\bhello\b|\bhi\b' - meaning 'hello' OR 'hi'
df.body_text.str.contains(pat)
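Two details are worth watching with a word list: the lowercase 'hello' in the list will not match 'Hello' unless the match ignores case, and any word containing regex special characters should be escaped first. A rough sketch of one way to handle both, reusing the sample dataframe from above (the names words and contains_greeting are just illustrative):

```python
import re
import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match',
                                 'high low - dont match', 'oh hi there - match me']})

words = ['hello', 'hi']

# re.escape guards against words that contain regex special characters.
pat = '|'.join(fr'\b{re.escape(w)}\b' for w in words)

# case=False lets 'hello' match 'Hello'; na=False treats missing text as no match.
df['contains_greeting'] = (
    df['body_text'].str.contains(pat, case=False, na=False).astype(int)
)

print(df)
```

With this sample data, rows 1 and 3 get a 1 and the others get a 0.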
Upvotes: 1
Reputation: 1
You can use the get_dummies() function in pandas.
Here is the link to documentation.
Upvotes: 0
Reputation: 34657
With textdf as you've defined in your question, try:
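# 'Hello' in str(t) tests whether the text contains "Hello"; t == 'Hello' would only match rows that are exactly "Hello"
textdf['new_column'] = [1 if 'Hello' in str(t) else 0 for t in textdf['body_text']]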
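Since textdf in the question is a two-column selection from traindf, adding a column to it can trigger a SettingWithCopyWarning in some pandas versions. A minimal sketch, assuming made-up rows for traindf and using the column names from the question, that takes an explicit copy first (the other_column and contains_hello names are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the much larger traindf from the question.
traindf = pd.DataFrame({
    'request_title': ['Hello there', 'Need help'],
    'request_text_edit_aware': ['Just saying Hello', None],
    'other_column': [1, 2],
})

# .copy() makes textdf an independent dataframe, so adding a column is unambiguous.
textdf = traindf[['request_title', 'request_text_edit_aware']].copy()

# Vectorized check for the substring; missing text counts as no match.
textdf['contains_hello'] = (
    textdf['request_text_edit_aware'].str.contains('Hello', na=False).astype(int)
)

print(textdf)
```

The original traindf is left untouched; only textdf gains the new column.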
Upvotes: 0