Mei Tei

Reputation: 55

Python Pandas going through an entire column and checking if it contains a certain str

I'm kinda new to python dataframes, so this might sound really easy. I have a column called 'body_text' in a dataframe and I want to see if each row of body_text contains the word "Hello". And if it does, I want to make another column that has 1 or 0 as the values.

I tried using str.contains("Hello") but that made an error where it only selected the rows that had "Hello" and attempted to put it in another column. I tried looking at other solutions that just ended up in more errors - for loops, and str in str.

textdf = traindf[['request_title','request_text_edit_aware']]
traindf is a huge dataframe that I'm only pulling 2 columns from

Upvotes: 1

Views: 3677

Answers (3)

Chris Adams

Reputation: 18647

If your match is case-sensitive, use Series.str.contains and chain on .astype to cast as int:

df['contains_hello'] = df['body_text'].str.contains('Hello').astype(int)

If the match should be case-insensitive, add the case=False argument:

df['contains_hello'] = df['body_text'].str.contains('Hello', case=False).astype(int)
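One thing to watch for: if body_text has missing values, str.contains returns NaN for those rows and the .astype(int) cast fails. A minimal sketch of one way around that, assuming a small made-up dataframe (the sample rows and the second column name are only for illustration), is to pass na=False so missing rows count as "no match":

```python
import pandas as pd

# Hypothetical sample data; None stands in for a missing body_text value.
df = pd.DataFrame({'body_text': ['Hello there', 'nothing here', None, 'hello again']})

# na=False treats missing values as "no match", so the cast to int is safe.
df['contains_hello'] = df['body_text'].str.contains('Hello', na=False).astype(int)

# Same check, ignoring case (column name is illustrative).
df['contains_hello_any_case'] = (
    df['body_text'].str.contains('Hello', case=False, na=False).astype(int)
)

print(df)
```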

Update

If you need to match multiple patterns, use a regex with the | ('OR') character. Depending on your requirements, you may also need a 'word boundary' character (\b).

Regexr is a good resource if you want to learn more about regex patterns and character classes.

Example

df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match', 'high low - dont match', 'oh hi there - match me']})

#                      body_text
#    0           no matches here   
#    1  Hello, this should match   <--  we want to match this 'Hello'
#    2     high low - dont match   <-- 'hi' exists in 'high', but we don't want to match it
#    3    oh hi there - match me   <--  we want to match 'hi' here

df['contains_hello'] = df['body_text'].str.contains(r'Hello|\bhi\b', regex=True).astype(int)

                  body_text  contains_hello
0           no matches here               0
1  Hello, this should match               1
2     high low - dont match               0
3    oh hi there - match me               1

Sometimes it's easier to keep the words you want to match in a list and build the regex pattern from it with a Python list comprehension. For example:

match = ['hello', 'hi']    
pat = '|'.join([fr'\b{x}\b' for x in match])
# '\bhello\b|\bhi\b'  -  meaning 'hello' OR 'hi'

df.body_text.str.contains(pat)
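Putting the list-based pattern together with the example dataframe from earlier, a rough sketch might look like this. Note that re.escape and case=False are additions here, not part of the snippet above: re.escape is only a precaution in case a word in the list contains regex metacharacters, and case=False is what lets lowercase 'hello' match 'Hello'.

```python
import re
import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here',
                                 'Hello, this should match',
                                 'high low - dont match',
                                 'oh hi there - match me']})

match = ['hello', 'hi']

# Wrap each word in word boundaries and join the alternatives with '|'.
# re.escape guards against words that contain regex metacharacters.
pat = '|'.join(fr'\b{re.escape(x)}\b' for x in match)

# case=False so that 'hello' in the list also matches 'Hello' in the text.
df['contains_hello'] = df['body_text'].str.contains(pat, case=False).astype(int)

print(df)
```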

Upvotes: 1

MountainKing

Reputation: 1

You can use the get_dummies() function in pandas.

Here is the link to documentation.

Upvotes: 0

hd1

Reputation: 34657

With textdf as you've defined in your question, try:

textdf['new_column'] = [1 if 'Hello' in t else 0 for t in textdf['body_text']]

Upvotes: 0
