Reputation: 3728
I have a DataFrame that contains index and text columns.
For example:
index | text
1 | "I have a pen, but I lost it today."
2 | "I have pineapple and pen, but I lost it today."
Now I have a long list, and I want to match each of the words in text against the list.
Let's say:
long_list = ['pen', 'pineapple']
I want to create a FunctionTransformer that matches the words in long_list against each word of the column value and, if there are matches, returns the count.
index | text | count
1 | "I have a pen, but I lost it today." | 1
2 | "I have pineapple and pen, but I lost it today." | 2
I did it this way:
def count_words(df):
    long_list = ['pen', 'pineapple']
    count = 0
    for c in df['tweet_text']:
        if c in long_list:
            count = count + 1
    df['count'] = count
    return df

count_word = FunctionTransformer(count_words, validate=False)
For reference, this is how I wrote my other FunctionTransformer:
def convert_twitter_datetime(df):
    df['hour'] = pd.to_datetime(df['created_at'], format='%a %b %d %H:%M:%S +0000 %Y').dt.strftime('%H').astype(int)
    return df

convert_datetime = FunctionTransformer(convert_twitter_datetime, validate=False)
Upvotes: 0
Views: 1193
Reputation: 3224
Inspired by @Quang Hoang's answer
import re
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

y = ['pen', 'pineapple']

def count_strings(X, y):
    # Group the alternation so \b applies to every word, not only the first and last
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, y)))
    return X['text'].str.count(pattern)

string_transformer = FunctionTransformer(count_strings, kw_args={'y': y})
df['count'] = string_transformer.fit_transform(X=df)
results in
text count
1 "I have a pen, but I lost it today." 1
2 "I have pineapple and pen, but I lost it today." 2
And for the following df2:
#df2
text
1 "I have a pen, but I lost it today. pen pen"
2 "I have pineapple and pen, but I lost it today."
We get
string_transformer.transform(X=df2)
#result
1 3
2 2
Name: text, dtype: int64
This shows that we converted the function to an sklearn-style object. To abstract this even further, we can pass the column name as a keyword argument to count_strings.
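Passing the column name through kw_args could look like the sketch below. The parameter names words and column and the sample DataFrame are illustrative choices, not part of the code above. The sketch assumes pandas and scikit-learn are installed.

```python
import re

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_strings(X, words, column="text"):
    """Count whole-word occurrences of any of `words` in X[column]."""
    # Escape each word and group the alternation so \b applies to every word.
    pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in words))
    return X[column].str.count(pattern)


# `words` and `column` (hypothetical names) are forwarded via kw_args,
# so the same function can be reused for any text column.
string_transformer = FunctionTransformer(
    count_strings,
    kw_args={"words": ["pen", "pineapple"], "column": "text"},
)

# Sample data taken from the question.
df = pd.DataFrame(
    {
        "text": [
            "I have a pen, but I lost it today.",
            "I have pineapple and pen, but I lost it today.",
        ]
    },
    index=[1, 2],
)

counts = string_transformer.fit_transform(df)
print(counts.tolist())  # [1, 2]
```

To count in a different column, only the kw_args dictionary changes; the function itself stays the same.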
Upvotes: 0
Reputation: 26676
Join the elements of the list with |. Find the matching elements with .str.findall() and apply .str.len() to get the count:
p='|'.join(long_list)
df=df.assign(count=(df.text.str.findall(p)).str.len())
text count
0 "I have a pen, but I lost it today." 1
1 "I have pineapple and pen, but I lost it today." 2
Upvotes: 0
Reputation: 150765
Pandas has str.count:
# matching any of the words; the group makes \b apply to each word
pattern = r'\b(?:{})\b'.format('|'.join(long_list))
df['count'] = df.text.str.count(pattern)
Output:
index text count
0 1 "I have a pen, but I lost it today." 1
1 2 "I have pineapple and pen, but I lost it today." 2
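One thing to watch with patterns built from a joined word list: \b only guards every word if the alternation is wrapped in a group. The sketch below compares the ungrouped and grouped forms using plain re. The sample sentence is made up for illustration and is not from the question.

```python
import re

long_list = ['pen', 'pineapple']
sample = "I bought a pencil and a pen."  # made-up sentence for illustration

# Ungrouped: parsed as (\bpen) | (pineapple\b), so "pencil" also counts.
ungrouped = r'\b{}\b'.format('|'.join(long_list))

# Grouped: \b applies on both sides of every alternative.
# re.escape protects words that contain regex metacharacters.
grouped = r'\b(?:{})\b'.format('|'.join(map(re.escape, long_list)))

print(len(re.findall(ungrouped, sample)))  # 2
print(len(re.findall(grouped, sample)))    # 1
```

For the two example rows in the question both forms give the same counts; the difference only shows up when a listed word is a prefix or suffix of a longer word in the text.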
Upvotes: 2