Reputation: 121
I have a question about counting words with Python (pandas).
My DataFrame has three columns: id, text, and word.
Here is an example table.
[DataFrame]
import pandas as pd

df = pd.DataFrame({
    "id": [
        "100",
        "200",
        "300"
    ],
    "text": [
        "The best part of Zillow is you can search/view thousands of home within a click of a button without even stepping out of your door.At the comfort of your home you can get all the details such as the floor plan, tax history, neighborhood, mortgage calculator, school ratings etc. and also getting in touch with the contact realtor is just a click away and you are scheduled for the home tour!As a first time home buyer, this website greatly helped me to study the market before making the right choice.",
        "I love all of the features of the Zillow app, especially the filtering options and the feature that allows you to save customized searches.",
        "Data is not updated spontaneously. Listings are still shown as active while the Mls shows pending or closed."
    ],
    "word": [
        "[best, word, door, subway, rain]",
        "[item, best, school, store, hospital]",
        "[gym, mall, pool, playground]",
    ]
})
I have already split the text to build the word list for each row.
Now, for each row, I want to count how many times each word in that row's word list appears in that row's text.
This is the result I want:
| id | word dict |
| -- | ----------------------------------------------- |
| 100| {best: 1, word: 0, door: 1, subway: 0, rain: 0} |
| 200| {item: 0, best: 0, school: 0, store: 0, hospital: 0} |
| 300| {gym: 0, mall: 0, pool: 0, playground: 0} |
How can I do this?
Upvotes: 0
Views: 248
Reputation: 18426
Since your word column holds strings rather than lists, convert it to a list first. Strip the brackets and split on the comma plus space, so the words carry no leading whitespace:
df['word'] = df['word'].str[1:-1].str.split(', ')
Now you can use apply with axis=1 and a dict comprehension that counts each word in the row's text:
df[['text', 'word']].apply(lambda row: {item: row['text'].count(item) for item in row['word']}, axis=1)
Output:
0    {'best': 1, 'word': 0, 'door': 1, 'subway': 0,...
1    {'item': 0, 'best': 0, 'school': 0, 'store': 0...
2    {'gym': 0, 'mall': 0, 'pool': 0, 'playground': 0}
dtype: object
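To make the shape of the requested result concrete, here is a minimal end-to-end sketch that goes from the string-valued word column to an id / word dict table. It assumes the word strings always look like "[a, b, c]"; the shortened sample texts and the names parse_word_list and result are made up for illustration and are not from the question.

```python
import pandas as pd

# Small stand-in for the question's frame (texts shortened for brevity).
df = pd.DataFrame({
    "id": ["100", "200"],
    "text": [
        "The best part is you never step out of your door.",
        "I love the filtering options.",
    ],
    "word": ["[best, word, door]", "[item, best, school]"],
})

def parse_word_list(raw):
    """Turn a string like '[a, b, c]' into ['a', 'b', 'c']."""
    return [w.strip() for w in raw.strip("[]").split(",")]

# Count substring occurrences of each listed word in the same row's text.
df["word dict"] = [
    {w: text.count(w) for w in parse_word_list(raw)}
    for text, raw in zip(df["text"], df["word"])
]

result = df[["id", "word dict"]]
print(result.to_dict("records"))
```

A plain list comprehension over zip is used here instead of apply with axis=1; both give the same dicts, and the comprehension avoids building a Series object for every row.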
Upvotes: 1
Reputation: 5746
We can use re to extract all of the words from the word string. Note that the pattern \w+ matches runs of letters, digits, and underscores, so an entry containing a space or a hyphen would be split into separate words.
Then define a function that returns a dict with the count of each word in the list, and apply it row by row to fill a new column in df.
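import re

def count_words(row):
    # Pull the individual words out of a string such as "[best, word, door]".
    words = re.findall(r'(\w+)', row['word'])
    # str.count returns the number of non-overlapping substring matches.
    return {word: row['text'].count(word) for word in words}

df['word_counts'] = df.apply(count_words, axis=1)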
Outputs
id ... word_counts
0 100 ... {'best': 1, 'word': 0, 'door': 1, 'subway': 0,...
1 200 ... {'item': 0, 'best': 0, 'school': 0, 'store': 0...
2 300 ... {'gym': 0, 'mall': 0, 'pool': 0, 'playground': 0}
[3 rows x 4 columns]
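One caveat worth keeping in mind: str.count counts substrings and is case-sensitive, so a word like "rain" is also counted inside "train", and "Best" is not counted as "best". If whole-word, case-insensitive counts are wanted, a sketch along the following lines may help. The function name count_whole_words and the sample sentence are hypothetical, and treating \w+ runs as words is an assumption about what counts as a word.

```python
import re
from collections import Counter

def count_whole_words(text, words):
    """Count case-insensitive whole-word occurrences of each word in text."""
    # Tokenize once; \w+ yields runs of letters, digits, and underscores.
    tokens = Counter(re.findall(r"\w+", text.lower()))
    # A Counter returns 0 for tokens it has never seen.
    return {w: tokens[w.lower()] for w in words}

text = "The rain delayed the train. Best of all, the best seats were free."
print(count_whole_words(text, ["rain", "best", "seat"]))
# -> {'rain': 1, 'best': 2, 'seat': 0}
# By contrast, text.count("rain") is 2 because it also matches inside "train".
```

It can be applied per row in the same way as the counting functions above, passing the row's text and its parsed word list.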
Upvotes: 1