Reputation: 604
I have a column of text data and a list of individual words. I want to match the words against the text column and count how many times they appear in each row.
Here's an example:
wordlist = ['alaska', 'france', 'italy']
test = pd.read_csv('vacation text.csv')
test.head(4)
Index Text
0 'he's going to alaska and france'
1 'want to go to italy next summer'
2 'germany is great!'
4 'her parents are from france and alaska but she lives in alaska'
I tried using the following code:
test['count'] = pd.Series(test.text.str.count(r).sum() for r in wordlist)
And this code:
test['count'] = pd.Series(test.text.str.contains(r).sum() for r in wordlist)
The problem is that the sums don't seem to accurately reflect the number of listed words in the text column. I noticed this when, using my example again, I added germany to my list and the sum didn't change from 0 to 1.
Ultimately I want my data to look like:
Index Text Count
0 'he's going to alaska and france' 2
1 'want to go to italy next summer' 1
2 'germany is great!' 0
4 'her parents are from france and alaska but she lives in alaska' 3
Does anyone know of any other approaches?
Upvotes: 0
Views: 50
Reputation: 77027
One way would be to use str.count with the words joined into a single regex alternation:
In [792]: test['Text'].str.count('|'.join(wordlist))
Out[792]:
0 2
1 1
2 0
3 3
Name: Text, dtype: int64
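Note that the joined pattern matches substrings, so a listed word would also be counted when it appears inside a longer word. Here is a minimal sketch of a stricter variant, assuming whole-word, case-insensitive matching is what you want. The sample frame is rebuilt by hand from the question, and the `pattern` name is only for illustration:

```python
import re
import pandas as pd

wordlist = ['alaska', 'france', 'italy']

# Sample data copied from the question (assumes the column is named 'Text').
test = pd.DataFrame({'Text': [
    "he's going to alaska and france",
    "want to go to italy next summer",
    "germany is great!",
    "her parents are from france and alaska but she lives in alaska",
]})

# Escape each word so regex metacharacters are taken literally, then wrap
# the alternation in \b anchors so only whole words are counted.
pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'

# Series.str.count returns the number of pattern matches in each row.
test['Count'] = test['Text'].str.count(pattern, flags=re.IGNORECASE)
print(test)
```

If substring matches are fine for your data, the plain '|'.join(wordlist) version is enough.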
Another way is to sum the individual word counts:
In [802]: pd.DataFrame({w:test['Text'].str.count(w) for w in wordlist}).sum(1)
Out[802]:
0 2
1 1
2 0
3 3
dtype: int64
Details
In [804]: '|'.join(wordlist)
Out[804]: 'alaska|france|italy'
In [805]: pd.DataFrame({w:test['Text'].str.count(w) for w in wordlist})
Out[805]:
alaska france italy
0 1 1 0
1 0 0 1
2 0 0 0
3 2 1 0
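As for why the attempts in the question looked wrong: calling .sum() on each word's counts collapses all rows into one total per word, so the resulting Series has one entry per word rather than one per row. Assigning it to the column then lines those totals up with rows 0, 1, 2 by index. A small sketch that shows the difference, assuming the sample rows from the question and a column named 'Text' (the names `per_word` and `per_row` are just for illustration):

```python
import pandas as pd

wordlist = ['alaska', 'france', 'italy']
test = pd.DataFrame({'Text': [
    "he's going to alaska and france",
    "want to go to italy next summer",
    "germany is great!",
    "her parents are from france and alaska but she lives in alaska",
]})

# What the question's code builds: one total per word across all rows.
per_word = pd.Series([test['Text'].str.count(w).sum() for w in wordlist],
                     index=wordlist)
print(per_word)

# What was wanted: keep the per-row counts for each word and add them
# together across words (axis=1), giving one number per row.
per_row = pd.DataFrame({w: test['Text'].str.count(w) for w in wordlist}).sum(axis=1)
print(per_row)
```

The first Series is indexed by word, the second by row, which is the shape needed for test['Count'].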
Upvotes: 1