Reputation: 56
I need to plot the occurrence of a word over time for a pandas dataframe (time series) with a column of text.
The dataframe looks like this:
index, date, ... , text
2020-10-20 20:20:00 , 2020-10-20 ,... , "The text goes here"
.
.
.
What I want is a graph that shows the occurrence of a specific word (for example "here") over time.
Here is what I currently have (it works, but it is very inefficient for large data and multiple words):
import matplotlib.pyplot as plt
# 1 if the text contains the word, 0 otherwise (missing text counts as 0)
df['contains_word'] = df['text'].str.contains('word', na=False).astype(int)
# sum() counts the matching rows per date; count() would count every row
g = df.groupby('date').contains_word.sum()
plt.plot(g.index, g , c='r')
plt.xticks(rotation=90)
plt.title('xxx')
plt.show()
And here is the example output:
Upvotes: 1
Views: 461
Reputation: 137
Your issue seems to be data volume rather than the time-series code itself. One option is to run the df['text'].str.contains('word') step in parallel. I'd recommend swifter for that.
import swifter  # importing swifter registers the .swifter accessor on pandas objects
def contains_word(text, word):
    # 1 if word occurs in text, else 0 (non-string values such as NaN give 0)
    return int(isinstance(text, str) and word in text)
# example for the column text and word "here"
# swifter picks the execution strategy itself and may fall back to a plain pandas apply
df['contains_word'] = df['text'].swifter.apply(lambda t: contains_word(t, 'here'))
If that isn't enough, I'd try to prepare the dataframe so that the required values are more efficient to find.
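As a rough sketch of what such preparation could look like: build one boolean column per word in a single pass, then sum per date. The helper name count_words_per_day, its parameters, and the tiny example frame are made up for illustration; only the 'date' and 'text' column names come from the question. It uses plain substring matching, so "there" also counts as a hit for "here", just like the original str.contains approach.

```python
import pandas as pd

def count_words_per_day(df, words, text_col="text", date_col="date"):
    """Return a DataFrame indexed by date with one column per word,
    holding the number of rows whose text contains that word."""
    flags = pd.DataFrame(
        {
            # regex=False: plain substring match; na=False: missing text is no match
            word: df[text_col].str.contains(word, regex=False, na=False)
            for word in words
        }
    )
    # Summing booleans counts the True values within each date group
    return flags.groupby(df[date_col]).sum()

# Tiny made-up frame, for illustration only
example = pd.DataFrame(
    {
        "date": ["2020-10-20", "2020-10-20", "2020-10-21"],
        "text": ["The text goes here", "nothing to see", "here and there"],
    }
)
counts = count_words_per_day(example, ["here", "text"])
```

With matplotlib installed, counts.plot() should then draw one line per word over the dates, which avoids rebuilding the contains_word column for every word.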
Upvotes: 1