Nastaran Khalili

Reputation: 56

Plot occurrences over time of specific words in a large dataset of texts (tweets) in Python

I need to plot the occurrence of a word over time for a pandas dataframe (time series) with a column of text.

The dataframe looks like this:

index,                date,       ... , text
2020-10-20 20:20:00 , 2020-10-20 ,... , "The text goes here"
.
.
.

What I want is a graph that shows the occurrence of a specific word (for example "here") over time.

Here is what I currently have (it does the job, but it is very inefficient for large data and multiple words):

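import matplotlib.pyplot as plt

# Flag each row: 1 if the text contains the word, 0 otherwise
df['contains_word'] = df['text'].str.contains('word')
df['contains_word'] = df['contains_word'].replace(True, 1)
df['contains_word'] = df['contains_word'].replace(False, 0)

# Sum the flags per date (count() would count every row, matched or not)
g = df.groupby('date').contains_word.sum()
plt.plot(g.index, g, c='r')
plt.xticks(rotation=90)
plt.title('xxx')
plt.show()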

And here is the example output:

[Figure: example output plot]

Upvotes: 1

Views: 461

Answers (1)

Ignacio Valenzuela

Reputation: 137

Your bottleneck seems to be the volume of data rather than the time-series code itself. One option is to run the df['text'].str.contains('word') step in parallel; I'd recommend swifter for that.

import swifter  # importing it registers the .swifter accessor on pandas objects

def contains_word(word, dataframe, column):
    # Return a 0/1 flag per row: 1 if the text in `column` contains `word`.
    # Plain substring test; non-string values (e.g. NaN) are flagged as 0.
    return dataframe[column].swifter.apply(
        lambda text: int(isinstance(text, str) and word in text)
    )

# example for the column text and word "here"
df['contains_word'] = contains_word('here', df, 'text')

If that isn't enough, I'd try to restructure the dataframe so that the required values can be found more efficiently.
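
One possible way to do that, as a rough sketch rather than a tested solution for your data: build one boolean column per word with plain pandas string methods and sum them per date in a single groupby. The helper name daily_word_counts, its parameters, and the tiny sample frame are made up for illustration; it assumes the columns are called "text" and "date" as in the question, and it counts rows containing the word as a case-sensitive substring (so "here" also matches "there").

```python
import pandas as pd

def daily_word_counts(df, words, text_col="text", date_col="date"):
    """Return a frame indexed by date with one column per word, holding
    the number of rows whose text contains that word (hypothetical helper)."""
    # Treat missing text as empty so str.contains never returns NaN
    text = df[text_col].fillna("")
    # One boolean column per word; regex=False means a literal substring match
    flags = pd.DataFrame(
        {word: text.str.contains(word, regex=False) for word in words}
    )
    # Summing booleans per date gives the number of matching rows per day
    return flags.groupby(df[date_col]).sum()

# Made-up sample data, only to show the shape of the input and output
sample = pd.DataFrame({
    "date": ["2020-10-20", "2020-10-20", "2020-10-21"],
    "text": ["The text goes here", "nothing to see", "here and here again"],
})
counts = daily_word_counts(sample, ["here", "text"])
```

Calling counts.plot() should then draw one line per word over time, so several words can share one pass over the data and one figure.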

Upvotes: 1
