I have a dataset
Column1 Column2 Column3 ....
2020/05/02 She heard the gurgling water (not relevant)
2020/05/02 The water felt delightful
2020/05/03 Another instant and I shall never again see the sun, this water, that gorge!
2020/05/04 Fire would have been her choice.
2020/05/04 Everywhere you go in world are water fountains.
...
2020/05/31 She spelled "mother" several times.
I would like to plot the frequency of the word 'water' over time. How can I do this?
What I have tried is defining a pattern:
pattern=['water']
and applying re.search:
df['Column2'] = df['Column2'].apply(lambda x: re.search(pattern,x).group(1))
to select the word water in Column2.
To group by date and count them, I would use
df.groupby(['Column1','Column2'])['Column1'].agg({'Frequency':'count'})
and to plot them I would use matplotlib (using a bar plot):
df['Column1'].value_counts().plot.bar()
This is what I have tried, with a lot of mistakes.
Upvotes: 1
Views: 484
Reputation: 26676
Chain df.assign and str.count to extract the word count for each row. Then group by the date column, sum the counts, and plot with either .plot.bar() or .plot(kind='bar').
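import matplotlib.pyplot as plt
# count 'water' in each row, sum the counts per date, and draw a bar chart
# (column names follow the question's dataset: Column1 = date, Column2 = text)
(df.assign(count=df['Column2'].str.count('water'))).groupby('Column1')['count'].sum().plot.bar()
# equivalent: (df.assign(count=df['Column2'].str.count('water'))).groupby('Column1')['count'].sum().plot(kind='bar')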
plt.ylabel('Count')
plt.xlabel('Date')
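One caveat with str.count('water') is that it counts substrings and is case-sensitive, so "waterfall" would be counted and "Water" would not. Below is a minimal sketch of a whole-word, case-insensitive count. The helper name count_word and the sample sentences are assumptions made for illustration; they are not part of the question's data.

```python
import re

import pandas as pd


def count_word(texts: pd.Series, word: str) -> pd.Series:
    """Count whole-word, case-insensitive occurrences of `word` in each string."""
    # \b marks a word boundary; re.escape guards against regex metacharacters
    pattern = rf"\b{re.escape(word)}\b"
    return texts.str.count(pattern, flags=re.IGNORECASE)


# hypothetical sample sentences, chosen to show the difference
sample = pd.Series([
    "The water felt delightful",
    "The waterfall was loud",
    "Water is wet",
])

whole_word = count_word(sample, "water")  # ignores "waterfall", accepts "Water"
plain = sample.str.count("water")         # plain substring, case-sensitive
```

Whether the stricter match is wanted depends on the data: for the sentences in the question both approaches give the same counts.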
Upvotes: 1
Reputation: 19610
You can use Python's built-in str.count(substring) method to count the occurrences in each row, then sum the resulting frequency column for each day.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
df = pd.DataFrame({'Column1':['2020/05/02','2020/05/02','2020/05/03','2020/05/04','2020/05/04'],
'Column2':["She heard the gurgling water", "The water felt delightful",
"Another instant and I shall never again see the sun, this water, that gorge!",
"Fire would have been her choice.",
"Everywhere you go in world are water fountains"]})
# convert strings to dates; the format is given explicitly because
# infer_datetime_format is deprecated as of pandas 2.0
df['Column1'] = pd.to_datetime(df['Column1'], format='%Y/%m/%d')
pattern = "water"
df['Frequency'] = df['Column2'].apply(lambda x: x.count(pattern))
# sum the frequency of the word 'water' over each separate day
ax = df['Frequency'].groupby(df['Column1'].dt.to_period('D')).sum().plot(kind='bar')
# force integer yaxis labels
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.tick_params(axis='x', which='major', labelsize=6)
# Rotate tick marks on x-axis
plt.setp(ax.get_xticklabels(), rotation = 90)
plt.show()
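Grouping by day only produces bars for dates that actually occur in the data, so a gap such as 2020/05/05 to 2020/05/30 is not visible on the axis. If the goal is to see the frequency through time, one option is to reindex onto a continuous daily range and fill the missing days with zero. The sketch below assumes the Column1/Column2 layout from the question; the function names daily_word_counts and plot_daily_counts are made up for illustration.

```python
import pandas as pd


def daily_word_counts(df, word, date_col="Column1", text_col="Column2"):
    """Sum occurrences of `word` per calendar day, with 0 for days without rows."""
    dates = pd.to_datetime(df[date_col], format="%Y/%m/%d")
    # note: str.count treats `word` as a regular expression
    counts = df[text_col].str.count(word)
    per_day = counts.groupby(dates).sum()
    # build every day between the first and last date, then fill the gaps
    full_range = pd.date_range(per_day.index.min(), per_day.index.max(), freq="D")
    return per_day.reindex(full_range, fill_value=0)


def plot_daily_counts(daily):
    """Draw the daily counts as a line plot and return the matplotlib Axes."""
    import matplotlib.pyplot as plt  # noqa: F401  (needed by pandas plotting)
    ax = daily.plot(kind="line", marker="o")
    ax.set_xlabel("Date")
    ax.set_ylabel("Count")
    return ax


# the rows shown in the question
df = pd.DataFrame({
    "Column1": ["2020/05/02", "2020/05/02", "2020/05/03",
                "2020/05/04", "2020/05/04", "2020/05/31"],
    "Column2": ["She heard the gurgling water",
                "The water felt delightful",
                "Another instant and I shall never again see the sun, this water, that gorge!",
                "Fire would have been her choice.",
                "Everywhere you go in world are water fountains.",
                "She spelled 'mother' several times."],
})

daily = daily_word_counts(df, "water")
```

Calling plot_daily_counts(daily) followed by plt.show() would then draw one point per day, including the days with a count of zero.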
Upvotes: 1
Reputation: 3908
Setup
import pandas as pd
df = pd.DataFrame({
"Column1": ["2020/05/02", "2020/05/02", "2020/05/03", "2020/05/04", "2020/05/04", "2020/05/31"],
"Column2": ["She heard the gurgling water water", "The water felt delightful", "Another instant and I shall never again see the sun, this water, that gorge!", "Fire would have been her choice.", "Everywhere you go in world are water fountains.", "She spelled 'mother' several times."]
})
Logic
# for each string, get the number of times a phrase appears
df['phrase_count'] = df['Column2'].str.count('water')
# plot the results
df.groupby('Column1')['phrase_count'].sum().plot(kind='bar')
Results
Upvotes: 2