Reputation: 1
I am trying to generate historgams of earthquake sequence using pandas.
My input is a CSV file as following :
ID,DATE,LAT,LON,DEPTH,MAG,LOCALITY
ISTerre2020odcbbh,2020-07-18T23:24:03.616341Z,45.426,6.32499,3.56121,1.56979,"MONTGELLAFREY"
ISTerre2020nsbzaa,2020-07-12T23:32:31.159491Z,45.4239,6.32597,1.79717,0.818867,"MONTGELLAFREY"
ISTerre2020lcxxda,2020-06-06T09:29:45.006351Z,45.4126,6.32702,3.7011,1.58432,"MONTGELLAFREY"
ISTerre2020jppugg,2020-05-15T23:30:27.553768Z,45.4288,6.29128,5.03303,1.0121,"LA CHAPELLE"
ISTerre2020flokvv,2020-03-18T02:46:01.877839Z,45.4134,6.38374,3.06686,1.08096,"SAINT-FRANCOIS-LONGCHAMP"
ISTerre2019znoncu,2019-12-28T11:44:51.242507Z,45.4341,6.33249,7.61996,1.26731,"EPIERRE"
I would like to insert, in the dataframe obtained with pandas, the months or days anyway missing in the catalogue (I mean days/months without earthquake) to display empty bars in the histogram for the months without events.
I try to do this with resample('M'), but it does not work, I get this error :
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Here is an example of my script :
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('catalogue.csv')
df.info()
df["DATE"] = df["DATE"].astype("datetime64")
(df["DATE"].groupby([df["DATE"].dt.year, df["DATE"].dt.month]).count()).plot(kind="bar") #to plot the hisotgram with the missing months
from datetime import datetime
from datetime import timedelta
from dateutil import rrule
data1=df.sort_values('DATE').set_index('DATE')
month_groups_resample = data1['DATE'].resample('M').count()
ax = month_groups_resample.plot(kind='bar',figsize=(10,5),legend=None)
I have made a lot of differents test around resample without any success. I am sure there is a quite simple way to do that but I am not fluent in python enough.
Hope someone can help me.
Regards Mickael.
Upvotes: 0
Views: 86
Reputation: 11
Here is a matplotlib only solution:
import datetime
from collections import Counter
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
with open("catalog.csv", "r") as _f:
# skip first header line
_f.readline()
dates = [datetime.datetime.strptime(t.split(",")[1][:7], "%Y-%m") for t in _f.readlines()]
count_dict = Counter(dates)
x = count_dict.keys()
y = [count_dict[k] for k in keys]
fig, ax1 = plt.subplots(figsize=(9, 7))
ax1.xaxis.set_major_locator(mdates.YearLocator())
ax1.xaxis.set_minor_locator(mdates.MonthLocator())
ax1.xaxis.set_major_formatter(mdates.DateFormatter("%m\n%Y"))
ax1.xaxis.set_minor_formatter(mdates.DateFormatter("%m"))
ax1.yaxis.get_major_locator().set_params(integer=True)
ax1.set_xlabel("months of measurements")
ax1.set_ylabel("count of event")
fig.suptitle("MY MAIN TITLE")
rects = ax1.bar(x, y)
plt.show()
Upvotes: 1
Reputation: 191
You can first round the dates, create an index without missing dates, and the reindex the full dataframe with this new index
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from datetime import timedelta
from dateutil import rrule
df = pd.read_csv('catalogue.csv')
df.info()
df["DATE"] = df["DATE"].astype("datetime64")
data1 = df.sort_values('DATE').set_index('DATE')
data1.index = data1.index.round(freq='D')
# Index with all the dates
date_range = pd.date_range(
start=data1.index[0], end=data1.index[-1],
freq='D', closed='left')
# Fill the original dataframe. By default insert NaNs
data1 = data1.reindex(date_range)
# I'm using ID as representative of the number of events
df_num_events = data1.ID.groupby(level=0).count()
df_num_events.plot()
Upvotes: 0