Reputation: 1

Python pandas resampling issue / misunderstanding

I am trying to generate historgams of earthquake sequence using pandas.

My input is a CSV file as following :

ID,DATE,LAT,LON,DEPTH,MAG,LOCALITY
ISTerre2020odcbbh,2020-07-18T23:24:03.616341Z,45.426,6.32499,3.56121,1.56979,"MONTGELLAFREY"
ISTerre2020nsbzaa,2020-07-12T23:32:31.159491Z,45.4239,6.32597,1.79717,0.818867,"MONTGELLAFREY"
ISTerre2020lcxxda,2020-06-06T09:29:45.006351Z,45.4126,6.32702,3.7011,1.58432,"MONTGELLAFREY"
ISTerre2020jppugg,2020-05-15T23:30:27.553768Z,45.4288,6.29128,5.03303,1.0121,"LA CHAPELLE"
ISTerre2020flokvv,2020-03-18T02:46:01.877839Z,45.4134,6.38374,3.06686,1.08096,"SAINT-FRANCOIS-LONGCHAMP"
ISTerre2019znoncu,2019-12-28T11:44:51.242507Z,45.4341,6.33249,7.61996,1.26731,"EPIERRE"

I would like to insert, in the dataframe obtained with pandas, the months or days anyway missing in the catalogue (I mean days/months without earthquake) to display empty bars in the histogram for the months without events.

I try to do this with resample('M'), but it does not work, I get this error :

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

Here is an example of my script :

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('catalogue.csv')
df.info()
df["DATE"] = df["DATE"].astype("datetime64")
(df["DATE"].groupby([df["DATE"].dt.year, df["DATE"].dt.month]).count()).plot(kind="bar") #to plot the hisotgram with the missing months

from datetime import datetime
from datetime import timedelta
from dateutil import rrule    
    
data1=df.sort_values('DATE').set_index('DATE')  
month_groups_resample = data1['DATE'].resample('M').count()
ax = month_groups_resample.plot(kind='bar',figsize=(10,5),legend=None)

I have made a lot of differents test around resample without any success. I am sure there is a quite simple way to do that but I am not fluent in python enough.

Hope someone can help me.

Regards Mickael.

Upvotes: 0

Answers (2)

Franck T.

Reputation: 11

Here is a matplotlib only solution:

import datetime
from collections import Counter
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

with open("catalog.csv", "r") as _f:
    # skip first header line
    _f.readline()
    dates = [datetime.datetime.strptime(t.split(",")[1][:7], "%Y-%m") for t in _f.readlines()]
count_dict = Counter(dates)
x = count_dict.keys()
y = [count_dict[k] for k in keys]
    
fig, ax1 = plt.subplots(figsize=(9, 7))
ax1.xaxis.set_major_locator(mdates.YearLocator())
ax1.xaxis.set_minor_locator(mdates.MonthLocator())
ax1.xaxis.set_major_formatter(mdates.DateFormatter("%m\n%Y"))
ax1.xaxis.set_minor_formatter(mdates.DateFormatter("%m"))
ax1.yaxis.get_major_locator().set_params(integer=True)
ax1.set_xlabel("months of measurements")
ax1.set_ylabel("count of event")
fig.suptitle("MY MAIN TITLE")
rects = ax1.bar(x, y)
plt.show()

Upvotes: 1

Tojur

Reputation: 191

You can first round the dates, create an index without missing dates, and the reindex the full dataframe with this new index

import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from datetime import timedelta
from dateutil import rrule

df = pd.read_csv('catalogue.csv')
df.info()
df["DATE"] = df["DATE"].astype("datetime64")

data1 = df.sort_values('DATE').set_index('DATE')
data1.index = data1.index.round(freq='D')

# Index with all the dates
date_range = pd.date_range(
    start=data1.index[0], end=data1.index[-1],
    freq='D', closed='left')

# Fill the original dataframe. By default insert NaNs
data1 = data1.reindex(date_range)

# I'm using ID as representative of the number of events
df_num_events = data1.ID.groupby(level=0).count()
df_num_events.plot()

Upvotes: 0

Python pandas resampling issue / misunderstanding

Answers (2)

Related Questions