Reputation: 4815
I have a pandas dataframe indexed by time. I want to know the total number of observations (i.e. dataframe rows) that happen each day.
Here is my dataframe:
import pandas as pd
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns = ['date', 'value'])
print(df)
What I want is a dataframe (or series) that looks like this:
date value
0 2014-05-01 2
1 2014-05-02 4
2 2014-05-03 2
3 2014-05-04 2
After reading a bunch of StackOverflow questions, the closest I can get is:
df['date'].groupby(df.index.map(lambda t: t.day))
But that doesn't produce anything of use.
Upvotes: 3
Views: 4080
Reputation: 12019
Wow, @Jeff wins:
df.resample('D').count()
My worse answer:
The first problem is that your date column contains strings, not datetimes. Using code from this thread:
import dateutil
df['date'] = df['date'].apply(dateutil.parser.parse)
Then it's trivial, and you had the right idea:
grouped = df.groupby(df['date'].apply(lambda x: x.date()))
grouped['value'].count()
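With a reasonably recent pandas, the same conversion and grouping can be done without apply, using pd.to_datetime and the .dt accessor (a minor variation on the above, not the original answer's code):

```python
import pandas as pd

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
                 '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
                 '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
                 '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
                 '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
        'value': [1] * 10}
df = pd.DataFrame(data)

# Parse the strings once, then group on the calendar date via the .dt accessor
df['date'] = pd.to_datetime(df['date'])
counts = df.groupby(df['date'].dt.date)['value'].count()
print(counts)
```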
Upvotes: 1
Reputation: 6383
As exp1orer mentions, you'll need to convert the string dates to a datetime format. But if you simply want to count observations and don't care about the date type, you can take the first 10 characters of the date
column and then use the value_counts()
method. (Personally, I prefer this to groupby
+ sum
for simple observation counts.)
You can achieve what you need with a one-liner:
In [93]: df.date.str[:10].value_counts()
Out[93]:
2014-05-02 4
2014-05-04 2
2014-05-01 2
2014-05-03 2
dtype: int64
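Note that value_counts() returns its result in descending order of count, not chronologically. If you want the dates in order, you can chain sort_index() (a small addition to the answer's one-liner):

```python
import pandas as pd

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
                 '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
                 '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
                 '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
                 '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
        'value': [1] * 10}
df = pd.DataFrame(data)

# Slice the 'YYYY-MM-DD' prefix, count occurrences, then sort chronologically
counts = df['date'].str[:10].value_counts().sort_index()
print(counts)
```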
Upvotes: 0
Reputation: 30200
I know nothing about pandas, but in Python you could do something like:
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
import datetime
dates = [datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d') for ts in data['date']]
cnt = {}
for d in dates:
    cnt[d] = cnt.get(d, 0) + 1
for i, k in enumerate(sorted(cnt)):
    print("%d %s %d" % (i, k, cnt[k]))
Which would output:
0 2014-05-01 2
1 2014-05-02 4
2 2014-05-03 2
3 2014-05-04 2
If you didn't care about parsing and reformatting your datetime strings, I suppose something like
dates = [d[0:10] for d in data['date']]
could replace the longer dates=...
line, but it seems less robust.
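The counting loop above can also be written with collections.Counter from the standard library (a sketch of the same pure-Python approach, not pandas):

```python
import datetime
from collections import Counter

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
                 '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
                 '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
                 '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
                 '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
        'value': [1] * 10}

# Reduce each timestamp string to its date portion, then tally with Counter
dates = [datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d')
         for ts in data['date']]
cnt = Counter(dates)
for i, k in enumerate(sorted(cnt)):
    print("%d %s %d" % (i, k, cnt[k]))
```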
Upvotes: 0
Reputation: 35235
Use resampling. You'll need the date column to be of datetime dtype (as is, it contains strings), and you'll need to set it as the index to use resampling.
In [13]: df['date'] = pd.to_datetime(df['date'])
In [14]: df.set_index('date').resample('D').count()
Out[14]:
value
date
2014-05-01 2
2014-05-02 4
2014-05-03 2
2014-05-04 2
You can use any aggregation here: built-in convenience methods such as count and sum, or an arbitrary function passed via agg.
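If you'd rather not set the index, the same daily counting can be done with pd.Grouper (a sketch assuming a reasonably recent pandas; this is an alternative to the answer's resample call, not the answer's own code):

```python
import pandas as pd

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
                 '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
                 '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
                 '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
                 '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
        'value': [1] * 10}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

# Group on a daily frequency via a column, without touching the index
out = df.groupby(pd.Grouper(key='date', freq='D'))['value'].count()
print(out)
```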
Upvotes: 6