Anton

Reputation: 4815

Count the number of observations that occur per day

I have a pandas dataframe indexed by time. I want to know the total number of observations (i.e. dataframe rows) that happen each day.

Here is my dataframe:

import pandas as pd

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'], 
        'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns = ['date', 'value'])
print(df)

What I want is a dataframe (or series) that looks like this:

   date             value
0  2014-05-01       2
1  2014-05-02       4
2  2014-05-03       2
3  2014-05-04       2

After searching through a bunch of StackOverflow questions, the closest I can get is:

df['date'].groupby(df.index.map(lambda t: t.day))

But that doesn't produce anything of use.

Upvotes: 3

Views: 4080

Answers (4)

exp1orer

Reputation: 12019

Wow, @Jeff wins:

df.resample('D', how='count')

My worse answer:

The first problem is that your date column is strings, not datetimes. Using code from this thread:

import dateutil
df['date'] = df['date'].apply(dateutil.parser.parse)

Then it's trivial, and you had the right idea:

grouped = df.groupby(df['date'].apply(lambda x: x.date()))
grouped['value'].count()
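
On recent pandas versions the same idea reads a little more directly: `pd.to_datetime` parses the whole column in one vectorized call, and the `.dt` accessor exposes the calendar date. A minimal, self-contained sketch using the question's data:

```python
import pandas as pd

# Data from the question, repeated so the snippet runs on its own.
df = pd.DataFrame({
    'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
             '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
             '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
             '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
             '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
    'value': [1] * 10,
})

# Parse all the strings at once, then group on the calendar date.
df['date'] = pd.to_datetime(df['date'])
counts = df.groupby(df['date'].dt.date)['value'].count()
print(counts)
```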

Upvotes: 1

Happy001

Reputation: 6383

As exp1orer mentions, you'll need to convert the string dates to datetime format. Or, if you simply want to count observations and don't care about the date format, you can take the first 10 characters of the date column and then use the value_counts() method. (Personally, I prefer this to groupby + sum for a simple observation count.)

You can achieve what you need with a one-liner:

In [93]: df.date.str[:10].value_counts()
Out[93]: 
2014-05-02    4
2014-05-04    2
2014-05-01    2
2014-05-03    2
dtype: int64
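
Note that value_counts() orders by frequency, not by date (as the output above shows); chaining sort_index() puts the days back in chronological order. A self-contained sketch with the question's data:

```python
import pandas as pd

dates = ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
         '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
         '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
         '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
         '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877']
df = pd.DataFrame({'date': dates})

# Slice the 'YYYY-MM-DD' prefix, count occurrences, then sort by the day itself.
counts = df['date'].str[:10].value_counts().sort_index()
print(counts)
```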

Upvotes: 0

jedwards

Reputation: 30200

I know nothing about pandas, but in Python you could do something like:

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
        'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

import datetime
dates = [datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d') for ts in data['date']]

cnt = {}
for d in dates: cnt[d] = cnt.get(d, 0) + 1

for i, k in enumerate(sorted(cnt)):
    print("%d  %s  %d" % (i,k,cnt[k]))

Which would output:

0  2014-05-01  2
1  2014-05-02  4
2  2014-05-03  2
3  2014-05-04  2

If you didn't care about parsing and reformatting your datetime strings, I suppose something like

dates = [d[0:10] for d in data['date']]

could replace the longer dates=... line, but it seems less robust.
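
The manual dictionary loop above can also be written with `collections.Counter` from the standard library, which does the same tallying with less bookkeeping:

```python
import datetime
from collections import Counter

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
                 '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
                 '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
                 '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
                 '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877']}

# Reduce each timestamp to its 'YYYY-MM-DD' day, as in the answer above.
dates = [datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d')
         for ts in data['date']]

cnt = Counter(dates)  # Counter handles the get-or-zero bookkeeping itself
for i, day in enumerate(sorted(cnt)):
    print("%d  %s  %d" % (i, day, cnt[day]))
```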

Upvotes: 0

Dan Allan

Reputation: 35235

Use resampling. You'll need the date column to be of datetime data type (as is, it contains strings) and you'll need to set it as the index to use resampling.

In [13]: df['date'] = pd.to_datetime(df['date'])

In [14]: df.set_index('date').resample('D', 'count')
Out[14]: 
            value
date             
2014-05-01      2
2014-05-02      4
2014-05-03      2
2014-05-04      2

You can use any arbitrary function, or the built-in convenience functions given as strings, including 'count', 'sum', etc.
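
For readers on pandas 0.18 or later: the `resample('D', 'count')` form has since been removed, and the aggregation is now chained as a method on the resampler. A sketch of the equivalent modern call, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
             '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
             '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
             '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
             '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
    'value': [1] * 10,
})
df['date'] = pd.to_datetime(df['date'])

# On modern pandas, resample('D') returns a Resampler object and the
# aggregation ('count' here) is called as a method on it.
out = df.set_index('date').resample('D').count()
print(out)
```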

Upvotes: 6
