TristanMatthews

Reputation: 2571

Pandas Time Series and groupby

[Edited to state the root problem more clearly. It behaves differently under NumPy 1.8, as dmvianna points out.]

I have a DataFrame that has timestamps and other data. In the end I would like to avoid using a formatted time as the index, because it interferes with matplotlib's 3D plotting. I also want to perform a groupby to populate some flag fields. This combination is causing a number of weird errors. The first two examples below work as I would expect. Once I set an index after calling pd.to_datetime, the groupby starts throwing errors.

runs as expected:

import pandas as pd
import numpy as np

df = pd.DataFrame({'time':np.random.randint(100000, size=1000),
                    'type':np.random.randint(10, size=1000), 
                    'value':np.random.rand(1000)})

df['high'] = 0

def high_low(group):
    if group.value.mean() > .5:
        group.high = 1
    return group

grouped = df.groupby('type')
df = grouped.apply(high_low)

works fine:

df = pd.DataFrame({'time':np.random.randint(100000, size=1000),
                    'type':np.random.randint(10, size=1000), 
                    'value':np.random.rand(1000)})

df.time = pd.to_datetime(df.time, unit='s')

df['high'] = 0

def high_low(group):
    if group.value.mean() > .5:
        group.high = 1
    return group

grouped = df.groupby('type')
df = grouped.apply(high_low)

throws an error (datetime column set as the index): ValueError: Shape of passed values is (3, 1016), indices imply (3, 1000)

df = pd.DataFrame({'time':np.random.randint(100000, size=1000),
                    'type':np.random.randint(10, size=1000), 
                    'value':np.random.rand(1000)})

df.time = pd.to_datetime(df.time, unit='s')
df = df.set_index('time')

df['high'] = 0

def high_low(group):
    if group.value.mean() > .5:
        group.high = 1
    return group

grouped = df.groupby('type')
df = grouped.apply(high_low)

throws the same error (epoch column set as the index instead): ValueError: Shape of passed values is (3, 1016), indices imply (3, 1000)

df = pd.DataFrame({'time':np.random.randint(100000, size=1000),
                    'type':np.random.randint(10, size=1000), 
                    'value':np.random.rand(1000)})

df['epoch'] = df.time
df.time = pd.to_datetime(df.time, unit='s')
df = df.set_index('time')
df = df.set_index('epoch')

df['high'] = 0

def high_low(group):
    if group.value.mean() > .5:
        group.high = 1
    return group

grouped = df.groupby('type')
df = grouped.apply(high_low)

Anyone know what I'm missing / doing wrong?

Upvotes: 1

Views: 1202

Answers (1)

dmvianna

Reputation: 15718

Instead of using pd.to_datetime, I would use np.datetime64. It works in ordinary columns and offers the same functionality you would expect from a DatetimeIndex (np.datetime64 is the building block for DatetimeIndex).

import numpy as np
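# np.datetime64() itself converts a single scalar, so cast the whole column via its dtype
data['time2'] = data.time.values.astype('datetime64[s]')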

Check the Docs

This would also lead to the same result:

import pandas as pd
data['time2'] = pd.to_datetime(data.time, unit='s')
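
As a self-contained sketch of the snippet above: `data` here is a hypothetical sample frame shaped like the one in the question (it is not defined anywhere in this thread), and the converted values go into a new column so the integer epoch column stays available for plotting. I have only reasoned about this against a recent pandas, not the versions mentioned here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; the column names mirror the question's layout.
data = pd.DataFrame({'time': np.random.randint(100000, size=1000),
                     'type': np.random.randint(10, size=1000),
                     'value': np.random.rand(1000)})

# Interpret the integers as seconds since the Unix epoch and store the
# result in a new, regular column (not the index).
data['time2'] = pd.to_datetime(data['time'], unit='s')

# 'time' keeps its integer dtype, 'time2' is a datetime64 column.
print(data.dtypes)
```

Keeping both columns means the groupby can run on a frame with the default integer index, while `time2` is there whenever datetime functionality is needed.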

Note, though, that I'm using pandas 0.12.0 and NumPy 1.8.0. NumPy 1.7 has the issues referred to in the comments below.
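
For the flag itself, one option is to skip returning modified groups from apply altogether. The sketch below swaps in groupby(...).transform('mean'), which is a different technique from the high_low function in the question, and it assumes a reasonably recent pandas. I have not checked it against pandas 0.12 or NumPy 1.7.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': np.random.randint(100000, size=1000),
                   'type': np.random.randint(10, size=1000),
                   'value': np.random.rand(1000)})

# Same setup as the failing case in the question: datetime values as the index.
df['time'] = pd.to_datetime(df['time'], unit='s')
df = df.set_index('time')

# transform('mean') returns one value per original row: the mean of that
# row's group, in the same row order as df.
group_mean = df.groupby('type')['value'].transform('mean')

# Flag every row whose group mean exceeds 0.5. Assigning the raw array
# avoids index alignment, since the time index may hold duplicate timestamps.
df['high'] = (group_mean > 0.5).astype(int).to_numpy()
```

Because the result is computed row by row and assigned positionally, pandas never has to stitch the groups back together, which is the step where the shape mismatch in the question is reported.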

Upvotes: 2
