tnknepp
tnknepp

Reputation: 6263

Extracting data from Pandas dataframe as dataframe

One of the biggest problems I have experienced in Python's Pandas is the continual defaulting to pandas.core.series.Series type. e.g.

import numpy as np
import pandas as pd

a = pd.DataFrame( np.random.randn(5,5),columns=list('ABCDE') )
b = a.mean(axis=0)

>>> b
    A    0.399677
    B    0.080594
    C    0.060423
    D   -1.206630
    E    0.153359
    dtype: float64

>>> type(b)
<class 'pandas.core.series.Series'>

So, if I try to insert into a new data frame I get all sorts of errors (i.e. dimension mismatch, etc). It seems to me that when I perform an operation on a data frame the output should be a data frame, not a Series. Does anyone have a recommendation on how to use, e.g. df.mean(), and have a data frame returned?

BEGIN EDIT Sorry, I should have given more detail.
I want to selectively average slices of my original data frame, and insert these averaged values into a separate data frame.

# This is how I've been trying to do it
# Using <a> from above
b = pd.DataFrame()

# Select out data from original data frame
tmp = a(a.A>5).mean() # Just an example, this is not really my selection criteria

# Now I want to store these averaged values in my aggregated data frame.  
b = pd.concat( [b,tmp] )

I guess my real question is: How can I average data in one data frame and pass it into another for storage? END EDIT

EDIT Take 2 I have two data sets (both stored as data frames), both of which are time series. Both time series have irregular time stamps: one has a time stamp every ~90s (between hours of 0700 - 2000), the other has one or two time stamps per day (satellite overpass data)). None of the time stamps are regular (i.e. they rarely occur at the same time, and they are very rarely centered on the hour, or half hour, etc.). My goal is to take my high-frequency data and average it centered on the satellite's time stamp (+/- 30min) then store the averaged data in a new data frame. Here is the actual code I have written so far:

# OMI is the satellite data, ~daily resolution
# Pan is surface data, with 90s resolution

# Example data: 
>>> pan.head()
                        hcho     h2o      so2      o3       no2
2010-06-24 14:01:20  0.87784  2.9947      NaN     NaN  0.671104
2010-06-24 14:03:52  0.68877  3.0102      NaN     NaN  0.684615
2010-06-24 14:04:35      NaN     NaN  0.58119  285.76       NaN
2010-06-24 14:05:19  0.75813  3.0218      NaN     NaN  0.693880
2010-06-24 14:06:02      NaN     NaN  0.40973  286.00       NaN

>>> omi.head()
                    ctp  dist           no2        no2std     cf  
2010-06-24 17:51:43    7  23.8  5.179200e+15  1.034600e+15  0.001   
2010-06-26 17:39:34    3   7.0  7.355800e+15  1.158100e+15  0.113   
2010-07-01 17:57:40    9   8.4  5.348300e+15  9.286100e+14  0.040   
2010-07-03 17:45:30    5  32.2  5.285300e+15  8.877800e+14  0.000   

# Code
out = pd.DataFrame()

width = 30 # Defined earlier, input of function
for r in omi.index:
    # Define datetime limits
    d1 = r - dt.timedelta(minutes=width)
    d2 = r + dt.timedelta(minutes=width)
    tmp = pan.truncate(d1,d2).mean(axis=0,skipna=True)

    if tmp.nunique()<>0: # Ensuring there is something in <tmp>
        tmp = pd.DataFrame(tmp,index=[r],columns=pan.columns)
        out = pd.concat([out,tmp],axis=0,ignore_index=False)

Upvotes: 1

Views: 2587

Answers (1)

EdChum
EdChum

Reputation: 394459

You can just construct a DataFrame from the series easily like so:

c = DataFrame(a.mean(axis=0), columns=['mean'])
c

Out[91]:
       mean
A -0.210582
B -0.742551
C  0.347408
D  0.276034
E  0.399468

Still I don't see what this really achieves for you that is better than the original returned Series?

Upvotes: 2

Related Questions