Reputation: 14801

Setting values with pandas.DataFrame

Having this DataFrame:

import pandas

dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)
df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)

df

I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).

When I try:

df['foo'][0] = df['foo'][0].resample('D').sum().dropna()

That seems to work well:

However, I get a warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The question is, how should I do this instead?

Notes

Things I have tried but do not work (resampling or not, the assignment raises an exception):

df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']

A bit more information about the data (in case it is relevant):

The real DataFrame has more columns in the multi-index. Not all of them necessarily integers, but more generally numerical and categorical. The index is unique (i.e.: there is only one row with a given index value).
The real DataFrame has, of course, many more rows in it (thousands).
There are not necessarily only two columns in the DataFrame and there may be more than 1 columns containing a Series type. Columns usually contain series, categorical data and numerical data as well. Any single column is always single-typed (either numerical, or categorical, or series).
The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).

Using Python 3.5.1 and Pandas 0.18.1.

Upvotes: 8

Answers (3)

tmthydvnprt

Reputation: 10748

Hierarchical data in pandas

It really seems like you should consider restructure your data to take advantage of pandas features such as MultiIndexing and DateTimeIndex. This will allow you to still operate on a index in the typical way while being able to select on multiple columns across the hierarchical data (a,b, andbar).

Restructured Data

import pandas as pd

# Define Index
dates = pd.date_range('2016-01-01', periods=5, freq='H')
# Define Series
s = pd.Series([0, 1, 2, 3, 4], index=dates)

# Place Series in Hierarchical DataFrame
heirIndex = pd.MultiIndex.from_arrays([1,2,8], names=['a','b', 'bar'])
df = pd.DataFrame(s, columns=heirIndex)

print df

a                    1
b                    2
bar                  8
2016-01-01 00:00:00  0
2016-01-01 01:00:00  1
2016-01-01 02:00:00  2
2016-01-01 03:00:00  3
2016-01-01 04:00:00  4

Resampling

With the data in this format, resampling becomes very simple.

# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna()

print df_resampled

a            1
b            2
bar          8
2016-01-01  10

Update (from data description)

If the data has variable length Series each with a different index and non-numeric categories that is ok. Let's make an example:

# Define Series
dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)

# Define Series
dates2 = pandas.date_range('2016-01-14', periods=6, freq='H')
s2 = pandas.Series([-200, 10, 24, 30, 40,100], index=dates2)
# Define DataFrames
df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([1,2,8,'cat1'], names=['a','b', 'bar','c']))
df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([2,5,5,'cat3'], names=['a','b', 'bar','c']))

df = pd.concat([df1, df2])
print df

a                      1      2
b                      2      5
bar                    8      5
c                   cat1   cat3
2016-01-01 00:00:00  0.0    NaN
2016-01-01 01:00:00  1.0    NaN
2016-01-01 02:00:00  2.0    NaN
2016-01-01 03:00:00  3.0    NaN
2016-01-01 04:00:00  4.0    NaN
2016-01-14 00:00:00  NaN -200.0
2016-01-14 01:00:00  NaN   10.0
2016-01-14 02:00:00  NaN   24.0
2016-01-14 03:00:00  NaN   30.0
2016-01-14 04:00:00  NaN   40.0
2016-01-14 05:00:00  NaN  100.0

The only issues is that after resampling. You will want to use how='all' while dropping na rows like this:

# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna(how='all')

print df_resampled

a              1    2
b              2    5
bar            8    5
c           cat1 cat3
2016-01-01  10.0  NaN
2016-01-14   NaN  4.0

Upvotes: 1

Dark Matter

Reputation: 200

Simply set df.is_copy = False before asignment of new value.

Upvotes: 1

user2285236

Reputation:

This should work:

df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()

Pandas is complaining about chained indexing but when you don't do it that way it's facing problems assigning whole series to a cell. With iat you can force something like that. I don't think it would be a preferable thing to do, but seems like a working solution.

Upvotes: 3