Mark K

Reputation: 9358

Python, Pandas to calculate average with replicated rows

I want to duplicate each row according to the value in column 'n', and then replace the value in column 'v' with the average (v divided by n). For example, the row with id 'B', n = 2 and v = 13 should become two rows, each with v = 6.5.

I am following the example in "Replicating rows in a pandas data frame by a column value".

import pandas as pd
import numpy as np

df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [1, 2, 3],
'v' : [ 10, 13, 8]
})
df2 = df.loc[np.repeat(df.index.values,df.n)]

#pd.__version__ 0.20.3
#np.__version__ 1.15.0

But it raises an error:

Traceback (most recent call last):
  File "C:\Python27\Working Scripts\pv.py", line 14, in <module>
df2 = df.loc[np.repeat(df.index.values, df.n)]
File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat
return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 61, in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 41, in _wrapit
result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

What is going wrong here, and how can I correct it? Thank you. (Other pandas and numpy scripts work fine on this computer.)

Upvotes: 0

Views: 229

Answers (1)

IMCoins

Reputation: 3306

We usually answer only one question per thread, but you probably didn't know that. Your first question has already been answered in the comments: explicitly casting to int32 solved the problem.
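For reference, the cast mentioned above could look roughly like the sketch below. It assumes the TypeError comes from a build where NumPy's indexing integer is 32-bit, so the int64 repeat counts cannot be cast "safely"; the variable name counts is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n': [1, 2, 3],
    'v': [10, 13, 8],
})

# Assumption: the repeat counts must fit the platform's index type.
# Casting them to int32 explicitly avoids the unsafe int64 -> int32 cast.
counts = df['n'].values.astype('int32')

# Repeat each index label n times, then select those rows.
df2 = df.loc[np.repeat(df.index.values, counts)]

print(df2)
```

On a 64-bit build the cast should be harmless, since int32 counts can be widened safely.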

As for the average, you can reassign the values like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [1, 2, 3],
'v' : [ 10, 13, 8]
})
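# Repeat each index label n times, then select those rows
df2 = df.loc[np.repeat(df.index.values, df.n)]
# Replace v with its per-row share (v / n)
df2.loc[:, 'v'] = df2['v'] / df2['n']

print(df2)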

#   id  n          v
# 0  A  1  10.000000
# 1  B  2   6.500000
# 1  B  2   6.500000
# 2  C  3   2.666667
# 2  C  3   2.666667
# 2  C  3   2.666667

I replaced the line df2['v'] = df2['v'] / df2['n'] with the .loc form, which is the recommended way to assign to a selection in pandas.

As stated in the comments, this raises a SettingWithCopyWarning. As the linked post explains, this warning can produce false positives, so as long as you know what you are doing, you should be fine. The warning exists because pandas cannot always tell whether df2 is a view on df or an independent copy, so an assignment to df2 might not do what you expect. Here df2 is meant to be a new DataFrame, so the warning is harmless.

tl;dr from the link: you can disable the warning with:

pd.options.mode.chained_assignment = None # default='warn'
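Rather than silencing the warning globally, an alternative worth considering is to make df2 an explicit copy, so that pandas has no ambiguity to warn about. The following is a minimal sketch of that idea, not the approach from the linked post; it keeps the int32 cast discussed for the original error.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n': [1, 2, 3],
    'v': [10, 13, 8],
})

# int32 cast as discussed for the original TypeError (harmless on 64-bit builds).
counts = df['n'].values.astype('int32')

# .copy() makes df2 an independent DataFrame, so later assignments
# clearly target df2 and never the original df.
df2 = df.loc[np.repeat(df.index.values, counts)].copy()

# Replace v with its per-row share (v / n); the column becomes float.
df2['v'] = df2['v'] / df2['n']

print(df2)
```

A quick sanity check is that the shares of each id add back up to the original v, for example with df2.groupby('id')['v'].sum().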

Upvotes: 1
