juanmac

Reputation: 319

Pandas: agg() gives me 'Series' objects are mutable, thus they cannot be hashed

I'm trying to agg() a DataFrame while subsetting one of its columns at the same time:

indi = pd.DataFrame({"PONDERA":[1,2,3,4], "ESTADO": [1,1,2,2]})

empleo = indi.agg(ocupados = (indi.PONDERA[indi["ESTADO"]==1], sum) )

but I'm getting 'Series' objects are mutable, thus they cannot be hashed

I want to sum the values of "PONDERA" only when "ESTADO" == 1.

Expected output:

  ocupados
0     3

I'm trying to imitate R's summarise() function, so I want to do it in one step and aggregate some other columns too.

In R it would be something like:

empleo <- indi %>% 
  summarise(poblacion = sum(PONDERA),
            ocupados = sum(PONDERA[ESTADO == 1]))

Is this even the correct approach?

Thank you all in advance.

Upvotes: 0

Views: 106

Answers (4)

rhug123

Reputation: 8768

Here are two different ways you can get the scalar value 3.

option1 = indi.loc[indi['ESTADO'].eq(1),'PONDERA'].sum()
option2 = indi['PONDERA'].where(indi['ESTADO'].eq(1)).sum()

However, your expected output shows this value in a dataframe. To do this, you can create a new dataframe with the desired column name "ocupados".

outputdf = pd.DataFrame({'ocupados':[option1]})
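One small difference between the two options is worth noting. This is a minimal sketch using the question's data; the names `mask` and `masked` are introduced here only for illustration. `where` keeps the original length and fills the non-matching rows with NaN, so the column becomes float and the sum comes back as 3.0 rather than 3.

```python
import pandas as pd

indi = pd.DataFrame({"PONDERA": [1, 2, 3, 4], "ESTADO": [1, 1, 2, 2]})

# Boolean mask for the rows to keep (illustrative name).
mask = indi["ESTADO"].eq(1)

# Option 1: select the matching rows, then sum. The dtype stays integer.
option1 = indi.loc[mask, "PONDERA"].sum()

# Option 2: replace non-matching rows with NaN, then sum (NaN is skipped).
masked = indi["PONDERA"].where(mask)
option2 = masked.sum()

print(option1, option2)
```

If the integer type matters downstream, option 1 is probably the safer choice.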

Based on the comment you provided, is this what you are looking for?

(indi.agg(poblacion = ("PONDERA", 'sum'),
          ocupados = ('PONDERA',lambda x: x.where(indi['ESTADO'].eq(1)).sum())))
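If the goal is a single-row frame with one column per statistic, closer to what R's summarise() returns, one possible sketch is to compute each scalar separately and assemble the frame by hand. This is an alternative to agg, not a description of how agg works, and `mask` is an illustrative name.

```python
import pandas as pd

indi = pd.DataFrame({"PONDERA": [1, 2, 3, 4], "ESTADO": [1, 1, 2, 2]})

# Rows where ESTADO == 1 (illustrative name).
mask = indi["ESTADO"].eq(1)

# One column per statistic, one row in total.
empleo = pd.DataFrame({
    "poblacion": [indi["PONDERA"].sum()],           # sum over all rows
    "ocupados": [indi.loc[mask, "PONDERA"].sum()],  # sum over ESTADO == 1 only
})

print(empleo)
```

More statistics can be added as further keys in the dictionary.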

Upvotes: 0

SeaBean

Reputation: 23217

A bit fancy, but the output is exactly the format you want, and the syntax is similar to what you tried:

Use DataFrameGroupBy.agg() instead of DataFrame.agg():

empleo = (indi.loc[indi['ESTADO']==1]
              .groupby('ESTADO')
              .agg(ocupados=('PONDERA', 'sum'))
              .reset_index(drop=True)
         )

Result of print(empleo):

   ocupados
0         3

Upvotes: 0

sophocles

Reputation: 13821

Another option is to use loc to filter the DataFrame to the rows where ESTADO == 1 and sum the PONDERA column:

indi.loc[indi.ESTADO==1, ['PONDERA']].sum()

Thanks to @Henry's input.

Upvotes: 0

Georgina Skibinski

Reputation: 13387

Generally, agg takes a function (or a function name) as its argument, not a Series. In your case, though, it is simpler to separate the filtering from the summation.

One of the options would be the following:

empleo = indi.query("ESTADO == 1")[["PONDERA"]].sum()

(Use single square brackets, ["PONDERA"], to get a single number instead of a pd.Series.)
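A short sketch of the difference between the two bracket styles, using the question's data. The variable names `single` and `double` are illustrative.

```python
import pandas as pd

indi = pd.DataFrame({"PONDERA": [1, 2, 3, 4], "ESTADO": [1, 1, 2, 2]})

# Single brackets select a Series, so sum() returns a scalar.
single = indi.query("ESTADO == 1")["PONDERA"].sum()

# Double brackets select a one-column DataFrame, so sum() returns a Series
# indexed by column name.
double = indi.query("ESTADO == 1")[["PONDERA"]].sum()

print(single)
print(double)
```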

Upvotes: 1
