sam
sam

Reputation: 494

getting standard deviation of the values in two different dataframes

I have two DataFrames and I would like to find the standard deviation per rc_id for one of the columns i.e. imapcted_userscolumn in these two dataframes and create a separate column with the name std with their standard deviation value

df1 :

data = {"timestamp":["2022-10-29","2022-10-29","2022-10-29","2022-10-29","2022-10-29","2022-10-29","2022-10-29","2022-10-29","2022-10-29"],
       "rc_id":[296,296,296,296,296,100,100,100,100],
       "impacted_users":[1,87,44,8,5,2,7,11,30]}

df1 = pd.DataFrame(data)
df1 = df1.groupby(["timestamp","rc_id"]).agg({"impacted_users": sum}
                                      ).reset_index()

df1:

    rc_id           timestamp            impacted_users
     296           2022-10-29                  145
     100           2022-10-29                   50

df2 :

    data1 = {"rc_id":[296,296,296,100,100,100],
       "impacted_users":[201,202,216,300,301,350]}

   df2 = pd.DataFrame(data1)
   df2
   

create df2:

rc_id            impacted_users
     296                201
     296                202
     296                216
     100                300
     100                301
     100                350

Expected Output:

 id           timestamp             imapcted_users  std 
 296          2022-10-29 11:00:00      145          27.21   
 100          2022-10-29 11:00:00       50          117.36

What I would like to have is std and put it as a separate columns (just as an example what values I am looking for from these columns):

std(145, 201, 202,216)
std (50,300,301,350)

I am unable to come up with a strategy to get this standard dev. for values from different dataframes. I tried to concat the required values and then get the std by aggregation but I guess there is a better way.

Upvotes: 1

Views: 39

Answers (1)

jezrael
jezrael

Reputation: 863751

IIUC use concat with aggregate std, but because pandas Series.std has default ddof=1 for expected ouput add parameter ddof=0, last append to df1:

df1 = df1.groupby(["timestamp","rc_id"], as_index=False, sort=False)["impacted_users"].sum()
              
df = (df1.join(pd.concat([df1, df2])
                 .groupby('rc_id')['impacted_users'].std(ddof=0).rename('std'), on='rc_id'))
print (df)
    timestamp  rc_id  impacted_users         std
0  2022-10-29    296             145   27.212130
1  2022-10-29    100              50  117.367745

Upvotes: 1

Related Questions